DPDK-dev Archive on lore.kernel.org
* [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64
@ 2019-06-30 16:21 Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (81 more replies)
  0 siblings, 82 replies; 163+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
allows the CPU core to enter a low-power state until it is woken up by an
update to the memory location being polled, thus reducing the cache and
memory transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a do-while loop
with rte_pause, so there are no performance differences.

For Arm platforms, use of WFE can be configured using the CONFIG_RTE_USE_WFE
option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (SEV) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_USE_WFE should be enabled depending on the performance on the
target platforms.

Gavin Hu (5):
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  spinlock: use wfe to reduce contention on aarch64
  config: add WFE config entry for aarch64

 config/arm/meson.build                             |   1 +
 config/common_armv8a_linux                         |   6 +
 .../common/include/arch/arm/rte_pause_64.h         | 143 +++++++++++++++++++++
 .../common/include/arch/arm/rte_spinlock.h         |  25 ++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  20 +++
 .../common/include/generic/rte_spinlock.h          |   2 +-
 .../common/include/generic/rte_ticketlock.h        |   4 +-
 lib/librte_ring/rte_ring_c11_mem.h                 |   5 +-
 lib/librte_ring/rte_ring_generic.h                 |   4 +-
 9 files changed, 203 insertions(+), 7 deletions(-)

-- 
2.7.4



* [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
@ 2019-06-30 16:21 ` Gavin Hu
  2019-06-30 20:27   ` Stephen Hemminger
  2019-07-01  9:58   ` Pavan Nikhilesh Bhagavatula
  2019-06-30 16:21 ` [dpdk-dev] [RFC 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (80 subsequent siblings)
  81 siblings, 2 replies; 163+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 .../common/include/arch/arm/rte_pause_64.h         | 143 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  20 +++
 2 files changed, 163 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..0095da6 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -17,6 +17,149 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_USE_WFE
+#define rte_wait_until_equal_relaxed(addr, expected) do {\
+		typeof(*addr) tmp;  \
+		if (__builtin_constant_p((expected))) \
+			do { \
+				if (sizeof(*(addr)) == 16)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxrh  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 32)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxr  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 64)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxr  %x0, %1\n"  \
+						"cmp	%x0, %x2\n"  \
+						"bne	1b\n"  \
+						: "=&r" (tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory"); \
+			} while (0); \
+		else \
+			do { \
+				if (sizeof(*(addr)) == 16)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxrh  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 32)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxr  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 64)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxr  %x0, %1\n"  \
+						"cmp	%x0, %x2\n"  \
+						"bne	1b\n"  \
+						: "=&r" (tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+		} while (0); \
+} while (0)
+
+#define rte_wait_until_equal_acquire(addr, expected) do {\
+		typeof(*addr) tmp;  \
+		if (__builtin_constant_p((expected))) \
+			do { \
+				if (sizeof(*(addr)) == 16)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxrh  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 32)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxr  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 64)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxr  %x0, %1\n"  \
+						"cmp	%x0, %x2\n"  \
+						"bne	1b\n"  \
+						: "=&r" (tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory"); \
+			} while (0); \
+		else \
+			do { \
+				if (sizeof(*(addr)) == 16)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxrh  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 32)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxr  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 64)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxr  %x0, %1\n"  \
+						"cmp	%x0, %x2\n"  \
+						"bne	1b\n"  \
+						: "=&r" (tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+		} while (0); \
+} while (0)
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..c115b61 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -20,4 +20,24 @@
  */
 static inline void rte_pause(void);
 
+#if !defined(RTE_USE_WFE)
+#define rte_wait_until_equal_relaxed(addr, expected) do {\
+		rte_pause();	\
+	} while (*(addr) != (expected))
+
+#ifdef RTE_USE_C11_MEM_MODEL
+#define rte_wait_until_equal_acquire(addr, expected) do {\
+		rte_pause();	\
+	} while (__atomic_load_n((addr), __ATOMIC_ACQUIRE) != (expected))
+#else
+#define rte_wait_until_equal_acquire(addr, expected) do {\
+		do {\
+			rte_pause(); \
+		} while (*(addr) != (expected)); \
+		rte_smp_rmb(); \
+	} while (0)
+#endif
+#endif
+
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4



* [dpdk-dev] [RFC 2/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-06-30 16:21 ` Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (79 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

While using a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

Running ticketlock_autotest on ThunderX2 with different numbers of cores
and depths of rings showed performance gains of 3% to 8%.
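
The polling that this patch replaces can be illustrated with a minimal
ticket lock sketch. It is not the rte_ticketlock implementation: the names
ticketlock_sketch, ticket_lock_sketch, ticket_unlock_sketch and
cpu_relax_sketch are hypothetical, and the GCC/Clang __atomic builtins are
assumed to be available. The loop in ticket_lock_sketch is the wait that
rte_wait_until_equal_acquire stands in for.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): only a compiler barrier here. */
static inline void cpu_relax_sketch(void)
{
	__asm__ __volatile__("" ::: "memory");
}

struct ticketlock_sketch {
	uint16_t current; /* ticket currently being served */
	uint16_t next;    /* next ticket to hand out */
};

static inline void ticket_lock_sketch(struct ticketlock_sketch *tl)
{
	/* Take a ticket, then poll until that ticket is being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	while (__atomic_load_n(&tl->current, __ATOMIC_ACQUIRE) != me)
		cpu_relax_sketch();
}

static inline void ticket_unlock_sketch(struct ticketlock_sketch *tl)
{
	uint16_t served = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	/* The lock must be held: at least one ticket is outstanding. */
	assert(served != __atomic_load_n(&tl->next, __ATOMIC_RELAXED));

	/* Serve the next ticket; release pairs with the acquire load above. */
	__atomic_store_n(&tl->current, (uint16_t)(served + 1), __ATOMIC_RELEASE);
}
```

Every waiter spins on the same `current` field, which is why the cost of
that loop matters under contention.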

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index 191146f..6820f01 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -64,8 +64,8 @@ static inline __rte_experimental void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	if (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
+		rte_wait_until_equal_acquire(&tl->s.current, me);
 }
 
 /**
-- 
2.7.4



* [dpdk-dev] [RFC 3/5] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-06-30 16:21 ` " Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 4/5] spinlock: use wfe to reduce contention " Gavin Hu
                   ` (78 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

Instead of polling for the tail to be updated, use the WFE instruction.

A performance gain of 50% to 70% was measured by running ring_perf_autotest
on ThunderX2.
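
The wait in question can be sketched as follows: with multiple producers
(or consumers), each one publishes its tail update only after the tail has
reached the value it started from, so that earlier operations complete
first. This is a simplified illustration, not the librte_ring code;
headtail_sketch, update_tail_sketch and cpu_relax_sketch are hypothetical
names, and the GCC/Clang __atomic builtins are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): only a compiler barrier here. */
static inline void cpu_relax_sketch(void)
{
	__asm__ __volatile__("" ::: "memory");
}

struct headtail_sketch {
	uint32_t head;
	uint32_t tail;
};

static inline void
update_tail_sketch(struct headtail_sketch *ht, uint32_t old_val,
		   uint32_t new_val, int single)
{
	assert(ht != NULL);

	/* With several producers/consumers, wait for the ones that
	 * preceded us to publish their tail update first.
	 */
	if (!single)
		while (__atomic_load_n(&ht->tail, __ATOMIC_RELAXED) != old_val)
			cpu_relax_sketch();

	/* Publish our own update. */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```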

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

---
 lib/librte_ring/rte_ring_c11_mem.h | 5 +++--
 lib/librte_ring/rte_ring_generic.h | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..f1de79c 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		if (unlikely(ht->tail != old_val))
+			rte_wait_until_equal_relaxed(&ht->tail, old_val);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..bb0dce0 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		if (unlikely(ht->tail != old_val))
+			rte_wait_until_equal_relaxed(&ht->tail, old_val);
 
 	ht->tail = new_val;
 }
-- 
2.7.4



* [dpdk-dev] [RFC 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (2 preceding siblings ...)
  2019-06-30 16:21 ` [dpdk-dev] [RFC 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-06-30 16:21 ` " Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 5/5] config: add WFE config entry for aarch64 Gavin Hu
                   ` (77 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

While acquiring a spinlock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

A 20% performance gain was measured by running spinlock_autotest on 14
isolated cores of ThunderX2.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++++++
 .../common/include/generic/rte_spinlock.h          |  2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..b7e8521 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,31 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"sevl\n"
+		"1:	 wfe\n"
+		"2:	 ldaxr   %w0, %1\n"
+		"cbnz   %w0, 1b\n"
+		"stxr   %w0, %w2, %1\n"
+		"cbnz   %w0, 2b\n"
+		: "=&r" (tmp), "+Q"(sl->locked)
+		: "r" (1)
+		: "cc", "memory");
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 87ae7a4..cf4f15b 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl);
 
-#ifdef RTE_FORCE_INTRINSICS
+#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-- 
2.7.4



* [dpdk-dev] [RFC 5/5] config: add WFE config entry for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (3 preceding siblings ...)
  2019-06-30 16:21 ` [dpdk-dev] [RFC 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-06-30 16:21 ` Gavin Hu
  2019-06-30 20:29 ` [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Stephen Hemminger
                   ` (76 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
It can be enabled selectively based on the performance benchmarking.
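
A note on how a disabled-by-default option interacts with `#ifdef`: a macro
that is defined with the value 0 still counts as defined. The sketch below
assumes that the build system emits a `#define ... 0` line for an option set
to 0 (an assumption about the generated header, modelled here with the
hypothetical macro USE_WFE_SKETCH) and shows that `#ifdef` and `#if` then
disagree.

```c
#include <assert.h>

/* Hypothetical stand-in for a generated config header entry. */
#define USE_WFE_SKETCH 0

static_assert(USE_WFE_SKETCH == 0, "sketch assumes the option is emitted as 0");

/* Returns 1 if an #ifdef test would select the WFE path. */
static inline int wfe_selected_by_ifdef_sketch(void)
{
#ifdef USE_WFE_SKETCH
	return 1; /* taken: the macro is defined, even though its value is 0 */
#else
	return 0;
#endif
}

/* Returns 1 if an #if test would select the WFE path. */
static inline int wfe_selected_by_if_sketch(void)
{
#if USE_WFE_SKETCH
	return 1;
#else
	return 0; /* taken: the macro evaluates to 0 */
#endif
}
```

Under that assumption, either testing the value with `#if` or leaving the
macro undefined when the option is off keeps the default path unchanged.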

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 config/arm/meson.build     | 1 +
 config/common_armv8a_linux | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 6fa06a1..939d60e 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
 impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
 
 dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
+dpdk_conf.set('RTE_USE_WFE', 0)
 
 if not dpdk_conf.get('RTE_ARCH_64')
 	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
index 72091de..ae87a87 100644
--- a/config/common_armv8a_linux
+++ b/config/common_armv8a_linux
@@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
 
 CONFIG_RTE_FORCE_INTRINSICS=y
 
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs.
+# Calling these APIs puts the cores into a low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_USE_WFE=n
+
 # Maximum available cache line size in arm64 implementations.
 # Setting to maximum available cache line size in generic config
 # to address minimum DMA alignment across all arm64 implementations.
-- 
2.7.4



* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-06-30 20:27   ` Stephen Hemminger
  2019-07-01  7:16     ` Gavin Hu (Arm Technology China)
  2019-07-01  9:58   ` Pavan Nikhilesh Bhagavatula
  1 sibling, 1 reply; 163+ messages in thread
From: Stephen Hemminger @ 2019-06-30 20:27 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd

On Mon,  1 Jul 2019 00:21:12 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> +#ifdef RTE_USE_WFE
> +#define rte_wait_until_equal_relaxed(addr, expected) do {\
> +		typeof(*addr) tmp;  \
> +		if (__builtin_constant_p((expected))) \
> +			do { \
> +				if (sizeof(*(addr)) == 16)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxrh  %w0, %1\n"  \
> +						"cmp	%w0, %w2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r"(tmp)  \
> +						: "Q"(*addr), "i"(expected)  \
> +						: "cc", "memory");  \
> +				else if (sizeof(*(addr)) == 32)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxr  %w0, %1\n"  \
> +						"cmp	%w0, %w2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r"(tmp)  \
> +						: "Q"(*addr), "i"(expected)  \
> +						: "cc", "memory");  \
> +				else if (sizeof(*(addr)) == 64)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxr  %x0, %1\n"  \
> +						"cmp	%x0, %x2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r" (tmp)  \
> +						: "Q"(*addr), "i"(expected)  \
> +						: "cc", "memory"); \
> +			} while (0); \
> +		else \
> +			do { \
> +				if (sizeof(*(addr)) == 16)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxrh  %w0, %1\n"  \
> +						"cmp	%w0, %w2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r"(tmp)  \
> +						: "Q"(*addr), "r"(expected)  \
> +						: "cc", "memory");  \
> +				else if (sizeof(*(addr)) == 32)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxr  %w0, %1\n"  \
> +						"cmp	%w0, %w2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r"(tmp)  \
> +						: "Q"(*addr), "r"(expected)  \
> +						: "cc", "memory");  \
> +				else if (sizeof(*(addr)) == 64)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxr  %x0, %1\n"  \
> +						"cmp	%x0, %x2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r" (tmp)  \
> +						: "Q"(*addr), "r"(expected)  \
> +						: "cc", "memory");  \
> +		} while (0); \
> +} while (0)

That is a hot mess.
Macros are harder to maintain and offer no benefit over inline functions.


* Re: [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (4 preceding siblings ...)
  2019-06-30 16:21 ` [dpdk-dev] [RFC 5/5] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-06-30 20:29 ` Stephen Hemminger
  2019-07-01  9:12   ` Gavin Hu (Arm Technology China)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 " Gavin Hu
                   ` (75 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Stephen Hemminger @ 2019-06-30 20:29 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd

On Mon,  1 Jul 2019 00:21:11 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> DPDK has multiple use cases where the core repeatedly polls a location in
> memory. This polling results in many cache and memory transactions.
> 
> The Arm architecture provides the WFE (Wait For Event) instruction, which
> allows the CPU core to enter a low-power state until it is woken up by an
> update to the memory location being polled, thus reducing the cache and
> memory transactions.
> 
> x86 has the PAUSE hint instruction to reduce such overhead.
> 
> The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
> for a memory location to become equal to a given value'.
> 
> For non-Arm platforms, these APIs are just wrappers around a do-while loop
> with rte_pause, so there are no performance differences.
> 
> For Arm platforms, use of WFE can be configured using the CONFIG_RTE_USE_WFE
> option. It is disabled by default.
> 
> Currently, use of WFE is supported only for aarch64 platforms. armv7
> platforms do support the WFE instruction, but they require explicit wake-up
> events (SEV) and are less performant.
> 
> Testing shows that performance varies across different platforms, with
> some showing degradation.
> 
> CONFIG_RTE_USE_WFE should be enabled depending on the performance on the
> target platforms.

How does this work if process is preempted?


* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-06-30 20:27   ` Stephen Hemminger
@ 2019-07-01  7:16     ` Gavin Hu (Arm Technology China)
  2019-07-01  7:43       ` Thomas Monjalon
  0 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-01  7:16 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa Nagarahalli, nd, gaetan.rivet,
	Gavin Hu (Arm Technology China)

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Monday, July 1, 2019 4:28 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; jerinj@marvell.com;
> hemant.agrawal@nxp.com; bruce.richardson@intel.com;
> chaozhu@linux.vnet.ibm.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
> 
> On Mon,  1 Jul 2019 00:21:12 +0800
> Gavin Hu <gavin.hu@arm.com> wrote:
> 
> > +#ifdef RTE_USE_WFE
> > +#define rte_wait_until_equal_relaxed(addr, expected) do {\
> > +		typeof(*addr) tmp;  \
> > +		if (__builtin_constant_p((expected))) \
> > +			do { \
> > +				if (sizeof(*(addr)) == 16)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxrh  %w0, %1\n"  \
> > +						"cmp	%w0, %w2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r"(tmp)  \
> > +						: "Q"(*addr), "i"(expected)  \
> > +						: "cc", "memory");  \
> > +				else if (sizeof(*(addr)) == 32)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxr  %w0, %1\n"  \
> > +						"cmp	%w0, %w2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r"(tmp)  \
> > +						: "Q"(*addr), "i"(expected)  \
> > +						: "cc", "memory");  \
> > +				else if (sizeof(*(addr)) == 64)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxr  %x0, %1\n"  \
> > +						"cmp	%x0, %x2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r" (tmp)  \
> > +						: "Q"(*addr), "i"(expected)  \
> > +						: "cc", "memory"); \
> > +			} while (0); \
> > +		else \
> > +			do { \
> > +				if (sizeof(*(addr)) == 16)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxrh  %w0, %1\n"  \
> > +						"cmp	%w0, %w2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r"(tmp)  \
> > +						: "Q"(*addr), "r"(expected)  \
> > +						: "cc", "memory");  \
> > +				else if (sizeof(*(addr)) == 32)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxr  %w0, %1\n"  \
> > +						"cmp	%w0, %w2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r"(tmp)  \
> > +						: "Q"(*addr), "r"(expected)  \
> > +						: "cc", "memory");  \
> > +				else if (sizeof(*(addr)) == 64)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxr  %x0, %1\n"  \
> > +						"cmp	%x0, %x2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r" (tmp)  \
> > +						: "Q"(*addr), "r"(expected)  \
> > +						: "cc", "memory");  \
> > +		} while (0); \
> > +} while (0)
> 
> That is a hot mess.
> Macros are harder to maintain and offer no benefit over inline functions.
During our internal review, I once used C11 _Generic to generalize the API to take different types of arguments.
That makes the API look much simpler and better, but it imposes a hard requirement on C11 and gcc 4.9+ (see https://gcc.gnu.org/wiki/C11Status).
That means Gaetan's patch, shown below, would have to be reverted; otherwise there are compile errors.
$ git show ea7726a6
commit ea7726a6ee4b2b63313c4a198522a8dcea70c13d
Author: Gaetan Rivet <gaetan.rivet@6wind.com>
Date:   Thu Jul 20 14:27:53 2017 +0200

    net/failsafe: fix build on FreeBSD 10 with GCC 4.8

diff --git a/drivers/net/failsafe/Makefile b/drivers/net/failsafe/Makefile
index 32aaaa2..d516d36 100644
--- a/drivers/net/failsafe/Makefile
+++ b/drivers/net/failsafe/Makefile
@@ -50,7 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_PMD_FAILSAFE) += failsafe_flow.c
 # No exported include files

 # Basic CFLAGS:
-CFLAGS += -std=c11 -Wextra
+CFLAGS += -std=gnu99 -Wextra


* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-07-01  7:16     ` Gavin Hu (Arm Technology China)
@ 2019-07-01  7:43       ` Thomas Monjalon
  2019-07-02 14:07         ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Thomas Monjalon @ 2019-07-01  7:43 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China)
  Cc: Stephen Hemminger, dev, jerinj, hemant.agrawal, bruce.richardson,
	chaozhu, Honnappa Nagarahalli, nd, gaetan.rivet

01/07/2019 09:16, Gavin Hu (Arm Technology China):
> From: Stephen Hemminger <stephen@networkplumber.org>
> > Gavin Hu <gavin.hu@arm.com> wrote:
> > 
> > > +#ifdef RTE_USE_WFE
> > > +#define rte_wait_until_equal_relaxed(addr, expected) do {\
[...]
> > That is a hot mess.
> > Macros are harder to maintain and offer no benefit over inline functions.
> 
> During our internal review, I once used C11 _Generic to generalize the API to take different types of arguments.

Gavin, the question is about macros versus functions.
Please, could you convert it into an inline function?
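
A minimal sketch of what an inline-function form might look like for the
32-bit relaxed case. It reuses the sevl/wfe/ldxr sequence from the patch on
aarch64 and falls back to a polling loop elsewhere; the names
wait_until_equal_relaxed_32_sketch and cpu_relax_sketch are hypothetical,
and the fallback assumes the GCC/Clang __atomic builtins. With one
fixed-width function per operand size, no sizeof or __builtin_constant_p
dispatch is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): only a compiler barrier here. */
static inline void cpu_relax_sketch(void)
{
	__asm__ __volatile__("" ::: "memory");
}

static inline void
wait_until_equal_relaxed_32_sketch(volatile uint32_t *addr, uint32_t expected)
{
	assert(addr != NULL);
#if defined(__aarch64__)
	uint32_t tmp;

	/* sevl makes the first wfe fall through; ldxr arms the exclusive
	 * monitor so that a later write to *addr wakes the core again.
	 */
	__asm__ __volatile__(
		"sevl\n"
		"1:	wfe\n"
		"ldxr	%w0, %1\n"
		"cmp	%w0, %w2\n"
		"bne	1b\n"
		: "=&r"(tmp)
		: "Q"(*addr), "r"(expected)
		: "cc", "memory");
#else
	/* Generic fallback: poll with a pause between reads. */
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		cpu_relax_sketch();
#endif
}
```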





* Re: [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64
  2019-06-30 20:29 ` [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Stephen Hemminger
@ 2019-07-01  9:12   ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-01  9:12 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa Nagarahalli, nd

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Monday, July 1, 2019 4:30 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; jerinj@marvell.com;
> hemant.agrawal@nxp.com; bruce.richardson@intel.com;
> chaozhu@linux.vnet.ibm.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64
> 
> On Mon,  1 Jul 2019 00:21:11 +0800
> Gavin Hu <gavin.hu@arm.com> wrote:
> 
> > DPDK has multiple use cases where the core repeatedly polls a location in
> > memory. This polling results in many cache and memory transactions.
> >
> > The Arm architecture provides the WFE (Wait For Event) instruction, which
> > allows the CPU core to enter a low-power state until it is woken up by an
> > update to the memory location being polled, thus reducing the cache and
> > memory transactions.
> >
> > x86 has the PAUSE hint instruction to reduce such overhead.
> >
> > The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
> > for a memory location to become equal to a given value'.
> >
> > For non-Arm platforms, these APIs are just wrappers around a do-while loop
> > with rte_pause, so there are no performance differences.
> >
> > For Arm platforms, use of WFE can be configured using the
> > CONFIG_RTE_USE_WFE option. It is disabled by default.
> >
> > Currently, use of WFE is supported only for aarch64 platforms. armv7
> > platforms do support the WFE instruction, but they require explicit wake-up
> > events (SEV) and are less performant.
> >
> > Testing shows that performance varies across different platforms, with
> > some showing degradation.
> >
> > CONFIG_RTE_USE_WFE should be enabled depending on the performance on the
> > target platforms.
> 
> How does this work if process is preempted?
WFE does not prevent preemption by the kernel, since preemption is driven by a timer/rescheduling interrupt.
Software using the WFE mechanism must tolerate spurious wake-up events, including timer/rescheduling interrupts, so the condition must be re-checked on exit from WFE (this is already included in the patch).
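
The re-check can be illustrated with a small single-threaded simulation.
fake_wfe_sketch is a hypothetical stand-in for WFE that returns on every
call (as a spurious wake-up would) and performs the awaited store only on
its third call; none of these names come from the patch. Because the loop
re-reads the location after each wake-up, early returns cost extra
iterations but do not end the wait prematurely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fake_core_sketch {
	uint32_t *addr;   /* location being waited on */
	unsigned wakeups; /* how many times the fake WFE has returned */
};

/* Hypothetical stand-in for WFE: every call is a wake-up, but only the
 * third one coincides with the store that the waiter is looking for.
 */
static void fake_wfe_sketch(struct fake_core_sketch *c)
{
	c->wakeups++;
	if (c->wakeups == 3)
		__atomic_store_n(c->addr, 1, __ATOMIC_RELEASE);
}

/* Wait until *addr equals expected; returns the number of wake-ups seen. */
static unsigned
wait_until_equal_sim_sketch(struct fake_core_sketch *c, uint32_t expected)
{
	assert(c != NULL && c->addr != NULL);

	/* Re-check the condition after every wake-up, spurious or not. */
	while (__atomic_load_n(c->addr, __ATOMIC_ACQUIRE) != expected)
		fake_wfe_sketch(c);

	return c->wakeups;
}
```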


* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
  2019-06-30 20:27   ` Stephen Hemminger
@ 2019-07-01  9:58   ` Pavan Nikhilesh Bhagavatula
  2019-07-02 14:08     ` Gavin Hu (Arm Technology China)
  1 sibling, 1 reply; 163+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-01  9:58 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, hemant.agrawal,
	bruce.richardson, chaozhu, Honnappa.Nagarahalli, nd

Hi Gavin,

>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Sunday, June 30, 2019 9:51 PM
>To: dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran
><jerinj@marvell.com>; hemant.agrawal@nxp.com;
>bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com;
>Honnappa.Nagarahalli@arm.com; nd@arm.com; gavin.hu@arm.com
>Subject: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
>
>The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
>for a memory location to become equal to a given value'.
>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>Reviewed-by: Steve Capper <steve.capper@arm.com>
>Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>---
> .../common/include/arch/arm/rte_pause_64.h         | 143
>+++++++++++++++++++++
> lib/librte_eal/common/include/generic/rte_pause.h  |  20 +++
> 2 files changed, 163 insertions(+)
>
>diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
>b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
>index 93895d3..0095da6 100644
>--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
>+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
>@@ -17,6 +17,149 @@ static inline void rte_pause(void)
> 	asm volatile("yield" ::: "memory");
> }
>
>+#ifdef RTE_USE_WFE
>+#define rte_wait_until_equal_relaxed(addr, expected) do {\
>+		typeof(*addr) tmp;  \
>+		if (__builtin_constant_p((expected))) \
>+			do { \
>+				if (sizeof(*(addr)) == 16)\
>+					asm volatile(  \
>+						"sevl\n"  \
>+						"1:	 wfe\n"  \
>+						"ldxrh  %w0, %1\n"  \
>+						"cmp	%w0, %w2\n"  \
>+						"bne	1b\n"  \
>+						: "=&r"(tmp)  \
>+						: "Q"(*addr),
>"i"(expected)  \
>+						: "cc", "memory");  \

Can we have an early exit here, i.e. instead of going directly to wfe, can we first check the condition and then fall through?
Something like:
		asm volatile("	ldxrh	%w0, %1	\n"
			        "    cmp	%w0, %w2	\n"
			        "	b.eq  	2f		\n"
			        "1: wfe			\n"
			        "    ldxrh	%w0, %1	\n"
			        "    cmp	%w0, %w2	\n"
			        "    b.ne	1b		\n"
			        "2:				\n"
			        :::);

Regards,
Pavan.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-07-01  7:43       ` Thomas Monjalon
@ 2019-07-02 14:07         ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-02 14:07 UTC (permalink / raw)
  To: thomas
  Cc: Stephen Hemminger, dev, jerinj, hemant.agrawal, bruce.richardson,
	chaozhu, Honnappa Nagarahalli, nd, gaetan.rivet

Hi Thomas,
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Monday, July 1, 2019 3:43 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>; dev@dpdk.org;
> jerinj@marvell.com; hemant.agrawal@nxp.com;
> bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>;
> gaetan.rivet@6wind.com
> Subject: Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
> 
> 01/07/2019 09:16, Gavin Hu (Arm Technology China):
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > > Gavin Hu <gavin.hu@arm.com> wrote:
> > >
> > > > +#ifdef RTE_USE_WFE
> > > > +#define rte_wait_until_equal_relaxed(addr, expected) do {\
> [...]
> > > That is a hot mess.
> > > Macros are harder to maintain and offer no benefit over inline functions.
> > >
> > During our internal review, I once used C11 _Generic to generalize the API
> > to take different types of arguments.
> 
> Gavin, the question is about macros versus functions.
> Please, could you convert it into an inline function?
Sure, I will do it in the next version.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-07-01  9:58   ` Pavan Nikhilesh Bhagavatula
@ 2019-07-02 14:08     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-02 14:08 UTC (permalink / raw)
  To: Pavan Nikhilesh Bhagavatula, dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa Nagarahalli, nd

Hi Pavan,

> -----Original Message-----
> From: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>
> Sent: Monday, July 1, 2019 5:59 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; jerinj@marvell.com; hemant.agrawal@nxp.com;
> bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
> 
> Hi Gavin,
> 
> >-----Original Message-----
> >From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
> >Sent: Sunday, June 30, 2019 9:51 PM
> >To: dev@dpdk.org
> >Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran
> ><jerinj@marvell.com>; hemant.agrawal@nxp.com;
> >bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com;
> >Honnappa.Nagarahalli@arm.com; nd@arm.com; gavin.hu@arm.com
> >Subject: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
> >
> >The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
> >for a memory location to become equal to a given value'.
> >
> >Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> >Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >Reviewed-by: Steve Capper <steve.capper@arm.com>
> >Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> >Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >---
> > .../common/include/arch/arm/rte_pause_64.h         | 143
> >+++++++++++++++++++++
> > lib/librte_eal/common/include/generic/rte_pause.h  |  20 +++
> > 2 files changed, 163 insertions(+)
> >
> >diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> >b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> >index 93895d3..0095da6 100644
> >--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> >+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> >@@ -17,6 +17,149 @@ static inline void rte_pause(void)
> > 	asm volatile("yield" ::: "memory");
> > }
> >
> >+#ifdef RTE_USE_WFE
> >+#define rte_wait_until_equal_relaxed(addr, expected) do {\
> >+		typeof(*addr) tmp;  \
> >+		if (__builtin_constant_p((expected))) \
> >+			do { \
> >+				if (sizeof(*(addr)) == 16)\
> >+					asm volatile(  \
> >+						"sevl\n"  \
> >+						"1:	 wfe\n"  \
> >+						"ldxrh  %w0, %1\n"  \
> >+						"cmp	%w0, %w2\n"  \
> >+						"bne	1b\n"  \
> >+						: "=&r"(tmp)  \
> >+						: "Q"(*addr),
> >"i"(expected)  \
> >+						: "cc", "memory");  \
> 
> Can we have an early exit here, i.e. instead of going directly to wfe, can we
> first check the condition and then fall through?
> Something like:
> 		asm volatile("	ldxrh	%w0, %1	\n"
> 			        "    cmp	%w0, %w2	\n"
> 			        "	b.eq  	2f		\n"
> 			        "1: wfe			\n"
> 			        "    ldxrh	%w0, %1	\n"
> 			        "    cmp	%w0, %w2	\n"
> 			        "    b.ne	1b		\n"
> 			        "2:				\n"
> 			        :::);
> 
> Regards,
> Pavan.
OK, I will do it in the next version.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [RFC v2 0/5] use WFE for locks and ring on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (5 preceding siblings ...)
  2019-06-30 20:29 ` [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Stephen Hemminger
@ 2019-07-03  8:58 ` " Gavin Hu
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (74 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
allows the CPU core to enter a low power state until it is woken up by an
update to the memory location being polled, thus reducing the cache and
memory transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a do-while loop
with rte_pause, so there are no performance differences.

For Arm platforms, use of WFE can be configured using CONFIG_RTE_USE_WFE
option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (SEV) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_USE_WFE should be enabled depending on the performance on the
target platforms.

V2:
* Use inline functions instead of macros
* Add load and compare at the beginning of the APIs
* Fix some style errors in inline asm

V1:
* Add the new APIs and use it for ring and locks

Gavin Hu (5):
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  spinlock: use wfe to reduce contention on aarch64
  config: add WFE config entry for aarch64

 config/arm/meson.build                             |   1 +
 config/common_armv8a_linux                         |   6 ++
 .../common/include/arch/arm/rte_atomic_64.h        |   4 +
 .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
 .../common/include/arch/arm/rte_spinlock.h         |  25 +++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
 .../common/include/generic/rte_spinlock.h          |   2 +-
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 lib/librte_ring/rte_ring_c11_mem.h                 |   4 +-
 lib/librte_ring/rte_ring_generic.h                 |   3 +-
 10 files changed, 185 insertions(+), 8 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (6 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 " Gavin Hu
@ 2019-07-03  8:58 ` Gavin Hu
  2019-07-20  6:46   ` [dpdk-dev] [EXT] " Pavan Nikhilesh Bhagavatula
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (73 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

The rte_wait_until_equalxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 .../common/include/arch/arm/rte_atomic_64.h        |   4 +
 .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
 3 files changed, 148 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 97060e4..8d742c6 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -15,8 +15,12 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
+#ifndef dsb
 #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#endif
+#ifndef dmb
 #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
+#endif
 
 #define rte_mb() dsb(sy)
 
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..1f7be0a 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -17,6 +17,112 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_USE_WFE
+/* Wait for *addr to be updated with expected value */
+static __rte_always_inline void
+rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
+{
+	uint16_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxrh %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+static __rte_always_inline void
+rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
+{
+	uint32_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxr	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxr	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxr  %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxr  %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+static __rte_always_inline void
+rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int memorder)
+{
+	uint64_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxr	%x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxr	%x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxr  %x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxr  %x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..8f5f025 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -4,7 +4,6 @@
 
 #ifndef _RTE_PAUSE_H_
 #define _RTE_PAUSE_H_
-
 /**
  * @file
  *
@@ -12,6 +11,10 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +23,38 @@
  */
 static inline void rte_pause(void);
 
+#if !defined(RTE_USE_WFE)
+#ifdef RTE_USE_C11_MEM_MODEL
+#define __rte_wait_until_equal(addr, expected, memorder) do {\
+	while (__atomic_load_n(addr, memorder) != expected) \
+		rte_pause();\
+} while (0)
+#else
+#define __rte_wait_until_equal(addr, expected, memorder) do {\
+	while (*addr != expected)\
+		rte_pause();\
+	if (memorder != __ATOMIC_RELAXED)\
+		rte_smp_rmb();\
+} while (0)
+#endif
+
+static __rte_always_inline void
+rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+
+static __rte_always_inline void
+rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+
+static __rte_always_inline void
+rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+#endif /* RTE_USE_WFE */
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (7 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-07-03  8:58 ` Gavin Hu
  2019-07-20  6:57   ` Pavan Nikhilesh Bhagavatula
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (72 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

While using a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

Running ticketlock_autotest on ThunderX2 with different numbers of cores
and depths of rings showed performance gains of 3% to 8%.
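
As a rough illustration of where the wait sits in a ticket lock, here is a self-contained C sketch. It is not the DPDK implementation: demo_ticketlock and the demo_* helpers are hypothetical names, the struct layout is simplified, and the wait helper is a plain acquire-polling stand-in for rte_wait_until_equal16().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ticket lock layout; the real rte_ticketlock_t differs. */
struct demo_ticketlock {
	uint16_t current;	/* ticket being served */
	uint16_t next;		/* next ticket to hand out */
};

/* Portable stand-in for rte_wait_until_equal16(): plain acquire polling. */
static inline void
demo_wait_until_equal16(uint16_t *addr, uint16_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		;	/* a pause or WFE hint would go here */
}

static inline void
demo_ticketlock_lock(struct demo_ticketlock *tl)
{
	/* Take a ticket, then wait until that ticket is being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	demo_wait_until_equal16(&tl->current, me);
}

static inline void
demo_ticketlock_unlock(struct demo_ticketlock *tl)
{
	uint16_t served = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	/* Unlocking a lock nobody holds would be a caller bug. */
	assert(served != __atomic_load_n(&tl->next, __ATOMIC_RELAXED));
	/* Only the holder writes 'current', so load + store is enough. */
	__atomic_store_n(&tl->current, (uint16_t)(served + 1),
			 __ATOMIC_RELEASE);
}
```

Each waiter polls for one specific value of 'current', which is why a "wait until equal" primitive fits this lock directly.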

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index 191146f..8fa1f62 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -64,8 +64,7 @@ static inline __rte_experimental void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [RFC v2 3/5] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (8 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-07-03  8:58 ` " Gavin Hu
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention " Gavin Hu
                   ` (71 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

Instead of polling for the tail to be updated, use the wfe instruction.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 4 ++--
 lib/librte_ring/rte_ring_generic.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..037811e 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..570765c 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
 	ht->tail = new_val;
 }
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (9 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-07-03  8:58 ` " Gavin Hu
  2019-07-20  6:59   ` Pavan Nikhilesh Bhagavatula
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64 Gavin Hu
                   ` (70 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

When acquiring a spinlock, cores repeatedly poll the lock variable.
On aarch64 this polling is replaced by a WFE-based wait in inline asm.

A 5% to 10% performance gain was measured by running spinlock_autotest on
14 isolated cores of ThunderX2.
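
For readers less familiar with the exclusive load/store idiom, the shape of the acquire path can be sketched in portable C. This is only an approximation under stated assumptions: demo_spinlock and the demo_* functions are hypothetical names, a compare-and-swap stands in for the ldaxr/stxr pair, and the inner read-only loop marks the place where the asm version waits in WFE.

```c
#include <assert.h>
#include <stddef.h>

struct demo_spinlock {
	int locked;	/* 0 = free, 1 = held */
};

static inline int
demo_spinlock_trylock(struct demo_spinlock *sl)
{
	int free_val = 0;

	/* Roughly the ldaxr/stxr pair: claim the lock only if it is free. */
	return __atomic_compare_exchange_n(&sl->locked, &free_val, 1, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static inline void
demo_spinlock_lock(struct demo_spinlock *sl)
{
	assert(sl != NULL);
	while (!demo_spinlock_trylock(sl)) {
		/* Read-only wait; this is where WFE would park the core. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED) != 0)
			;
	}
}

static inline void
demo_spinlock_unlock(struct demo_spinlock *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```

Waiting on loads only, and attempting the store once the lock looks free, keeps waiters from repeatedly taking the cache line exclusive while the lock is held.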

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++++++
 .../common/include/generic/rte_spinlock.h          |  2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..f25d17f 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,31 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"sevl\n"
+		"1:	wfe\n"
+		"2:	ldaxr %w[tmp], %w[locked]\n"
+		"cbnz   %w[tmp], 1b\n"
+		"stxr   %w[tmp], %w[one], %w[locked]\n"
+		"cbnz   %w[tmp], 2b\n"
+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
+		: [one] "r" (1)
+		: "cc", "memory");
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 87ae7a4..cf4f15b 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl);
 
-#ifdef RTE_FORCE_INTRINSICS
+#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (10 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-07-03  8:58 ` Gavin Hu
  2019-07-20  7:03   ` Pavan Nikhilesh Bhagavatula
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (69 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
It can be enabled selectively based on performance benchmarking.
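
One thing worth double-checking when wiring this option into the headers, sketched under the assumption that the build system writes a disabled setting out as a macro defined to 0: an #ifdef test is true for any defined macro, whatever its value, whereas #if looks at the value. DEMO_USE_WFE and the helper functions below are made up for illustration and are not part of the patch.

```c
#include <assert.h>

/* Hypothetical macro standing in for a setting written out as 0. */
#define DEMO_USE_WFE 0

static inline int
demo_selected_by_ifdef(void)
{
#ifdef DEMO_USE_WFE
	return 1;	/* taken: the macro is defined, its value is ignored */
#else
	return 0;
#endif
}

static inline int
demo_selected_by_if(void)
{
#if DEMO_USE_WFE
	return 1;
#else
	return 0;	/* taken: the macro evaluates to 0 */
#endif
}

/* The two guards disagree for a macro that is defined as 0. */
static inline void
demo_check(void)
{
	assert(demo_selected_by_ifdef() == 1);
	assert(demo_selected_by_if() == 0);
}
```

So a "disabled by default" entry only keeps the WFE path out if the macro is left undefined, or if the guards test its value.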

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 config/arm/meson.build     | 1 +
 config/common_armv8a_linux | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 6fa06a1..939d60e 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
 impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
 
 dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
+dpdk_conf.set('RTE_USE_WFE', 0)
 
 if not dpdk_conf.get('RTE_ARCH_64')
 	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
index 72091de..ae87a87 100644
--- a/config/common_armv8a_linux
+++ b/config/common_armv8a_linux
@@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
 
 CONFIG_RTE_FORCE_INTRINSICS=y
 
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs.
+# Calling these APIs puts the core into a low power state while waiting
+# for the memory location to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_USE_WFE=n
+
 # Maximum available cache line size in arm64 implementations.
 # Setting to maximum available cache line size in generic config
 # to address minimum DMA alignment across all arm64 implementations.
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [EXT] [RFC v2 1/5] eal: add the APIs to wait until equal
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-07-20  6:46   ` " Pavan Nikhilesh Bhagavatula
  0 siblings, 0 replies; 163+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-20  6:46 UTC (permalink / raw)
  To: Gavin Hu, dev; +Cc: nd, Jerin Jacob Kollanukkaran



>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Wednesday, July 3, 2019 2:29 PM
>To: dev@dpdk.org
>Cc: nd@arm.com
>Subject: [EXT] [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until
>equal
>
>External Email
>
>----------------------------------------------------------------------
>The rte_wait_until_equalxx APIs abstract the functionality of 'polling
>for a memory location to become equal to a given value'.
>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>Reviewed-by: Steve Capper <steve.capper@arm.com>
>Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>

>---
> .../common/include/arch/arm/rte_atomic_64.h        |   4 +
> .../common/include/arch/arm/rte_pause_64.h         | 106
>+++++++++++++++++++++
> lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
> 3 files changed, 148 insertions(+), 1 deletion(-)
>


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-07-20  6:57   ` Pavan Nikhilesh Bhagavatula
  0 siblings, 0 replies; 163+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-20  6:57 UTC (permalink / raw)
  To: Gavin Hu, dev; +Cc: nd



>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Wednesday, July 3, 2019 2:29 PM
>To: dev@dpdk.org
>Cc: nd@arm.com
>Subject: [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce
>contention on aarch64
>
>While using ticket lock, cores repeatedly poll the lock variable.
>This is replaced by rte_wait_until_equal API.
>
>Running ticketlock_autotest on ThunderX2, with different numbers of
>cores
>and depths of rings, 3%~8% performance gains were measured.

Tested on octeontx2 board.

>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
>---
> lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
>diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h
>b/lib/librte_eal/common/include/generic/rte_ticketlock.h
>index 191146f..8fa1f62 100644
>--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
>+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
>@@ -64,8 +64,7 @@ static inline __rte_experimental void
> rte_ticketlock_lock(rte_ticketlock_t *tl)
> {
> 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1,
>__ATOMIC_RELAXED);
>-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE)
>!= me)
>-		rte_pause();
>+	rte_wait_until_equal16(&tl->s.current, me,
>__ATOMIC_ACQUIRE);
> }
>
> /**
>--
>2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-07-20  6:59   ` Pavan Nikhilesh Bhagavatula
  0 siblings, 0 replies; 163+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-20  6:59 UTC (permalink / raw)
  To: Gavin Hu, dev; +Cc: nd



>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Wednesday, July 3, 2019 2:29 PM
>To: dev@dpdk.org
>Cc: nd@arm.com
>Subject: [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention
>on aarch64
>
>In acquiring a spinlock, cores repeatedly poll the lock variable.
>This is replaced by rte_wait_until_equal API.
>
>5~10% performance gain was measured by running spinlock_autotest
>on
>14 isolated cores of ThunderX2.

Tested on octeontx2 board.

>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>Reviewed-by: Phil Yang <phil.yang@arm.com>
>Reviewed-by: Steve Capper <steve.capper@arm.com>
>Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>

>---
> .../common/include/arch/arm/rte_spinlock.h         | 25
>++++++++++++++++++++++
> .../common/include/generic/rte_spinlock.h          |  2 +-
> 2 files changed, 26 insertions(+), 1 deletion(-)
>
>diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
>b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
>index 1a6916b..f25d17f 100644
>--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
>+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
>@@ -16,6 +16,31 @@ extern "C" {
> #include <rte_common.h>
> #include "generic/rte_spinlock.h"
>
>+/* armv7a does support WFE, but an explicit wake-up signal using SEV
>is
>+ * required (must be preceded by DSB to drain the store buffer) and
>+ * this is less performant, so keep armv7a implementation unchanged.
>+ */
>+#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)
>+static inline void
>+rte_spinlock_lock(rte_spinlock_t *sl)
>+{
>+	unsigned int tmp;
>+	/*
>http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
>+	 * faqs/ka16809.html
>+	 */
>+	asm volatile(
>+		"sevl\n"
>+		"1:	wfe\n"
>+		"2:	ldaxr %w[tmp], %w[locked]\n"
>+		"cbnz   %w[tmp], 1b\n"
>+		"stxr   %w[tmp], %w[one], %w[locked]\n"
>+		"cbnz   %w[tmp], 2b\n"
>+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
>+		: [one] "r" (1)
>+		: "cc", "memory");
>+}
>+#endif
>+
> static inline int rte_tm_supported(void)
> {
> 	return 0;
>diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
>b/lib/librte_eal/common/include/generic/rte_spinlock.h
>index 87ae7a4..cf4f15b 100644
>--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
>+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
>@@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
> static inline void
> rte_spinlock_lock(rte_spinlock_t *sl);
>
>-#ifdef RTE_FORCE_INTRINSICS
>+#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)
> static inline void
> rte_spinlock_lock(rte_spinlock_t *sl)
> {
>--
>2.7.4



* Re: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-07-20  7:03   ` Pavan Nikhilesh Bhagavatula
  2019-07-23 15:47     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-20  7:03 UTC (permalink / raw)
  To: Gavin Hu, dev; +Cc: nd



>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Wednesday, July 3, 2019 2:29 PM
>To: dev@dpdk.org
>Cc: nd@arm.com
>Subject: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for
>aarch64
>
>Add the RTE_USE_WFE configuration entry for aarch64, disabled by
>default.
>It can be enabled selectively based on the performance benchmarking.
>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>Reviewed-by: Steve Capper <steve.capper@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>

>---
> config/arm/meson.build     | 1 +
> config/common_armv8a_linux | 6 ++++++
> 2 files changed, 7 insertions(+)
>



* [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (11 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-07-23 15:43 ` Gavin Hu
  2019-07-23 19:15   ` Honnappa Nagarahalli
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 1/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (68 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
allows the CPU core to enter a low power state until woken up by an update
to the memory location being polled, thus reducing the cache and memory
transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a do-while loop
with rte_pause, so there are no performance differences.

For Arm platforms, use of WFE can be configured using the
CONFIG_RTE_USE_WFE option. It is disabled by default.

Currently, use of WFE is supported only on aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (sev) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_USE_WFE should be enabled depending on the performance on the
target platforms.
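
For readers who want the non-WFE behaviour in one place, here is a minimal
sketch of the "poll with pause" fallback described above. It is not the
code from this series: the sketch_ names are hypothetical stand-ins, and
the pause helper only approximates rte_pause() using compiler builtins.

```c
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(); an assumption, not DPDK code. */
static inline void sketch_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();              /* x86 PAUSE hint */
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#endif
	/* Other architectures: plain busy loop. */
}

/* Spin until *addr equals expected, loading with the requested ordering. */
static inline void
sketch_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
			  int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		sketch_pause();
}
```

A caller would pass __ATOMIC_ACQUIRE when the loads that follow must not be
reordered before the wait, and __ATOMIC_RELAXED otherwise.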

V3:
* Convert RFCs to patches
V2:
* Use inline functions instead of macros
* Add load and compare in the beginning of the APIs
* Fix some style errors in asm inline
V1:
* Add the new APIs and use it for ring and locks

Gavin Hu (5):
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  spinlock: use wfe to reduce contention on aarch64
  config: add WFE config entry for aarch64

 config/arm/meson.build                             |   1 +
 config/common_armv8a_linux                         |   6 ++
 .../common/include/arch/arm/rte_atomic_64.h        |   4 +
 .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
 .../common/include/arch/arm/rte_spinlock.h         |  25 +++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
 .../common/include/generic/rte_spinlock.h          |   2 +-
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 lib/librte_ring/rte_ring_c11_mem.h                 |   4 +-
 lib/librte_ring/rte_ring_generic.h                 |   3 +-
 10 files changed, 185 insertions(+), 8 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH v3 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (12 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64 Gavin Hu
@ 2019-07-23 15:43 ` Gavin Hu
  2019-07-24 11:52   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (67 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

The rte_wait_until_equal16/32/64 APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_atomic_64.h        |   4 +
 .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
 3 files changed, 148 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 97060e4..8d742c6 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -15,8 +15,12 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
+#ifndef dsb
 #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#endif
+#ifndef dmb
 #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
+#endif
 
 #define rte_mb() dsb(sy)
 
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..1f7be0a 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -17,6 +17,112 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_USE_WFE
+/* Wait for *addr to be updated with expected value */
+static __rte_always_inline void
+rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
+{
+	uint16_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxrh %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+static __rte_always_inline void
+rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
+{
+	uint32_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxr	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxr	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxr  %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxr  %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+static __rte_always_inline void
+rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int memorder)
+{
+	uint64_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxr	%x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxr	%x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxr  %x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxr  %x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..8f5f025 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -4,7 +4,6 @@
 
 #ifndef _RTE_PAUSE_H_
 #define _RTE_PAUSE_H_
-
 /**
  * @file
  *
@@ -12,6 +11,10 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +23,38 @@
  */
 static inline void rte_pause(void);
 
+#if !defined(RTE_USE_WFE)
+#ifdef RTE_USE_C11_MEM_MODEL
+#define __rte_wait_until_equal(addr, expected, memorder) do {\
+	while (__atomic_load_n(addr, memorder) != expected) \
+		rte_pause();\
+} while (0)
+#else
+#define __rte_wait_until_equal(addr, expected, memorder) do {\
+	while (*addr != expected)\
+		rte_pause();\
+	if (memorder != __ATOMIC_RELAXED)\
+		rte_smp_rmb();\
+} while (0)
+#endif
+
+static __rte_always_inline void
+rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+
+static __rte_always_inline void
+rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+
+static __rte_always_inline void
+rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+#endif /* RTE_USE_WFE */
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4



* [dpdk-dev] [PATCH v3 2/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (13 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-07-23 15:43 ` Gavin Hu
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (66 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

While acquiring a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

Running ticketlock_autotest on ThunderX2, with different numbers of cores
and depths of rings, 3%~8% performance gains were measured.
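
As a rough illustration of where the wait sits in a ticket lock, here is a
self-contained sketch. The struct layout and the sketch_ names are
assumptions made for the example and are not the rte_ticketlock
definitions; the inner loop is the part that the new API takes over.

```c
#include <stdint.h>

/* Hypothetical ticket lock: 'next' hands out tickets, 'current' is served. */
struct sketch_ticketlock {
	uint16_t current;
	uint16_t next;
};

static inline void sketch_ticketlock_lock(struct sketch_ticketlock *tl)
{
	/* Take a ticket; no ordering is needed for this increment. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	/* Wait for our turn. This is the polling loop that a
	 * wait-until-equal helper (WFE-based or pause-based) replaces.
	 */
	while (__atomic_load_n(&tl->current, __ATOMIC_ACQUIRE) != me)
		;
}

static inline void sketch_ticketlock_unlock(struct sketch_ticketlock *tl)
{
	uint16_t cur = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	/* Publish the critical section and serve the next ticket. */
	__atomic_store_n(&tl->current, (uint16_t)(cur + 1), __ATOMIC_RELEASE);
}
```

Each waiter polls for a different value of the same variable, which is why
a single "wait until *addr == expected" primitive fits this lock well.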

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..f0821f2 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.7.4



* [dpdk-dev] [PATCH v3 3/5] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (14 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-07-23 15:43 ` " Gavin Hu
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 4/5] spinlock: use wfe to reduce contention " Gavin Hu
                   ` (65 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

Instead of polling for tail to be updated, use wfe instruction.
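
To make the tail handshake concrete, here is a small sketch of the
multi-producer/consumer tail update. The sketch_ names and the two-field
struct are assumptions for illustration only and do not mirror the full
rte_ring_headtail layout.

```c
#include <stdint.h>

/* Hypothetical, reduced head/tail pair. */
struct sketch_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

static inline void
sketch_update_tail(struct sketch_headtail *ht, uint32_t old_val,
		   uint32_t new_val, int single)
{
	/* With several producers (or consumers), each one must wait until
	 * the tail reaches the start of its own reservation, so that tail
	 * updates happen in reservation order. This is the wait that the
	 * patch routes through rte_wait_until_equal32().
	 */
	if (!single)
		while (__atomic_load_n(&ht->tail, __ATOMIC_RELAXED) != old_val)
			;

	/* Release store makes the written ring slots visible first. */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```

The wait itself is relaxed because the ordering that matters is carried by
the release store that follows it.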

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 4 ++--
 lib/librte_ring/rte_ring_generic.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..037811e 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..570765c 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
 	ht->tail = new_val;
 }
-- 
2.7.4



* [dpdk-dev] [PATCH v3 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (15 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-07-23 15:43 ` " Gavin Hu
  2019-07-24 12:17   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64 Gavin Hu
                   ` (64 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

When acquiring a spinlock, cores repeatedly poll the lock variable.
This polling is replaced by a WFE-based wait (sevl/wfe/ldaxr) on aarch64.

5~10% performance gain was measured by running spinlock_autotest on
14 isolated cores of ThunderX2.
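
For readers less familiar with the exclusive load/store sequence in the
patch, the following is a portable C approximation of the same control
flow. It is a sketch under assumptions: the sketch_ names are hypothetical,
the compiler builtins stand in for ldaxr/stxr, and the busy loop stands in
for the wfe wait.

```c
/* Hypothetical spinlock type: 0 = free, 1 = held. */
typedef struct {
	volatile int locked;
} sketch_spinlock_t;

static inline void sketch_spinlock_lock(sketch_spinlock_t *sl)
{
	int expected;

	for (;;) {
		/* Wait phase: corresponds to "wfe; ldaxr; cbnz 1b" in the
		 * asm, where the core sleeps until the lock word changes.
		 */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED) != 0)
			;

		/* Claim phase: corresponds to "stxr; cbnz 2b". In the asm a
		 * failed store-exclusive reloads without another wfe; here a
		 * failed compare-exchange simply restarts the loop.
		 */
		expected = 0;
		if (__atomic_compare_exchange_n(&sl->locked, &expected, 1, 0,
						__ATOMIC_ACQUIRE,
						__ATOMIC_RELAXED))
			return;
	}
}

static inline void sketch_spinlock_unlock(sketch_spinlock_t *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```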

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++++++
 .../common/include/generic/rte_spinlock.h          |  2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..f25d17f 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,31 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"sevl\n"
+		"1:	wfe\n"
+		"2:	ldaxr %w[tmp], %w[locked]\n"
+		"cbnz   %w[tmp], 1b\n"
+		"stxr   %w[tmp], %w[one], %w[locked]\n"
+		"cbnz   %w[tmp], 2b\n"
+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
+		: [one] "r" (1)
+		: "cc", "memory");
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 87ae7a4..cf4f15b 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl);
 
-#ifdef RTE_FORCE_INTRINSICS
+#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-- 
2.7.4



* [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (16 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-07-23 15:43 ` Gavin Hu
  2019-07-23 18:05   ` Stephen Hemminger
  2019-07-24 12:25   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 0/6] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (63 subsequent siblings)
  81 siblings, 2 replies; 163+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
It can be enabled selectively based on the performance benchmarking.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 config/arm/meson.build     | 1 +
 config/common_armv8a_linux | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..496813a 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
 impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
 
 dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
+dpdk_conf.set('RTE_USE_WFE', 0)
 
 if not dpdk_conf.get('RTE_ARCH_64')
 	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
index 481712e..48c7ab5 100644
--- a/config/common_armv8a_linux
+++ b/config/common_armv8a_linux
@@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
 
 CONFIG_RTE_FORCE_INTRINSICS=y
 
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs.
+# Calling these APIs puts the core into a low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_USE_WFE=n
+
 # Maximum available cache line size in arm64 implementations.
 # Setting to maximum available cache line size in generic config
 # to address minimum DMA alignment across all arm64 implementations.
-- 
2.7.4



* Re: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64
  2019-07-20  7:03   ` Pavan Nikhilesh Bhagavatula
@ 2019-07-23 15:47     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-23 15:47 UTC (permalink / raw)
  To: Pavan Nikhilesh Bhagavatula, dev, Stephen Hemminger, thomas; +Cc: nd

Hi Stephen,
> -----Original Message-----
> From: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>
> Sent: Saturday, July 20, 2019 3:03 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for
> aarch64
> 
> 
> 
> >-----Original Message-----
> >From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
> >Sent: Wednesday, July 3, 2019 2:29 PM
> >To: dev@dpdk.org
> >Cc: nd@arm.com
> >Subject: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for
> >aarch64
> >
> >Add the RTE_USE_WFE configuration entry for aarch64, disabled by
> >default.
> >It can be enabled selectively based on the performance benchmarking.
> >
> >Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> >Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >Reviewed-by: Steve Capper <steve.capper@arm.com>
> >Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Hi Stephen,
I just converted the RFCs to patches in V3. Could you check whether your comments on the RFCs have been addressed?
Thanks, Pavan, for the review and testing!
Best regards,
Gavin
> 
> >---
> > config/arm/meson.build     | 1 +
> > config/common_armv8a_linux | 6 ++++++
> > 2 files changed, 7 insertions(+)
> >



* Re: [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-07-23 18:05   ` Stephen Hemminger
  2019-07-23 19:10     ` Honnappa Nagarahalli
  2019-07-24 12:25   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  1 sibling, 1 reply; 163+ messages in thread
From: Stephen Hemminger @ 2019-07-23 18:05 UTC (permalink / raw)
  To: Gavin Hu; +Cc: dev, nd, thomas, jerinj, pbhagavatula, Honnappa.Nagarahalli

On Tue, 23 Jul 2019 23:43:46 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
> It can be enabled selectively based on the performance benchmarking.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  config/arm/meson.build     | 1 +
>  config/common_armv8a_linux | 6 ++++++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..496813a 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
>  impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
>  
>  dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
> +dpdk_conf.set('RTE_USE_WFE', 0)
>  
>  if not dpdk_conf.get('RTE_ARCH_64')
>  	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
> index 481712e..48c7ab5 100644
> --- a/config/common_armv8a_linux
> +++ b/config/common_armv8a_linux
> @@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
>  
>  CONFIG_RTE_FORCE_INTRINSICS=y
>  
> +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> +# calling these APIs put the cores enter low power state while waiting
> +# for the memory address to be become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_USE_WFE=n
> +
>  # Maximum available cache line size in arm64 implementations.
>  # Setting to maximum available cache line size in generic config
>  # to address minimum DMA alignment across all arm64 implementations.

Introducing config options is a maintenance nightmare.
How are distributions supposed to ship a package?
Does full regression test get done on both options?

The user should not be able to change this.


* Re: [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-07-23 18:05   ` Stephen Hemminger
@ 2019-07-23 19:10     ` Honnappa Nagarahalli
  2019-07-24 17:59       ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Honnappa Nagarahalli @ 2019-07-23 19:10 UTC (permalink / raw)
  To: Stephen Hemminger, Gavin Hu (Arm Technology China)
  Cc: dev, nd, thomas, jerinj, pbhagavatula, Honnappa Nagarahalli, nd

> 
> On Tue, 23 Jul 2019 23:43:46 +0800
> Gavin Hu <gavin.hu@arm.com> wrote:
> 
> > Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
> > It can be enabled selectively based on the performance benchmarking.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  config/arm/meson.build     | 1 +
> >  config/common_armv8a_linux | 6 ++++++
> >  2 files changed, 7 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build index
> > 979018e..496813a 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa,
> > machine_args_generic]
> >  impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
> >
> >  dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
> > +dpdk_conf.set('RTE_USE_WFE', 0)
> >
> >  if not dpdk_conf.get('RTE_ARCH_64')
> >  	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64) diff --git
> > a/config/common_armv8a_linux b/config/common_armv8a_linux index
> > 481712e..48c7ab5 100644
> > --- a/config/common_armv8a_linux
> > +++ b/config/common_armv8a_linux
> > @@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
> >
> >  CONFIG_RTE_FORCE_INTRINSICS=y
> >
> > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > +# calling these APIs put the cores enter low power state while
> > +waiting # for the memory address to be become equal to the expected value.
> > +# This is supported only by aarch64.
> > +CONFIG_RTE_USE_WFE=n
> > +
> >  # Maximum available cache line size in arm64 implementations.
> >  # Setting to maximum available cache line size in generic config  #
> > to address minimum DMA alignment across all arm64 implementations.
> 
> Introducing config options is a maintenance nightmare.
> How are distributions supposed to ship a package?
> Does full regression test get done on both options?
> 
> The user should not be able to change this.
I agree with these concerns. In our tests, we are finding that this patch does not result in performance improvements on all micro-architectures. Maybe these micro-architectures will evolve in the future, knowing that WFE is being used in DPDK. But at this point, it does not make sense to enable this by default. This means additional testing/regression with the flag enabled. We could add this to the Travis build (Travis yml file).

Currently, this patch will address use cases where the target hardware/environment is known during compilation.
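
Since the choice is made at compile time, how the flag is tested in the
headers matters. The sketch below shows a preprocessor detail worth
checking for the meson build, under the assumption that setting an integer
value of 0 in the build configuration produces "#define NAME 0" in the
generated config header. SKETCH_USE_WFE is a hypothetical stand-in, not the
real option name.

```c
/* Assumption: the generated build config contains this line when the
 * option is set to the integer 0 rather than left undefined.
 */
#define SKETCH_USE_WFE 0

/* #ifdef only asks whether the macro exists, so this branch is taken
 * even though the value is 0.
 */
#ifdef SKETCH_USE_WFE
#define SKETCH_IFDEF_TAKEN 1
#else
#define SKETCH_IFDEF_TAKEN 0
#endif

/* #if evaluates the value, so this branch is not taken for 0. */
#if SKETCH_USE_WFE
#define SKETCH_IF_TAKEN 1
#else
#define SKETCH_IF_TAKEN 0
#endif
```

If that assumption holds, guards written as #ifdef or defined() would
select the WFE path for a macro defined as 0, so either the guard or the
way the option is emitted may need adjusting.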


* Re: [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64 Gavin Hu
@ 2019-07-23 19:15   ` Honnappa Nagarahalli
  2019-07-23 21:27     ` Thomas Monjalon
  0 siblings, 1 reply; 163+ messages in thread
From: Honnappa Nagarahalli @ 2019-07-23 19:15 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula,
	Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd

Hi Gavin,
	I think this should have been V1 (I mean, no versioning, just 'PATCH'), since it has been converted to a patch. I think we should be able to resend it as V1 and mark this V3 as 'superseded'.

Hi Thomas,
	Please let us know if it is worth/helps fixing the version.

Thanks,
Honnappa

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Tuesday, July 23, 2019 10:44 AM
> To: dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; jerinj@marvell.com;
> pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>
> Subject: [PATCH v3 0/5] use WFE for locks and ring on aarch64
> 
> DPDK has multiple use cases where the core repeatedly polls a location in
> memory. This polling results in many cache and memory transactions.
> 
> Arm architecture provides WFE (Wait For Event) instruction, which allows the
> cpu core to enter a low power state until woken up by the update to the
> memory location being polled. Thus reducing the cache and memory
> transactions.
> 
> x86 has the PAUSE hint instruction to reduce such overhead.
> 
> The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling for a
> memory location to become equal to a given value'.
> 
> For non-Arm platforms, these APIs are just wrappers around do-while loop
> with rte_pause, so there are no performance differences.
> 
> For Arm platforms, use of WFE can be configured using
> CONFIG_RTE_USE_WFE option. It is disabled by default.
> 
> Currently, use of WFE is supported only for aarch64 platforms. armv7
> platforms do support the WFE instruction, but they require explicit wake up
> events(sev) and are less performannt.
> 
> Testing shows that, performance varies across different platforms, with some
> showing degradation.
> 
> CONFIG_RTE_USE_WFE should be enabled depending on the performance on
> the target platforms.
> 
> V3:
> * Convert RFCs to patches
> V2:
> * Use inline functions instead of marcos
> * Add load and compare in the beginning of the APIs
> * Fix some style errors in asm inline
> V1:
> * Add the new APIs and use it for ring and locks
> 
> Gavin Hu (5):
>   eal: add the APIs to wait until equal
>   ticketlock: use new API to reduce contention on aarch64
>   ring: use wfe to wait for ring tail update on aarch64
>   spinlock: use wfe to reduce contention on aarch64
>   config: add WFE config entry for aarch64
> 
>  config/arm/meson.build                             |   1 +
>  config/common_armv8a_linux                         |   6 ++
>  .../common/include/arch/arm/rte_atomic_64.h        |   4 +
>  .../common/include/arch/arm/rte_pause_64.h         | 106
> +++++++++++++++++++++
>  .../common/include/arch/arm/rte_spinlock.h         |  25 +++++
>  lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
>  .../common/include/generic/rte_spinlock.h          |   2 +-
>  .../common/include/generic/rte_ticketlock.h        |   3 +-
>  lib/librte_ring/rte_ring_c11_mem.h                 |   4 +-
>  lib/librte_ring/rte_ring_generic.h                 |   3 +-
>  10 files changed, 185 insertions(+), 8 deletions(-)
> 
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-07-23 19:15   ` Honnappa Nagarahalli
@ 2019-07-23 21:27     ` Thomas Monjalon
  2019-07-24  2:44       ` Honnappa Nagarahalli
  0 siblings, 1 reply; 163+ messages in thread
From: Thomas Monjalon @ 2019-07-23 21:27 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Gavin Hu (Arm Technology China), dev, nd, stephen, jerinj, pbhagavatula

23/07/2019 21:15, Honnappa Nagarahalli:
> Hi Gavin,
> 	I think this should have been V1 (I mean, no versioning, just 'PATCH'), since it is converted to patch. I think we should be able to resend it as V1 and mark this V3 as 'superseded'.
> 
> Hi Thomas,
> 	Please let us know if it is worth/helps fixing the version.

I don't follow why it should be v1.




^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-07-23 21:27     ` Thomas Monjalon
@ 2019-07-24  2:44       ` Honnappa Nagarahalli
  2019-07-24  7:43         ` Thomas Monjalon
  0 siblings, 1 reply; 163+ messages in thread
From: Honnappa Nagarahalli @ 2019-07-24  2:44 UTC (permalink / raw)
  To: thomas
  Cc: Gavin Hu (Arm Technology China),
	dev, nd, stephen, jerinj, pbhagavatula, Honnappa Nagarahalli, nd

> 
> 23/07/2019 21:15, Honnappa Nagarahalli:
> > Hi Gavin,
> > 	I think this should have been V1 (I mean, no versioning, just 'PATCH'),
> since it is converted to patch. I think we should be able to resend it as V1 and
> mark this V3 as 'superseded'.
> >
> > Hi Thomas,
> > 	Please let us know if it is worth/helps fixing the version.
> 
> I don't follow why it should be v1.
This patch series was an RFC (RFC v1 and RFC v2). Since it has been converted to a patch, I thought it should start with v1.

> 
> 


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-07-24  2:44       ` Honnappa Nagarahalli
@ 2019-07-24  7:43         ` Thomas Monjalon
  0 siblings, 0 replies; 163+ messages in thread
From: Thomas Monjalon @ 2019-07-24  7:43 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Gavin Hu (Arm Technology China), dev, nd, stephen, jerinj, pbhagavatula

24/07/2019 04:44, Honnappa Nagarahalli:
> > 23/07/2019 21:15, Honnappa Nagarahalli:
> > > Hi Gavin,
> > > 	I think this should have been V1 (I mean, no versioning, just 'PATCH'),
> > since it is converted to patch. I think we should be able to resend it as V1 and
> > mark this V3 as 'superseded'.
> > >
> > > Hi Thomas,
> > > 	Please let us know if it is worth/helps fixing the version.
> > 
> > I don't follow why it should be v1.
> 
> This patch series was a RFC (RFC V1 and RFC v2). It is converted to a patch, I thought it should start with V1.

No, it can keep incrementing; that is OK and clear.




^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-07-24 11:52   ` " Jerin Jacob Kollanukkaran
  2019-07-24 18:10     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-07-24 11:52 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, thomas, stephen, Pavan Nikhilesh Bhagavatula, Honnappa.Nagarahalli

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Tuesday, July 23, 2019 9:14 PM
> To: dev@dpdk.org
> Cc: nd@arm.com; thomas@monjalon.net; stephen@networkplumber.org;
> Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Pavan Nikhilesh
> Bhagavatula <pbhagavatula@marvell.com>;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com
> Subject: [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
> 
> The rte_wait_until_equalxx APIs abstract the functionality of 'polling for a
> memory location to become equal to a given value'.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  .../common/include/arch/arm/rte_atomic_64.h        |   4 +
>  .../common/include/arch/arm/rte_pause_64.h         | 106
> +++++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
>  3 files changed, 148 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> index 97060e4..8d742c6 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> @@ -15,8 +15,12 @@ extern "C" {
> 
>  #include "generic/rte_atomic.h"
> 
> +#ifndef dsb
>  #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
> +#endif
> +#ifndef dmb
>  #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
> +#endif

Is this change required? Please fix the root cause.
I do see a build error too.

In file included from /home/jerin/dpdk.org/build/include/rte_pause_64.h:13,
                 from /home/jerin/dpdk.org/build/include/rte_pause.h:13,
                 from /home/jerin/dpdk.org/build/include/generic/rte_spinlock.h:25,
                 from /home/jerin/dpdk.org/build/include/rte_spinlock.h:17,
                 from /home/jerin/dpdk.org/drivers/bus/fslmc/mc/mc_sys.c:10:
/home/jerin/dpdk.org/build/include/generic/rte_pause.h: In function 'rte_wait_until_equal16':
/home/jerin/dpdk.org/build/include/generic/rte_pause.h:44:49: error: macro "dmb" passed 1 arguments, but takes just 0
   44 |  __rte_wait_until_equal(addr, expected, memorder);

Command to reproduce(gcc 9.1)

rm -rf build && unset RTE_KERNELDIR && make -j  T=arm64-thunderx-linux-gcc  CROSS=aarch64-linux-gnu- && sed -ri    's,(CONFIG_RTE_KNI_KMOD=)y,\1n,' build/.config && sed -ri  's,(CONFIG_RTE_LIBRTE_VHOST_NUMA=)y,\1n,' build/.config &&  sed -ri  's,(CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=)y,\1n,' build/.config && sed -ri  's,(CONFIG_RTE_EAL_IGB_UIO=)y,\1n,' build/.config && CC="ccache gcc" make -j  test-build CROSS=aarch64-linux-gnu-
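
For context, a minimal sketch of how this kind of error can arise, assuming (this is not confirmed anywhere in the thread) that a header included earlier already defines a zero-argument dmb() macro. With the #ifndef guard from the patch, the one-argument definition is silently skipped, so any later dmb(opt) call hits the arity mismatch. All names below are illustrative only.

```c
#include <stdio.h>

/* Hypothetical earlier header: a zero-argument barrier macro. */
#define dmb() do { __asm__ __volatile__("" : : : "memory"); } while (0)

/* Guard as in the patch: skipped, because dmb is already defined. */
#ifndef dmb
#define dmb(opt) __asm__ __volatile__("dmb " #opt : : : "memory")
#define DMB_TAKES_OPTION 1
#else
#define DMB_TAKES_OPTION 0
#endif

/* Reports which definition is in effect: 1 for dmb(opt), 0 for dmb(). */
static int dmb_takes_option(void)
{
	return DMB_TAKES_OPTION;
}

/* Uses the macro in the only form that still compiles. */
static void barrier_demo(void)
{
	dmb();            /* matches the zero-argument definition */
	/* dmb(ishld); */ /* would fail: macro "dmb" passed 1 arguments, but takes just 0 */
}
```

The guard hides the redefinition warning but not the conflict itself, which is why fixing the root cause (the clashing macro name) is the cleaner route.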


> 
>  #define rte_mb() dsb(sy)
> 
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..1f7be0a 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -17,6 +17,112 @@ static inline void rte_pause(void)
>  	asm volatile("yield" ::: "memory");
>  }
> 
> +#ifdef RTE_USE_WFE

Do we need it to be under RTE_USE_WFE? If there is no harm, there is no need to add
conditional compilation; building it unconditionally also helps detect build errors, especially as the config is disabled by default.

> +/* Wait for *addr to be updated with expected value */
> +static __rte_always_inline void
> +rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected,
> +		int memorder)
> +{
> +	uint16_t tmp;
> +	if (memorder == __ATOMIC_RELAXED)
> +		asm volatile(
> +			"ldxrh	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldxrh	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +	else
> +		asm volatile(
> +			"ldaxrh %w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldaxrh	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
> +		int memorder)
> +{
> +	uint32_t tmp;
> +	if (memorder == __ATOMIC_RELAXED)
> +		asm volatile(
> +			"ldxr	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldxr	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +	else
> +		asm volatile(
> +			"ldaxr  %w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldaxr  %w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected,
> +		int memorder)
> +{
> +	uint64_t tmp;
> +	if (memorder == __ATOMIC_RELAXED)
> +		asm volatile(
> +			"ldxr	%x[tmp], %x[addr]\n"
> +			"cmp	%x[tmp], %x[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldxr	%x[tmp], %x[addr]\n"
> +			"cmp	%x[tmp], %x[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +	else
> +		asm volatile(
> +			"ldaxr  %x[tmp], %x[addr]\n"
> +			"cmp	%x[tmp], %x[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldaxr  %x[tmp], %x[addr]\n"
> +			"cmp	%x[tmp], %x[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");

Duplication of code. Please introduce a macro for the assembly skeleton.
Something like

http://patches.dpdk.org/patch/56949/
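
One possible shape for such a skeleton, as a rough sketch only: a single macro that stamps out the 16/32/64-bit variants. To keep it portable, the body here is a plain __atomic_load_n polling loop rather than the ldxr/wfe sequence from the patch, and cpu_relax_hint() is a hypothetical stand-in for rte_pause(). The wait_until_equalNN names are likewise illustrative, not the patch's API.

```c
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(); a compiler barrier is enough here. */
static inline void cpu_relax_hint(void)
{
	__asm__ __volatile__("" : : : "memory");
}

/* One skeleton, instantiated once per operand width. */
#define WAIT_UNTIL_EQUAL_DEF(bits) \
static inline void \
wait_until_equal##bits(volatile uint##bits##_t *addr, \
		uint##bits##_t expected, int memorder) \
{ \
	while (__atomic_load_n(addr, memorder) != expected) \
		cpu_relax_hint(); \
}

WAIT_UNTIL_EQUAL_DEF(16)
WAIT_UNTIL_EQUAL_DEF(32)
WAIT_UNTIL_EQUAL_DEF(64)
```

An aarch64 version could presumably follow the same pattern, with the load instruction and register width passed as macro parameters instead of the bit count alone.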

> +}
> +
> +#endif
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/common/include/generic/rte_pause.h
> b/lib/librte_eal/common/include/generic/rte_pause.h
> index 52bd4db..8f5f025 100644
> --- a/lib/librte_eal/common/include/generic/rte_pause.h
> +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> @@ -4,7 +4,6 @@
> 
>  #ifndef _RTE_PAUSE_H_
>  #define _RTE_PAUSE_H_
> -
>  /**
>   * @file
>   *
> @@ -12,6 +11,10 @@
>   *
>   */
> 
> +#include <stdint.h>
> +#include <rte_common.h>
> +#include <rte_atomic.h>
> +
>  /**
>   * Pause CPU execution for a short while
>   *
> @@ -20,4 +23,38 @@
>   */
>  static inline void rte_pause(void);
> 
> +#if !defined(RTE_USE_WFE)
> +#ifdef RTE_USE_C11_MEM_MODEL
> +#define __rte_wait_until_equal(addr, expected, memorder) do {\
> +	while (__atomic_load_n(addr, memorder) != expected) \
> +		rte_pause();\
> +} while (0)
> +#else
> +#define __rte_wait_until_equal(addr, expected, memorder) do {\
> +	while (*addr != expected)\
> +		rte_pause();\
> +	if (memorder != __ATOMIC_RELAXED)\
> +		rte_smp_rmb();\

Is this correct w.r.t. all memorder values?
If there is no specific gain from the non-C11 memory model, let's have only the C11 memory model,
i.e. remove RTE_USE_C11_MEM_MODEL.

> +} while (0)
> +#endif
> +

Spotted a public API. Let's have a prototype with very good documentation of the
API details.
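
As a rough illustration of the kind of documentation being asked for, a documented prototype could look like the sketch below. The wording, the example_ prefix, and the restriction on memorder values are assumptions made for illustration and are not taken from the patch.

```c
#include <stdint.h>

/**
 * Wait until *addr becomes equal to expected (illustrative sketch).
 *
 * The caller's thread polls the location and returns only once the
 * loaded value matches; there is no timeout.
 *
 * @param addr
 *   Pointer to the 32-bit memory location to poll.
 * @param expected
 *   Value to wait for.
 * @param memorder
 *   Memory ordering applied to the load. Assumed here to be either
 *   __ATOMIC_RELAXED or __ATOMIC_ACQUIRE.
 */
static inline void
example_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		; /* a real implementation would pause or use WFE here */
}
```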


> +static __rte_always_inline void
> +rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected,
> +		int memorder)
> +{
> +	__rte_wait_until_equal(addr, expected, memorder);
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
> +		int memorder)
> +{
> +	__rte_wait_until_equal(addr, expected, memorder);
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected,
> +		int memorder)
> +{
> +	__rte_wait_until_equal(addr, expected, memorder);
> +}
> +#endif /* RTE_USE_WFE */
> +
>  #endif /* _RTE_PAUSE_H_ */
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-07-24 12:17   ` " Jerin Jacob Kollanukkaran
  0 siblings, 0 replies; 163+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-07-24 12:17 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, thomas, stephen, Pavan Nikhilesh Bhagavatula, Honnappa.Nagarahalli

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Tuesday, July 23, 2019 9:14 PM
> To: dev@dpdk.org
> Cc: nd@arm.com; thomas@monjalon.net; stephen@networkplumber.org;
> Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Pavan Nikhilesh
> Bhagavatula <pbhagavatula@marvell.com>;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com
> Subject: [EXT] [PATCH v3 4/5] spinlock: use wfe to reduce contention on
> aarch64
> 
> In acquiring a spinlock, cores repeatedly poll the lock variable.
> This is replaced by rte_wait_until_equal API.
> 
> 5~10% performance gain was measured by running spinlock_autotest on
> 14 isolated cores of ThunderX2.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  .../common/include/arch/arm/rte_spinlock.h         | 25
> ++++++++++++++++++++++
>  .../common/include/generic/rte_spinlock.h          |  2 +-
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> index 1a6916b..f25d17f 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> @@ -16,6 +16,31 @@ extern "C" {
>  #include <rte_common.h>
>  #include "generic/rte_spinlock.h"
> 
> +/* armv7a does support WFE, but an explicit wake-up signal using SEV is
> + * required (must be preceded by DSB to drain the store buffer) and
> + * this is less performant, so keep armv7a implementation unchanged.
> + */
> +#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)

See below. Please avoid complicated conditional compilation logic, for the sake of scalability and readability.

> +static inline void
> +rte_spinlock_lock(rte_spinlock_t *sl)
> +{
> +	unsigned int tmp;
> +	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
> +	 * faqs/ka16809.html
> +	 */
> +	asm volatile(
> +		"sevl\n"
> +		"1:	wfe\n"
> +		"2:	ldaxr %w[tmp], %w[locked]\n"
> +		"cbnz   %w[tmp], 1b\n"
> +		"stxr   %w[tmp], %w[one], %w[locked]\n"
> +		"cbnz   %w[tmp], 2b\n"
> +		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
> +		: [one] "r" (1)
> +		: "cc", "memory");
> +}
> +#endif
> +
>  static inline int rte_tm_supported(void)  {
>  	return 0;
> diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> b/lib/librte_eal/common/include/generic/rte_spinlock.h
> index 87ae7a4..cf4f15b 100644
> --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> @@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)  static inline void
> rte_spinlock_lock(rte_spinlock_t *sl);
> 
> -#ifdef RTE_FORCE_INTRINSICS
> +#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)

I would like to avoid hacking generic code to meet a specific requirement.
For example, if someone enables RTE_USE_WFE for armv7, it will break,
and it will be painful for a new architecture to use RTE_FORCE_INTRINSICS.

Looks like the time has come to disable RTE_FORCE_INTRINSICS for arm64.

Since this patch is targeted for the next release, how about enabling a native
implementation for arm64 of the code that uses RTE_FORCE_INTRINSICS (spinlock, ticketlock), as on x86?
If you guys don't have the bandwidth to convert all blocks, let us know; we can collaborate,
and Marvell can take up some of the RTE_FORCE_INTRINSICS conversion for the next release.


>  static inline void
>  rte_spinlock_lock(rte_spinlock_t *sl)
>  {
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64 Gavin Hu
  2019-07-23 18:05   ` Stephen Hemminger
@ 2019-07-24 12:25   ` " Jerin Jacob Kollanukkaran
  1 sibling, 0 replies; 163+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-07-24 12:25 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, thomas, stephen, Pavan Nikhilesh Bhagavatula, Honnappa.Nagarahalli

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Tuesday, July 23, 2019 9:14 PM
> To: dev@dpdk.org
> Cc: nd@arm.com; thomas@monjalon.net; stephen@networkplumber.org;
> Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Pavan Nikhilesh
> Bhagavatula <pbhagavatula@marvell.com>;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com
> Subject: [EXT] [PATCH v3 5/5] config: add WFE config entry for aarch64
> 
> Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
> It can be enabled selectively based on the performance benchmarking.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> 
> +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs, #
> +calling these APIs put the cores enter low power state while waiting #
> +for the memory address to be become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_USE_WFE=n

# If it is specific to Arm and none of the other architectures support it, then
I would like to change the config to CONFIG_RTE_ARM_USE_WFE.
# Even if it is disabled, have the =n entry in config/common_base, so that
all supported configs in DPDK are known in one place.
# Arrange all CONFIG_RTE_ARM_* entries together in config/common_base.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-07-23 19:10     ` Honnappa Nagarahalli
@ 2019-07-24 17:59       ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-24 17:59 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Stephen Hemminger
  Cc: dev, nd, thomas, jerinj, pbhagavatula, nd

Hi Stephen,
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Wednesday, July 24, 2019 3:10 AM
> To: Stephen Hemminger <stephen@networkplumber.org>; Gavin Hu (Arm
> Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; thomas@monjalon.net;
> jerinj@marvell.com; pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 5/5] config: add WFE config entry for aarch64
> 
> >
> > On Tue, 23 Jul 2019 23:43:46 +0800
> > Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > > Add the RTE_USE_WFE configuration entry for aarch64, disabled by
> default.
> > > It can be enabled selectively based on the performance benchmarking.
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > > ---
> > >  config/arm/meson.build     | 1 +
> > >  config/common_armv8a_linux | 6 ++++++
> > >  2 files changed, 7 insertions(+)
> > >
> > > diff --git a/config/arm/meson.build b/config/arm/meson.build index
> > > 979018e..496813a 100644
> > > --- a/config/arm/meson.build
> > > +++ b/config/arm/meson.build
> > > @@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa,
> > > machine_args_generic]
> > >  impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
> > >
> > >  dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
> > > +dpdk_conf.set('RTE_USE_WFE', 0)
> > >
> > >  if not dpdk_conf.get('RTE_ARCH_64')
> > >  dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64) diff --git
> > > a/config/common_armv8a_linux b/config/common_armv8a_linux index
> > > 481712e..48c7ab5 100644
> > > --- a/config/common_armv8a_linux
> > > +++ b/config/common_armv8a_linux
> > > @@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
> > >
> > >  CONFIG_RTE_FORCE_INTRINSICS=y
> > >
> > > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > > +# calling these APIs put the cores enter low power state while
> > > +waiting # for the memory address to be become equal to the expected
> value.
> > > +# This is supported only by aarch64.
> > > +CONFIG_RTE_USE_WFE=n
> > > +
> > >  # Maximum available cache line size in arm64 implementations.
> > >  # Setting to maximum available cache line size in generic config  #
> > > to address minimum DMA alignment across all arm64 implementations.
> >
> > Introducing config options is a maintenance nightmare.
> > How are distributions supposed to ship a package?
> > Does full regression test get done on both options?
> >
> > The user should not be able to change this.
> Agree with these concerns here. In our tests, we are finding that this patch
> does not result in performance improvements on all micro-architectures.
> May be these micro-architectures will evolve in the future knowing that
> WFE is being used in DPDK. But at this point, it does not make sense to
> enable this by default. This means additional testing/regression with the flag
> enabled. We could add this to Travis build (Travis yml file).
> 
> Currently, this patch will address use cases where the target
> hardware/environment is known during compilation.
In our testing, such as running testpmd and packet_ordering (for WFE ring benchmarking), it showed neither improvement nor degradation in performance.
For some micro-benchmarks, it sometimes showed slight improvements; no degradation was seen.
The added benefit of the patch set is power saving, but that is not a primary concern in DPDK, and we lack ways to measure it.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
  2019-07-24 11:52   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
@ 2019-07-24 18:10     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-24 18:10 UTC (permalink / raw)
  To: jerinj, dev
  Cc: nd, thomas, stephen, Pavan Nikhilesh Bhagavatula, Honnappa Nagarahalli

Hi Jerin,
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Wednesday, July 24, 2019 7:53 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; Pavan Nikhilesh Bhagavatula
> <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
> 
> > -----Original Message-----
> > From: Gavin Hu <gavin.hu@arm.com>
> > Sent: Tuesday, July 23, 2019 9:14 PM
> > To: dev@dpdk.org
> > Cc: nd@arm.com; thomas@monjalon.net; stephen@networkplumber.org;
> > Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Pavan Nikhilesh
> > Bhagavatula <pbhagavatula@marvell.com>;
> > Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com
> > Subject: [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
> >
> > The rte_wait_until_equalxx APIs abstract the functionality of 'polling for a
> > memory location to become equal to a given value'.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  .../common/include/arch/arm/rte_atomic_64.h        |   4 +
> >  .../common/include/arch/arm/rte_pause_64.h         | 106
> > +++++++++++++++++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
> >  3 files changed, 148 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > index 97060e4..8d742c6 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > @@ -15,8 +15,12 @@ extern "C" {
> >
> >  #include "generic/rte_atomic.h"
> >
> > +#ifndef dsb
> >  #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
> > +#endif
> > +#ifndef dmb
> >  #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
> > +#endif
> 
> Is this change required? Please fix the root cause.
> I do see some build error too.
> 
> In file included from /home/jerin/dpdk.org/build/include/rte_pause_64.h:13,
>                  from /home/jerin/dpdk.org/build/include/rte_pause.h:13,
>                  from
> /home/jerin/dpdk.org/build/include/generic/rte_spinlock.h:25,
>                  from /home/jerin/dpdk.org/build/include/rte_spinlock.h:17,
>                  from /home/jerin/dpdk.org/drivers/bus/fslmc/mc/mc_sys.c:10:
> /home/jerin/dpdk.org/build/include/generic/rte_pause.h: In function
> 'rte_wait_until_equal16':
> /home/jerin/dpdk.org/build/include/generic/rte_pause.h:44:49: error:
> macro "dmb" passed 1 arguments, but takes just 0
>    44 |  __rte_wait_until_equal(addr, expected, memorder);
> 
> Command to reproduce(gcc 9.1)
> 
> rm -rf build && unset RTE_KERNELDIR && make -j  T=arm64-thunderx-linux-
> gcc  CROSS=aarch64-linux-gnu- && sed -ri
> 's,(CONFIG_RTE_KNI_KMOD=)y,\1n,' build/.config && sed -ri
> 's,(CONFIG_RTE_LIBRTE_VHOST_NUMA=)y,\1n,' build/.config &&  sed -ri
> 's,(CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=)y,\1n,' build/.config &&
> sed -ri  's,(CONFIG_RTE_EAL_IGB_UIO=)y,\1n,' build/.config && CC="ccache
> gcc" make -j  test-build CROSS=aarch64-linux-gnu-
> 
> 
> >
> >  #define rte_mb() dsb(sy)
> >
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..1f7be0a 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -17,6 +17,112 @@ static inline void rte_pause(void)
> >  	asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_USE_WFE
> 
> Do we need it to be under RTE_USE_WFE? If there is no harm, no need to
> add
> Conditional compilation to detect build errors, especially config is disabled
> by default.
> 
> > +/* Wait for *addr to be updated with expected value */
> > +static __rte_always_inline void
> > +rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected,
> > +		int memorder)
> > +{
> > +	uint16_t tmp;
> > +	if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile(
> > +			"ldxrh	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldxrh	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +	else
> > +		asm volatile(
> > +			"ldaxrh %w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldaxrh	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
> > +		int memorder)
> > +{
> > +	uint32_t tmp;
> > +	if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile(
> > +			"ldxr	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldxr	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +	else
> > +		asm volatile(
> > +			"ldaxr  %w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldaxr  %w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected,
> > +		int memorder)
> > +{
> > +	uint64_t tmp;
> > +	if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile(
> > +			"ldxr	%x[tmp], %x[addr]\n"
> > +			"cmp	%x[tmp], %x[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldxr	%x[tmp], %x[addr]\n"
> > +			"cmp	%x[tmp], %x[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +	else
> > +		asm volatile(
> > +			"ldaxr  %x[tmp], %x[addr]\n"
> > +			"cmp	%x[tmp], %x[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldaxr  %x[tmp], %x[addr]\n"
> > +			"cmp	%x[tmp], %x[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> 
> Duplication of code. Please introduce a macro for assembly Skelton.
> Something like
> 
> http://patches.dpdk.org/patch/56949/
> 
> > +}
> > +
> > +#endif
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h
> > b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..8f5f025 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -4,7 +4,6 @@
> >
> >  #ifndef _RTE_PAUSE_H_
> >  #define _RTE_PAUSE_H_
> > -
> >  /**
> >   * @file
> >   *
> > @@ -12,6 +11,10 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +23,38 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +#if !defined(RTE_USE_WFE)
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +#define __rte_wait_until_equal(addr, expected, memorder) do {\
> > +	while (__atomic_load_n(addr, memorder) != expected) \
> > +		rte_pause();\
> > +} while (0)
> > +#else
> > +#define __rte_wait_until_equal(addr, expected, memorder) do {\
> > +	while (*addr != expected)\
> > +		rte_pause();\
> > +	if (memorder != __ATOMIC_RELAXED)\
> > +		rte_smp_rmb();\
> 
> Is this correct with respect to all memorder values?
> If there is no specific gain from the non-C11 memory model, let's have only
> the C11 memory model, i.e. remove the RTE_USE_C11_MEM_MODEL branch.

I am looking into all your comments (thanks!) and will submit a new version.

> > +} while (0)
> > +#endif
> > +
> 
> Spotted a public API. Let's have a prototype with very good documentation
> of the API details.
> 
> 
> > +static __rte_always_inline void
> > +rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected,
> > +		int memorder)
> > +{
> > +	__rte_wait_until_equal(addr, expected, memorder);
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
> > +		int memorder)
> > +{
> > +	__rte_wait_until_equal(addr, expected, memorder);
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected,
> > +		int memorder)
> > +{
> > +	__rte_wait_until_equal(addr, expected, memorder);
> > +}
> > +#endif /* RTE_USE_WFE */
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4
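
On the memory-ordering question raised above: if only the C11 builtins are
used, the ordering is carried by the load itself and no separate
rte_smp_rmb() is needed. The following is a minimal sketch of that idea, not
code from the patch. The names wait_until_equal32(), payload, ready and
run_message_passing() are hypothetical, sched_yield() stands in for
rte_pause(), and POSIX threads are assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t payload;   /* plain data published by the writer */
static uint32_t ready;     /* flag polled by the reader */

/* Hypothetical stand-in for the generic (non-WFE) wait loop. */
static inline void
wait_until_equal32(uint32_t *addr, uint32_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		sched_yield();   /* stand-in for rte_pause() */
}

static void *
writer(void *arg)
{
	(void)arg;
	payload = 42;                                   /* write the data first */
	__atomic_store_n(&ready, 1, __ATOMIC_RELEASE);  /* then publish the flag */
	return NULL;
}

/* Returns the payload observed once the acquire wait has completed. */
static uint32_t
run_message_passing(void)
{
	pthread_t t;
	uint32_t seen;
	int rc;

	payload = 0;
	__atomic_store_n(&ready, 0, __ATOMIC_RELAXED);

	rc = pthread_create(&t, NULL, writer, NULL);
	assert(rc == 0);
	(void)rc;

	/* The acquire load pairs with the writer's release store. */
	wait_until_equal32(&ready, 1, __ATOMIC_ACQUIRE);
	seen = payload;   /* ordered after the flag load */

	pthread_join(t, NULL);
	return seen;
}
```

With __ATOMIC_RELAXED in place of __ATOMIC_ACQUIRE the loop would still
terminate, but the read of payload would no longer be ordered after the
load of the flag.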


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v4 0/6] use WFE for locks and ring on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (17 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-08-22  6:12 ` Gavin Hu
  2019-10-16  8:08   ` David Marchand
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 1/6] bus/fslmc: fix the conflicting dmb function Gavin Hu
                   ` (62 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-08-22  6:12 UTC (permalink / raw)
  To: dev; +Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
allows the CPU core to enter a low-power state until it is woken up by an
update to the memory location being polled, thus reducing the cache and
memory transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a polling loop
with rte_pause, so there are no performance differences.

For Arm platforms, use of WFE can be configured using the
CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (sev) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_ARM_USE_WFE should be enabled depending on the results of
performance benchmarking on the target platforms. Power saving should be a
bonus, but currently we don't have ways to characterize that.

V4:
- rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for assembly skeleton to reduce the duplication of code
- add one patch for nxp fslmc to address a compiling error
V3:
- Convert RFCs to patches
V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline
V1:
- Add the new APIs and use it for ring and locks

Gavin Hu (6):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  spinlock: use wfe to reduce contention on aarch64
  config: add WFE config entry for aarch64

 config/arm/meson.build                             |  1 +
 config/common_base                                 |  6 +++++
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  | 10 +++++---
 drivers/bus/fslmc/mc/mc_sys.c                      |  3 +--
 .../common/include/arch/arm/rte_pause_64.h         | 30 ++++++++++++++++++++++
 .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 26 ++++++++++++++++++-
 .../common/include/generic/rte_ticketlock.h        |  3 +--
 lib/librte_ring/rte_ring_c11_mem.h                 |  4 +--
 lib/librte_ring/rte_ring_generic.h                 |  3 +--
 10 files changed, 99 insertions(+), 12 deletions(-)

-- 
2.7.4
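
To make the cost described in the cover letter concrete, here is a small
single-threaded sketch of the generic polling form. It is an illustration
under stated assumptions, not the DPDK implementation: pause_hint(),
shared_word and count_failed_polls() are hypothetical names, and pause_hint()
also plays the role of "another core" by writing the awaited value on its
third call so that the example is deterministic.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t shared_word;       /* location being polled */
static unsigned int pause_calls;   /* how often the loop had to back off */

/* Hypothetical stand-in for rte_pause(). It counts its invocations and,
 * on the third one, simulates the remote writer updating the location. */
static void
pause_hint(void)
{
	if (++pause_calls == 3)
		__atomic_store_n(&shared_word, 7, __ATOMIC_RELAXED);
}

/* Generic form: every failed comparison costs one more load of *addr. */
static void
wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		pause_hint();
}

/* Returns the number of failed polls before the value matched. */
static unsigned int
count_failed_polls(void)
{
	shared_word = 0;
	pause_calls = 0;
	wait_until_equal_relaxed_16(&shared_word, 7);
	assert(shared_word == 7);
	return pause_calls;
}
```

Each failed poll is a load of the shared location. The WFE variant aims to
replace those repeated loads with a wait for an event on the monitored
address.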


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v4 1/6] bus/fslmc: fix the conflicting dmb function
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (18 preceding siblings ...)
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 0/6] use WFE for locks and ring on aarch64 Gavin Hu
@ 2019-08-22  6:12 ` Gavin Hu
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 2/6] eal: add the APIs to wait until equal Gavin Hu
                   ` (61 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-08-22  6:12 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli, stable

There are two definitions of dmb that conflict with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to include rte_spinlock.h before the other header files, which
is in line with the coding style [2] on header includes. The fix also
changes the dmb macro to take an argument, so that it is meaningful on Arm.

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@i
VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u
[2] https://doc.dpdk.org/guides/contributing/coding_style.html

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 10 +++++++---
 drivers/bus/fslmc/mc/mc_sys.c     |  3 +--
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..fe9dc95 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -33,10 +33,14 @@ struct fsl_mc_io {
 #include <linux/byteorder/little_endian.h>
 
 #ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
+#ifdef RTE_ARCH_ARM64
+#define dmb(opt) {asm volatile("dmb " #opt : : : "memory"); }
+#else
+#define dmb(opt)
 #endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#endif
+#define __iormb()	dmb(ld)
+#define __iowmb()	dmb(st)
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
diff --git a/drivers/bus/fslmc/mc/mc_sys.c b/drivers/bus/fslmc/mc/mc_sys.c
index efafdc3..22143ef 100644
--- a/drivers/bus/fslmc/mc/mc_sys.c
+++ b/drivers/bus/fslmc/mc/mc_sys.c
@@ -4,11 +4,10 @@
  * Copyright 2017 NXP
  *
  */
+#include <rte_spinlock.h>
 #include <fsl_mc_sys.h>
 #include <fsl_mc_cmd.h>
 
-#include <rte_spinlock.h>
-
 /** User space framework uses MC Portal in shared mode. Following change
  * introduces lock in MC FLIB
  */
-- 
2.7.4
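
The parameterised dmb(opt) macro in the diff above relies on preprocessor
stringification. A minimal sketch of that mechanism follows. DMB_TEXT(),
iormb(), iowmb(), self_check() and touch_device() are hypothetical names
used only for illustration, and __aarch64__ is used here in place of
RTE_ARCH_ARM64 so the sketch builds without the DPDK config header.

```c
#include <assert.h>
#include <string.h>

/* "#opt" turns the macro argument into a string literal, which the
 * compiler then concatenates with "dmb ". */
#define DMB_TEXT(opt) "dmb " #opt

#if defined(__aarch64__)
#define dmb(opt) { __asm__ volatile(DMB_TEXT(opt) : : : "memory"); }
#else
#define dmb(opt) /* expands to nothing, as in the non-arm64 branch of the patch */
#endif

#define iormb() dmb(ld)   /* barrier for loads */
#define iowmb() dmb(st)   /* barrier for stores */

/* Checks the instruction text that the stringification produces. */
static void
self_check(void)
{
	assert(strcmp(DMB_TEXT(ld), "dmb ld") == 0);
	assert(strcmp(DMB_TEXT(st), "dmb st") == 0);
}

/* Reads a register-like location, then writes it back incremented,
 * with the barriers placed after the read and after the write. */
static int
touch_device(volatile int *reg)
{
	int v = *reg;

	iormb();
	*reg = v + 1;
	iowmb();
	return *reg;
}
```

Because the argument is pasted into the instruction text, an invalid option
only shows up as an assembler error on arm64 builds; on other architectures
the macro expands to nothing.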


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v4 2/6] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (19 preceding siblings ...)
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 1/6] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-08-22  6:12 ` Gavin Hu
  2019-09-11 12:26   ` Jerin Jacob
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 3/6] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (60 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-08-22  6:12 UTC (permalink / raw)
  To: dev; +Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli

The rte_wait_until_equalxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_pause_64.h         | 30 ++++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 26 ++++++++++++++++++-
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..dabde17 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -17,6 +18,35 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_ARM_USE_WFE
+#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
+static __rte_always_inline void \
+rte_wait_until_equal_##name(volatile type * addr, type expected) \
+{ \
+	type tmp; \
+	asm volatile( \
+		#asm_op " %" #wide "[tmp], %[addr]\n" \
+		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
+		"b.eq	2f\n" \
+		"sevl\n" \
+		"1:	wfe\n" \
+		#asm_op " %" #wide "[tmp], %[addr]\n" \
+		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
+		"bne	1b\n" \
+		"2:\n" \
+		: [tmp] "=&r" (tmp) \
+		: [addr] "Q"(*addr), [expected] "r"(expected) \
+		: "cc", "memory"); \
+}
+/* Wait for *addr to be updated with expected value */
+__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
+__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
+__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
+__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
+__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
+__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..4741f8a 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,10 +1,10 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
 #define _RTE_PAUSE_H_
-
 /**
  * @file
  *
@@ -12,6 +12,10 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +24,24 @@
  */
 static inline void rte_pause(void);
 
+#if !defined(RTE_ARM_USE_WFE)
+#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
+__rte_always_inline \
+static void	\
+rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
+	type expected) \
+{ \
+	while (__atomic_load_n(addr, memorder) != expected) \
+		rte_pause(); \
+}
+
+/* Wait for *addr to be updated with expected value */
+__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
+__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
+__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
+#endif /* RTE_ARM_USE_WFE */
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v4 3/6] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (20 preceding siblings ...)
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 2/6] eal: add the APIs to wait until equal Gavin Hu
@ 2019-08-22  6:12 ` Gavin Hu
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 4/6] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (59 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-08-22  6:12 UTC (permalink / raw)
  To: dev; +Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli

While using a ticket lock, cores repeatedly poll the lock variable.
This polling loop is replaced by the rte_wait_until_equal API.

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1],
there were variances between runs, but no notable performance gain or
degradation was seen with or without this patch.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/\
docs/440/neoverse-n1-sdp

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..232bbe9 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_acquire_16(&tl->s.current, me);
 }
 
 /**
-- 
2.7.4
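
The change above can be tried outside DPDK with a portable stand-in for the
wait helper. This is a sketch under assumptions, not the rte_ticketlock
code: struct ticketlock, the helper names and run_ticketlock_demo() are
hypothetical, sched_yield() replaces rte_pause(), and POSIX threads are
assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

struct ticketlock {
	uint16_t current;   /* ticket currently being served */
	uint16_t next;      /* next ticket to hand out */
};

/* Portable stand-in for rte_wait_until_equal_acquire_16(). */
static inline void
wait_until_equal_acquire_16(uint16_t *addr, uint16_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		sched_yield();
}

static inline void
ticketlock_lock(struct ticketlock *tl)
{
	/* Take a ticket, then wait until it is being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	wait_until_equal_acquire_16(&tl->current, me);
}

static inline void
ticketlock_unlock(struct ticketlock *tl)
{
	uint16_t cur = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(cur + 1), __ATOMIC_RELEASE);
}

#define N_THREADS 4
#define N_ITERS   2000

static struct ticketlock lock;
static unsigned long counter;   /* protected by lock */

static void *
worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < N_ITERS; i++) {
		ticketlock_lock(&lock);
		counter++;
		ticketlock_unlock(&lock);
	}
	return NULL;
}

/* Returns the final counter value after all workers have finished. */
static unsigned long
run_ticketlock_demo(void)
{
	pthread_t t[N_THREADS];

	lock.current = 0;
	lock.next = 0;
	counter = 0;
	for (int i = 0; i < N_THREADS; i++) {
		int rc = pthread_create(&t[i], NULL, worker, NULL);

		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < N_THREADS; i++)
		pthread_join(t[i], NULL);
	return counter;
}
```

The acquire ordering on the wait is what makes the increments of counter in
one critical section visible to the next lock holder.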


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v4 4/6] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (21 preceding siblings ...)
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 3/6] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-08-22  6:12 ` " Gavin Hu
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 5/6] spinlock: use wfe to reduce contention " Gavin Hu
                   ` (58 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-08-22  6:12 UTC (permalink / raw)
  To: dev; +Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli

Instead of polling for the tail to be updated, use the wfe instruction.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 4 ++--
 lib/librte_ring/rte_ring_generic.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..764d8f1 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal_relaxed_32(&ht->tail, old_val);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..6828527 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal_relaxed_32(&ht->tail, old_val);
 
 	ht->tail = new_val;
 }
-- 
2.7.4
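
The ordering requirement behind update_tail() can be shown with two
producers that have reserved adjacent ranges. The sketch below is
hypothetical and much simpler than rte_ring: struct headtail, late_producer()
and run_update_tail_demo() are illustrative names, the wait helper is a
portable stand-in for rte_wait_until_equal_relaxed_32(), and POSIX threads
are assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

struct headtail {
	uint32_t head;
	uint32_t tail;
};

/* Portable stand-in for rte_wait_until_equal_relaxed_32(). */
static inline void
wait_until_equal_relaxed_32(uint32_t *addr, uint32_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		sched_yield();
}

/* A producer may publish its range only after every earlier range has
 * been published, i.e. once tail has reached its own starting index. */
static void
update_tail(struct headtail *ht, uint32_t old_val, uint32_t new_val,
		int single)
{
	if (!single)
		wait_until_equal_relaxed_32(&ht->tail, old_val);

	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}

static struct headtail prod;

/* Owns slots [4, 8): it has to wait until tail reaches 4. */
static void *
late_producer(void *arg)
{
	(void)arg;
	update_tail(&prod, 4, 8, 0);
	return NULL;
}

/* Returns the final tail after both producers have published. */
static uint32_t
run_update_tail_demo(void)
{
	pthread_t t;
	int rc;

	prod.head = 8;   /* both ranges are already reserved */
	prod.tail = 0;

	rc = pthread_create(&t, NULL, late_producer, NULL);
	assert(rc == 0);
	(void)rc;

	update_tail(&prod, 0, 4, 0);   /* owns slots [0, 4) */
	pthread_join(t, NULL);
	return prod.tail;
}
```

Whichever thread runs first, the tail moves 0 -> 4 -> 8, because the later
producer cannot store its new tail until the earlier one has done so.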


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v4 5/6] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (22 preceding siblings ...)
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 4/6] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-08-22  6:12 ` " Gavin Hu
       [not found]   ` <CY4PR1801MB1863AF9695BB10930E817D78DEB00@CY4PR1801MB1863.namprd18.prod.outlook.com>
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 6/6] config: add WFE config entry for aarch64 Gavin Hu
                   ` (57 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-08-22  6:12 UTC (permalink / raw)
  To: dev; +Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli

When acquiring a spinlock, cores repeatedly poll the lock variable.
This polling is replaced by a wfe-based wait loop.

Running the micro benchmarking and the testpmd and l3fwd traffic tests
on ThunderX2, Ampere eMAG80 and Arm N1SDP, everything went well and no
notable performance gain or degradation was measured.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..7b8328e 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,31 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#ifndef RTE_FORCE_INTRINSICS
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"sevl\n"
+		"1:	wfe\n"
+		"2:	ldaxr %w[tmp], %w[locked]\n"
+		"cbnz   %w[tmp], 1b\n"
+		"stxr   %w[tmp], %w[one], %w[locked]\n"
+		"cbnz   %w[tmp], 2b\n"
+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
+		: [one] "r" (1)
+		: "cc", "memory");
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
-- 
2.7.4
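
For readers without an aarch64 machine, the control flow of the inline
assembly above can be approximated with the compiler builtins. This is only
a rough portable analogue under stated assumptions, not an equivalent of the
ldaxr/stxr sequence: the read-only inner loop corresponds to the wfe wait,
the compare-and-exchange to the exclusive store, struct spinlock and the
function names are hypothetical, sched_yield() replaces the wait, and POSIX
threads are assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

struct spinlock {
	int locked;   /* 0 = free, 1 = taken */
};

static inline void
spinlock_lock(struct spinlock *sl)
{
	int expected;

	for (;;) {
		/* Wait, read-only, until the lock looks free. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED) != 0)
			sched_yield();

		/* Try to take it; start over if another core won the race. */
		expected = 0;
		if (__atomic_compare_exchange_n(&sl->locked, &expected, 1, 0,
				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return;
	}
}

static inline void
spinlock_unlock(struct spinlock *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}

#define N_THREADS 4
#define N_ITERS   2000

static struct spinlock lock;
static unsigned long counter;   /* protected by lock */

static void *
worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < N_ITERS; i++) {
		spinlock_lock(&lock);
		counter++;
		spinlock_unlock(&lock);
	}
	return NULL;
}

/* Returns the final counter value after all workers have finished. */
static unsigned long
run_spinlock_demo(void)
{
	pthread_t t[N_THREADS];

	lock.locked = 0;
	counter = 0;
	for (int i = 0; i < N_THREADS; i++) {
		int rc = pthread_create(&t[i], NULL, worker, NULL);

		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < N_THREADS; i++)
		pthread_join(t[i], NULL);
	return counter;
}
```

Keeping the wait loop read-only means waiting cores do not write to the
lock's cache line until the lock is observed free.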


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v4 6/6] config: add WFE config entry for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (23 preceding siblings ...)
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 5/6] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-08-22  6:12 ` Gavin Hu
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 0/8] use WFE " Gavin Hu
                   ` (56 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-08-22  6:12 UTC (permalink / raw)
  To: dev; +Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli

Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled by default.
It can be enabled selectively based on performance benchmarking.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 config/arm/meson.build | 1 +
 config/common_base     | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..18ecd53 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
 impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
 
 dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
+dpdk_conf.set('RTE_ARM_USE_WFE', 0)
 
 if not dpdk_conf.get('RTE_ARCH_64')
 	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
diff --git a/config/common_base b/config/common_base
index 8ef75c2..d4cf974 100644
--- a/config/common_base
+++ b/config/common_base
@@ -570,6 +570,12 @@ CONFIG_RTE_CRYPTO_MAX_DEVS=64
 CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
 CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
 
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs;
+# calling these APIs puts the cores in a low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
+
 #
 # Compile NXP CAAM JR crypto Driver
 #
-- 
2.7.4
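
One point worth checking when reviewing this entry is how a macro that is
defined with the value 0 interacts with the #ifdef RTE_ARM_USE_WFE test used
in patch 2/6. The sketch below assumes that the meson line
dpdk_conf.set('RTE_ARM_USE_WFE', 0) ends up as "#define RTE_ARM_USE_WFE 0"
in the generated build configuration; that assumption should be verified
against the actual build output. The function names are hypothetical.

```c
#include <assert.h>

/* Assumption: this is what the generated config header contains. */
#define RTE_ARM_USE_WFE 0

/* Mirrors an "#ifdef" style test: true whenever the macro is defined. */
static int
wfe_selected_by_ifdef(void)
{
#ifdef RTE_ARM_USE_WFE
	return 1;
#else
	return 0;
#endif
}

/* Mirrors an "#if" style test: depends on the macro's value. */
static int
wfe_selected_by_if(void)
{
#if RTE_ARM_USE_WFE
	return 1;
#else
	return 0;
#endif
}

static void
self_check(void)
{
	assert(wfe_selected_by_ifdef() == 1);   /* defined, although 0 */
	assert(wfe_selected_by_if() == 0);      /* value is 0 */
}
```

If the assumption holds, a value of 0 would not disable a code path guarded
by #ifdef. Leaving the macro undefined, or testing its value with #if, are
two ways to get the intended default.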


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/6] eal: add the APIs to wait until equal
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 2/6] eal: add the APIs to wait until equal Gavin Hu
@ 2019-09-11 12:26   ` Jerin Jacob
  2019-09-12  8:25     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Jerin Jacob @ 2019-09-11 12:26 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, stephen, Jerin Jacob, Pavan Nikhilesh,
	Honnappa.Nagarahalli

On Thu, Aug 22, 2019 at 11:43 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> The rte_wait_until_equalxx APIs abstract the functionality of 'polling
> for a memory location to become equal to a given value'.
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  .../common/include/arch/arm/rte_pause_64.h         | 30 ++++++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  | 26 ++++++++++++++++++-
>  2 files changed, 55 insertions(+), 1 deletion(-)
>
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..dabde17 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
>
>  #ifndef _RTE_PAUSE_ARM64_H_
> @@ -17,6 +18,35 @@ static inline void rte_pause(void)
>         asm volatile("yield" ::: "memory");
>  }
>
> +#ifdef RTE_ARM_USE_WFE
> +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> +static __rte_always_inline void \
> +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> +{ \
> +       type tmp; \
> +       asm volatile( \
> +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> +               "b.eq   2f\n" \
> +               "sevl\n" \
> +               "1:     wfe\n" \
> +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> +               "bne    1b\n" \
> +               "2:\n" \
> +               : [tmp] "=&r" (tmp) \
> +               : [addr] "Q"(*addr), [expected] "r"(expected) \
> +               : "cc", "memory"); \
> +}
>
> +/* Wait for *addr to be updated with expected value */
> +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
>

This scheme doesn't allow writing Doxygen comments for the API.
Please change to a scheme where you can add Doxygen comments for each API
without code duplication. Something like

/**
 * Doxygen comment
 */
rte_wait_until_equal_relaxed_16(..)
{
        __WAIT_UNTIL_EQUAL(..)
}

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/6] eal: add the APIs to wait until equal
  2019-09-11 12:26   ` Jerin Jacob
@ 2019-09-12  8:25     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-09-12  8:25 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, nd, thomas, stephen, jerinj, Pavan Nikhilesh, Honnappa Nagarahalli

Hi Jerin,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, September 11, 2019 8:27 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; jerinj@marvell.com; Pavan Nikhilesh
> <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v4 2/6] eal: add the APIs to wait until equal
> 
> On Thu, Aug 22, 2019 at 11:43 AM Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > The rte_wait_until_equalxx APIs abstract the functionality of 'polling
> > for a memory location to become equal to a given value'.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  .../common/include/arch/arm/rte_pause_64.h         | 30
> ++++++++++++++++++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  | 26
> ++++++++++++++++++-
> >  2 files changed, 55 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..dabde17 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_ARM64_H_
> > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> >         asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_ARM_USE_WFE
> > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > +static __rte_always_inline void \
> > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > +{ \
> > +       type tmp; \
> > +       asm volatile( \
> > +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> > +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> > +               "b.eq   2f\n" \
> > +               "sevl\n" \
> > +               "1:     wfe\n" \
> > +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> > +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> > +               "bne    1b\n" \
> > +               "2:\n" \
> > +               : [tmp] "=&r" (tmp) \
> > +               : [addr] "Q"(*addr), [expected] "r"(expected) \
> > +               : "cc", "memory"); \
> > +}
> >
> > +/* Wait for *addr to be updated with expected value */
> > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> >
> 
> This scheme doesn't allow writing Doxygen comments for the API.
> Please change to a scheme where you can add Doxygen comments for each
> API without code duplication. Something like
Thanks for pointing this out, I will fix it in the next version.
> 
> /**
>  * Doxygen comment
>  */
> rte_wait_until_equal_relaxed_16(..)
> {
>         __WAIT_UNTIL_EQUAL(..)
> }
Following the other examples, adding declarations of the APIs at the beginning of the file, with the Doxygen comments above them, fixes this problem.
The implementations of the APIs do not need to change; please help review the v5 version.
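
The alternative described in this reply can be sketched as follows: the
Doxygen comments sit on forward declarations, and the definitions are still
generated by a macro. This is a hypothetical illustration rather than the
v5 code; DEFINE_WAIT_UNTIL_EQUAL(), the function names and
demo_declared_then_generated() are assumptions, and sched_yield() stands in
for rte_pause().

```c
#include <assert.h>
#include <sched.h>
#include <stdint.h>

/**
 * Wait for *addr to be updated with a 32-bit expected value,
 * with relaxed memory ordering.
 *
 * @param addr
 *   Pointer to the memory location being polled.
 * @param expected
 *   Value to wait for.
 */
static inline void
wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected);

/**
 * Wait for *addr to be updated with a 32-bit expected value,
 * with acquire memory ordering.
 *
 * @param addr
 *   Pointer to the memory location being polled.
 * @param expected
 *   Value to wait for.
 */
static inline void
wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected);

/* The definitions are generated; the names match the declarations above. */
#define DEFINE_WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
static inline void \
wait_until_equal_##op_name##_##size(volatile type *addr, type expected) \
{ \
	while (__atomic_load_n(addr, memorder) != expected) \
		sched_yield(); \
}

DEFINE_WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
DEFINE_WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)

/* Both calls return immediately because the value already matches. */
static int
demo_declared_then_generated(void)
{
	uint32_t v = 5;

	wait_until_equal_relaxed_32(&v, 5);
	wait_until_equal_acquire_32(&v, 5);
	assert(v == 5);
	return 1;
}
```

The trade-off is that the documented prototype and the generated definition
have to be kept in sync by hand, whereas the compiler only checks that their
signatures agree.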

^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v5 0/8] use WFE for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (24 preceding siblings ...)
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 6/6] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-09-12 11:24 ` " Gavin Hu
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 1/8] config: add WFE config entry " Gavin Hu
                   ` (55 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-12 11:24 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli

V5:
- add doxygen comments for the new APIs
- spinlock: exit early without wfe if the spinlock is not taken by others
V4:
- rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for assembly skeleton to reduce the duplication of code
- add one patch for nxp fslmc to address a compiling error
V3:
- Convert RFCs to patches
V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline
V1:
- Add the new APIs and use it for ring and locks

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
allows the CPU core to enter a low-power state until it is woken up by an
update to the memory location being polled, thus reducing the cache and
memory transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a polling loop
with rte_pause, so there are no performance differences.
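
As a rough illustration of that generic fallback, the loop can be sketched in
portable C as below. This is only a sketch: the sketch_ names are hypothetical
stand-ins, not the DPDK symbols, and the pause hint is replaced by a plain
compiler barrier so the example builds on any architecture with GCC or Clang.

```c
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): a real build would emit a CPU
 * hint (PAUSE on x86, YIELD on Arm); a compiler barrier is used here. */
static inline void sketch_pause(void)
{
	__asm__ volatile("" ::: "memory");
}

/* Poll *addr with relaxed loads until it equals expected. */
static inline void
sketch_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		sketch_pause();
}

/* Same loop, but every load has acquire ordering, so accesses that follow
 * the wait cannot be reordered before the load that saw expected. */
static inline void
sketch_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		sketch_pause();
}
```

If the location already holds the expected value, both helpers return after a
single load, which is the case exercised when no other thread is involved.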

For Arm platforms, use of WFE can be configured using the
CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (SEV) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_ARM_USE_WFE should be enabled depending on the performance
benchmarking on the target platforms. Power saving should be a bonus,
but currently we don't have ways to characterize that.

Gavin Hu (8):
  config: add WFE config entry for aarch64
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  spinlock: use wfe to reduce contention on aarch64
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |   1 +
 config/common_base                                 |   6 +
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |  10 +-
 drivers/bus/fslmc/mc/mc_sys.c                      |   3 +-
 drivers/event/opdl/opdl_ring.c                     |   5 +-
 drivers/net/thunderx/nicvf_rxtx.c                  |   3 +-
 .../common/include/arch/arm/rte_pause_64.h         |  30 +++++
 .../common/include/arch/arm/rte_spinlock.h         |  26 ++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 146 ++++++++++++++++++++-
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 lib/librte_ring/rte_ring_c11_mem.h                 |   4 +-
 lib/librte_ring/rte_ring_generic.h                 |   3 +-
 12 files changed, 223 insertions(+), 17 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH v5 1/8] config: add WFE config entry for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (25 preceding siblings ...)
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 0/8] use WFE " Gavin Hu
@ 2019-09-12 11:24 ` " Gavin Hu
  2019-09-12 15:48   ` Jerin Jacob
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 2/8] bus/fslmc: fix the conflicting dmb function Gavin Hu
                   ` (54 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-09-12 11:24 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli

Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
It can be enabled selectively based on the performance benchmarking.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 config/arm/meson.build | 1 +
 config/common_base     | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..18ecd53 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
 impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
 
 dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
+dpdk_conf.set('RTE_ARM_USE_WFE', 0)
 
 if not dpdk_conf.get('RTE_ARCH_64')
 	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
diff --git a/config/common_base b/config/common_base
index 8ef75c2..d4cf974 100644
--- a/config/common_base
+++ b/config/common_base
@@ -570,6 +570,12 @@ CONFIG_RTE_CRYPTO_MAX_DEVS=64
 CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
 CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
 
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs;
+# calling these APIs puts the cores in a low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
+
 #
 # Compile NXP CAAM JR crypto Driver
 #
-- 
2.7.4



* [dpdk-dev] [PATCH v5 2/8] bus/fslmc: fix the conflicting dmb function
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (26 preceding siblings ...)
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 1/8] config: add WFE config entry " Gavin Hu
@ 2019-09-12 11:24 ` Gavin Hu
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 3/8] eal: add the APIs to wait until equal Gavin Hu
                   ` (53 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-12 11:24 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, stable

There are two definitions of dmb that conflict with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to include the rte_spinlock.h file before the other header files,
which is in line with the coding style[2] on "header includes".
The fix also changes the macro to take an argument, so that it is
meaningful on Arm.
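
To make the effect of the new macro argument concrete, the sketch below shows
how the stringizing operator turns the argument into the barrier option. It is
an illustration only: SKETCH_DMB_TEXT and the two helper functions are
hypothetical names, and the macro yields just the instruction text rather than
executing it, so the result can be checked on any architecture.

```c
/* Hypothetical helper: builds the instruction text that a macro of the
 * form  asm volatile("dmb " #opt : : : "memory")  would hand to the
 * assembler. #opt stringizes the argument; adjacent literals are joined. */
#define SKETCH_DMB_TEXT(opt) "dmb " #opt

/* Read barrier text, as selected by dmb(ld). */
static inline const char *sketch_iormb_text(void)
{
	return SKETCH_DMB_TEXT(ld);	/* "dmb ld" */
}

/* Write barrier text, as selected by dmb(st). */
static inline const char *sketch_iowmb_text(void)
{
	return SKETCH_DMB_TEXT(st);	/* "dmb st" */
}
```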

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@i
VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u
[2] https://doc.dpdk.org/guides/contributing/coding_style.html

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 10 +++++++---
 drivers/bus/fslmc/mc/mc_sys.c     |  3 +--
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..fe9dc95 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -33,10 +33,14 @@ struct fsl_mc_io {
 #include <linux/byteorder/little_endian.h>
 
 #ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
+#ifdef RTE_ARCH_ARM64
+#define dmb(opt) {asm volatile("dmb " #opt : : : "memory"); }
+#else
+#define dmb(opt)
 #endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#endif
+#define __iormb()	dmb(ld)
+#define __iowmb()	dmb(st)
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
diff --git a/drivers/bus/fslmc/mc/mc_sys.c b/drivers/bus/fslmc/mc/mc_sys.c
index efafdc3..22143ef 100644
--- a/drivers/bus/fslmc/mc/mc_sys.c
+++ b/drivers/bus/fslmc/mc/mc_sys.c
@@ -4,11 +4,10 @@
  * Copyright 2017 NXP
  *
  */
+#include <rte_spinlock.h>
 #include <fsl_mc_sys.h>
 #include <fsl_mc_cmd.h>
 
-#include <rte_spinlock.h>
-
 /** User space framework uses MC Portal in shared mode. Following change
  * introduces lock in MC FLIB
  */
-- 
2.7.4



* [dpdk-dev] [PATCH v5 3/8] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (27 preceding siblings ...)
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 2/8] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-09-12 11:24 ` Gavin Hu
  2019-09-12 16:11   ` Jerin Jacob
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 4/8] spinlock: use wfe to reduce contention on aarch64 Gavin Hu
                   ` (52 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-09-12 11:24 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_pause_64.h         | 30 +++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 98 +++++++++++++++++++++-
 2 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..dabde17 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -17,6 +18,35 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_ARM_USE_WFE
+#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
+static __rte_always_inline void \
+rte_wait_until_equal_##name(volatile type * addr, type expected) \
+{ \
+	type tmp; \
+	asm volatile( \
+		#asm_op " %" #wide "[tmp], %[addr]\n" \
+		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
+		"b.eq	2f\n" \
+		"sevl\n" \
+		"1:	wfe\n" \
+		#asm_op " %" #wide "[tmp], %[addr]\n" \
+		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
+		"bne	1b\n" \
+		"2:\n" \
+		: [tmp] "=&r" (tmp) \
+		: [addr] "Q"(*addr), [expected] "r"(expected) \
+		: "cc", "memory"); \
+}
+/* Wait for *addr to be updated with expected value */
+__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
+__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
+__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
+__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
+__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
+__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..dfa6a53 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,10 +1,10 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
 #define _RTE_PAUSE_H_
-
 /**
  * @file
  *
@@ -12,6 +12,10 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +24,96 @@
  */
 static inline void rte_pause(void);
 
+/**
+ * Wait for *addr to be updated with expected value
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  An expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected);
+
+/**
+ * Wait for *addr to be updated with expected value
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  An expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected);
+
+/**
+ * Wait for *addr to be updated with expected value
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  An expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t expected);
+
+/**
+ * Wait for *addr to be updated with expected value
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  An expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t expected);
+
+/**
+ * Wait for *addr to be updated with expected value
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  An expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected);
+
+/**
+ * Wait for *addr to be updated with expected value
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  An expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t expected);
+
+#if !defined(RTE_ARM_USE_WFE)
+#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
+__rte_always_inline \
+static void	\
+rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
+	type expected) \
+{ \
+	while (__atomic_load_n(addr, memorder) != expected) \
+		rte_pause(); \
+}
+
+/* Wait for *addr to be updated with expected value */
+__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
+__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
+__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
+#endif /* RTE_ARM_USE_WFE */
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4



* [dpdk-dev] [PATCH v5 4/8] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (28 preceding siblings ...)
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 3/8] eal: add the APIs to wait until equal Gavin Hu
@ 2019-09-12 11:24 ` Gavin Hu
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 5/8] ticketlock: use new API " Gavin Hu
                   ` (51 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-12 11:24 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli

In acquiring a spinlock, cores repeatedly poll the lock variable.
This is replaced by rte_wait_until_equal API.

Running the micro benchmarks and the testpmd and l3fwd traffic tests
on ThunderX2, Ampere eMAG80 and Arm N1SDP, everything went well and no
notable performance gain or degradation was measured.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 26 ++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..b61c055 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,32 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#ifndef RTE_FORCE_INTRINSICS
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		/* sevl primes the event so that the first wfe does not block */
+		"sevl\n"
+		"1:	wfe\n"
+		"2:	ldaxr	%w[tmp], %[locked]\n"
+		"cbnz	%w[tmp], 1b\n"
+		"stxr	%w[tmp], %w[one], %[locked]\n"
+		"cbnz	%w[tmp], 2b\n"
+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
+		: [one] "r" (1)
+		: "memory");
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
-- 
2.7.4



* [dpdk-dev] [PATCH v5 5/8] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (29 preceding siblings ...)
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 4/8] spinlock: use wfe to reduce contention on aarch64 Gavin Hu
@ 2019-09-12 11:24 ` " Gavin Hu
  2019-09-12 16:14   ` Jerin Jacob
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 6/8] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (50 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-09-12 11:24 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli

While using ticket lock, cores repeatedly poll the lock variable.
This is replaced by rte_wait_until_equal API.
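
For readers unfamiliar with the structure, a minimal sketch of a ticket lock
built on such a wait helper follows. The sketch_ types and functions are
hypothetical and only mirror the next/current split used by the patch; the
wait helper is the generic polling form, not the WFE one.

```c
#include <stdint.h>

/* Hypothetical ticket lock: "next" hands out tickets, "current" is the
 * ticket being served. */
struct sketch_ticketlock {
	uint16_t current;
	uint16_t next;
};

/* Generic polling form of the wait helper (acquire ordering). */
static inline void
sketch_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		__asm__ volatile("" ::: "memory");
}

/* Take a ticket, then wait until that ticket is being served. */
static inline void sketch_ticketlock_lock(struct sketch_ticketlock *tl)
{
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	sketch_wait_until_equal_acquire_16(&tl->current, me);
}

/* Serve the next ticket; release ordering publishes the critical section. */
static inline void sketch_ticketlock_unlock(struct sketch_ticketlock *tl)
{
	uint16_t i = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(i + 1), __ATOMIC_RELEASE);
}
```

Each waiter polls for one specific value of "current", which is why the lock
maps naturally onto a wait-until-equal primitive.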

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1],
there were variances between runs, but no notable performance gain or
degradation was seen with and without this patch.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/\
docs/440/neoverse-n1-sdp

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..232bbe9 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_acquire_16(&tl->s.current, me);
 }
 
 /**
-- 
2.7.4



* [dpdk-dev] [PATCH v5 6/8] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (30 preceding siblings ...)
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 5/8] ticketlock: use new API " Gavin Hu
@ 2019-09-12 11:24 ` " Gavin Hu
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 7/8] net/thunderx: use new API to save cycles " Gavin Hu
                   ` (49 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-12 11:24 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli

Instead of polling for tail to be updated, use wfe instruction.
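
The ordering requirement behind this wait can be sketched as follows: with
several producers, each one may publish its tail only after all earlier
producers have published theirs. The struct and function names below are
hypothetical, and the wait is written as the generic polling loop rather than
the WFE variant.

```c
#include <stdint.h>

/* Hypothetical head/tail pair, loosely modelled on a ring's producer state. */
struct sketch_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

/* Publish new_val as the tail. When more than one thread may update the
 * tail (single == 0), first wait until the tail reaches old_val, i.e.
 * until every earlier updater has finished. */
static inline void
sketch_update_tail(struct sketch_headtail *ht, uint32_t old_val,
		uint32_t new_val, int single)
{
	if (!single)
		while (__atomic_load_n(&ht->tail, __ATOMIC_RELAXED) != old_val)
			__asm__ volatile("" ::: "memory");

	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```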

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 4 ++--
 lib/librte_ring/rte_ring_generic.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..764d8f1 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal_relaxed_32(&ht->tail, old_val);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..6828527 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal_relaxed_32(&ht->tail, old_val);
 
 	ht->tail = new_val;
 }
-- 
2.7.4



* [dpdk-dev] [PATCH v5 7/8] net/thunderx: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (31 preceding siblings ...)
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 6/8] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-09-12 11:24 ` " Gavin Hu
  2019-09-12 16:15   ` Jerin Jacob
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 8/8] event/opdl: " Gavin Hu
                   ` (48 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-09-12 11:24 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/thunderx/nicvf_rxtx.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
index 1c42874..16f34c6 100644
--- a/drivers/net/thunderx/nicvf_rxtx.c
+++ b/drivers/net/thunderx/nicvf_rxtx.c
@@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
 		ltail++;
 	}
 
-	while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
-		rte_pause();
+	rte_wait_until_equal_relaxed_32(&rbdr->tail, next_tail);
 
 	__atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
 	nicvf_addr_write(door, to_fill);
-- 
2.7.4



* [dpdk-dev] [PATCH v5 8/8] event/opdl: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (32 preceding siblings ...)
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 7/8] net/thunderx: use new API to save cycles " Gavin Hu
@ 2019-09-12 11:24 ` " Gavin Hu
  2019-09-12 16:16   ` Jerin Jacob
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 0/7] use WFE for aarch64 Gavin Hu
                   ` (47 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-09-12 11:24 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/event/opdl/opdl_ring.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
index e8b29e2..f446fa3 100644
--- a/drivers/event/opdl/opdl_ring.c
+++ b/drivers/event/opdl/opdl_ring.c
@@ -17,6 +17,7 @@
 #include <rte_memory.h>
 #include <rte_memzone.h>
 #include <rte_eal_memconfig.h>
+#include <rte_atomic.h>
 
 #include "opdl_ring.h"
 #include "opdl_log.h"
@@ -475,9 +476,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
 	/* If another thread started inputting before this one, but hasn't
 	 * finished, we need to wait for it to complete to update the tail.
 	 */
-	while (unlikely(__atomic_load_n(&s->shared.tail, __ATOMIC_ACQUIRE) !=
-			old_head))
-		rte_pause();
+	rte_wait_until_equal_acquire_32(&s->shared.tail, old_head);
 
 	__atomic_store_n(&s->shared.tail, old_head + num_entries,
 			__ATOMIC_RELEASE);
-- 
2.7.4



* Re: [dpdk-dev] [PATCH v5 1/8] config: add WFE config entry for aarch64
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 1/8] config: add WFE config entry " Gavin Hu
@ 2019-09-12 15:48   ` Jerin Jacob
  2019-09-13 16:01     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Jerin Jacob @ 2019-09-12 15:48 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, stephen, hemant.agrawal, Jerin Jacob,
	Pavan Nikhilesh, Honnappa.Nagarahalli

On Thu, Sep 12, 2019 at 5:05 PM Gavin Hu <gavin.hu@arm.com> wrote:
>
> Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.

s/RTE_USE_WFE/RTE_ARM_USE_WFE

> It can be enabled selectively based on the performance benchmarking.
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>

Does it make sense to add the Reviewed-by tags without CCing the people?

I understand there may be an internal review before sending out to the
mailing list. IMO, it is better to give the Reviewed-by on the mailing list.
I am not sure about the general practice or other people's views.



> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  config/arm/meson.build | 1 +
>  config/common_base     | 6 ++++++
>  2 files changed, 7 insertions(+)
>
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..18ecd53 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
>  impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
>
>  dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
> +dpdk_conf.set('RTE_ARM_USE_WFE', 0)
>
>  if not dpdk_conf.get('RTE_ARCH_64')
>         dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> diff --git a/config/common_base b/config/common_base
> index 8ef75c2..d4cf974 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -570,6 +570,12 @@ CONFIG_RTE_CRYPTO_MAX_DEVS=64
>  CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
>  CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
>
> +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> +# calling these APIs put the cores in low power state while waiting
> +# for the memory address to become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_ARM_USE_WFE=n

Since this is an EAL config, IMO, it is better to move it to the end
of the EAL section,
i.e. move it after CONFIG_RTE_USE_LIBBSD.

And I think we should squash this patch into "eal: add the APIs to
wait until equal", as it is a single logical change.


> +
>  #
>  # Compile NXP CAAM JR crypto Driver
>  #
> --
> 2.7.4
>


* Re: [dpdk-dev] [PATCH v5 3/8] eal: add the APIs to wait until equal
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 3/8] eal: add the APIs to wait until equal Gavin Hu
@ 2019-09-12 16:11   ` Jerin Jacob
  2019-09-13 17:05     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Jerin Jacob @ 2019-09-12 16:11 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, stephen, hemant.agrawal, Jerin Jacob,
	Pavan Nikhilesh, Honnappa.Nagarahalli

On Thu, Sep 12, 2019 at 4:56 PM Gavin Hu <gavin.hu@arm.com> wrote:
>
> The rte_wait_until_equalxx APIs abstract the functionality of 'polling
> for a memory location to become equal to a given value'.
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  .../common/include/arch/arm/rte_pause_64.h         | 30 +++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  | 98 +++++++++++++++++++++-
>  2 files changed, 127 insertions(+), 1 deletion(-)
>
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..dabde17 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
>
>  #ifndef _RTE_PAUSE_ARM64_H_
> @@ -17,6 +18,35 @@ static inline void rte_pause(void)
>         asm volatile("yield" ::: "memory");
>  }
>
> +#ifdef RTE_ARM_USE_WFE
> +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> +static __rte_always_inline void \
> +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> +{ \
> +       type tmp; \
> +       asm volatile( \
> +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> +               "b.eq   2f\n" \
> +               "sevl\n" \
> +               "1:     wfe\n" \
> +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> +               "bne    1b\n" \
> +               "2:\n" \
> +               : [tmp] "=&r" (tmp) \
> +               : [addr] "Q"(*addr), [expected] "r"(expected) \
> +               : "cc", "memory"); \
> +}
> +/* Wait for *addr to be updated with expected value */
> +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> +#endif

Looks good.

> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> index 52bd4db..dfa6a53 100644
> --- a/lib/librte_eal/common/include/generic/rte_pause.h
> +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> @@ -1,10 +1,10 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
>
>  #ifndef _RTE_PAUSE_H_
>  #define _RTE_PAUSE_H_
> -

Unwanted change.

>  /**
>   * @file
>   *
> @@ -12,6 +12,10 @@
>   *
>   */
>
> +#include <stdint.h>
> +#include <rte_common.h>
> +#include <rte_atomic.h>
> +
>  /**
>   * Pause CPU execution for a short while
>   *
> @@ -20,4 +24,96 @@
>   */
>  static inline void rte_pause(void);
>
> +/**
> + * Wait for *addr to be updated with expected value

IMO, we need to mention the relaxed attribute in the comment as well.

> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  An expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected);

> +
> +/**
> + * Wait for *addr to be updated with expected value
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  An expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t expected);
> +
> +#if !defined(RTE_ARM_USE_WFE)

Looks like there is a side effect, as meson's build/rte_build_config.h
contains the following:
#define RTE_ARM_USE_WFE 0

So actually it is defined.
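
The difference can be reproduced with a small, self-contained sketch. The
SKETCH_ macro names are hypothetical; the first definition only imitates a
generated config header that sets the option to 0.

```c
/* Imitates a generated build config: the macro exists, with value 0. */
#define SKETCH_USE_WFE 0

/* #ifdef only asks whether the macro is defined; its value is ignored. */
#ifdef SKETCH_USE_WFE
#define SKETCH_IFDEF_TAKEN 1	/* this branch is selected */
#else
#define SKETCH_IFDEF_TAKEN 0
#endif

/* Testing the value as well gives the intended "disabled" result. */
#if defined(SKETCH_USE_WFE) && SKETCH_USE_WFE
#define SKETCH_VALUE_TAKEN 1
#else
#define SKETCH_VALUE_TAKEN 0	/* this branch is selected */
#endif

static inline int sketch_ifdef_taken(void)
{
	return SKETCH_IFDEF_TAKEN;
}

static inline int sketch_value_taken(void)
{
	return SKETCH_VALUE_TAKEN;
}
```

A check on the value, or not setting the option at all when it is disabled,
would avoid both guards in the patch taking the unintended branch.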


> +#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
> +__rte_always_inline \
> +static void    \
> +rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
> +       type expected) \
> +{ \
> +       while (__atomic_load_n(addr, memorder) != expected) \
> +               rte_pause(); \
> +}
> +
> +/* Wait for *addr to be updated with expected value */
> +__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
> +__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
> +__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
> +__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
> +__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
> +__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
> +#endif /* RTE_ARM_USE_WFE */
> +
>  #endif /* _RTE_PAUSE_H_ */
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v5 5/8] ticketlock: use new API to reduce contention on aarch64
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 5/8] ticketlock: use new API " Gavin Hu
@ 2019-09-12 16:14   ` Jerin Jacob
  0 siblings, 0 replies; 163+ messages in thread
From: Jerin Jacob @ 2019-09-12 16:14 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, stephen, hemant.agrawal, Jerin Jacob,
	Pavan Nikhilesh, Honnappa.Nagarahalli

On Thu, Sep 12, 2019 at 4:57 PM Gavin Hu <gavin.hu@arm.com> wrote:
>
> While using ticket lock, cores repeatedly poll the lock variable.
> This is replaced by rte_wait_until_equal API.
>
> Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1],
> there were variances between runs, but no notable performance gain or
> degradation were seen with and without this patch.
>
> [1] https://community.arm.com/developer/tools-software/oss-platforms/w/\
> docs/440/neoverse-n1-sdp
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Tested-by: Phil Yang <phil.yang@arm.com>
> Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>

Reviewed-by: Jerin Jacob <jerinj@marvell.com>


* Re: [dpdk-dev] [PATCH v5 7/8] net/thunderx: use new API to save cycles on aarch64
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 7/8] net/thunderx: use new API to save cycles " Gavin Hu
@ 2019-09-12 16:15   ` Jerin Jacob
  0 siblings, 0 replies; 163+ messages in thread
From: Jerin Jacob @ 2019-09-12 16:15 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, stephen, hemant.agrawal, Jerin Jacob,
	Pavan Nikhilesh, Honnappa.Nagarahalli

On Thu, Sep 12, 2019 at 4:57 PM Gavin Hu <gavin.hu@arm.com> wrote:
>
> Use the new API to wait in low power state instead of continuous
> polling to save CPU cycles and power.
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Acked-by: Jerin Jacob <jerinj@marvell.com>


> ---
>  drivers/net/thunderx/nicvf_rxtx.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
> index 1c42874..16f34c6 100644
> --- a/drivers/net/thunderx/nicvf_rxtx.c
> +++ b/drivers/net/thunderx/nicvf_rxtx.c
> @@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
>                 ltail++;
>         }
>
> -       while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
> -               rte_pause();
> +       rte_wait_until_equal_relaxed_32(&rbdr->tail, next_tail);
>
>         __atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
>         nicvf_addr_write(door, to_fill);
> --
> 2.7.4
>


* Re: [dpdk-dev] [PATCH v5 8/8] event/opdl: use new API to save cycles on aarch64
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 8/8] event/opdl: " Gavin Hu
@ 2019-09-12 16:16   ` Jerin Jacob
  0 siblings, 0 replies; 163+ messages in thread
From: Jerin Jacob @ 2019-09-12 16:16 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, stephen, hemant.agrawal, Jerin Jacob,
	Pavan Nikhilesh, Honnappa.Nagarahalli

On Thu, Sep 12, 2019 at 4:57 PM Gavin Hu <gavin.hu@arm.com> wrote:
>
> Use the new API to wait in low power state instead of continuous
> polling to save CPU cycles and power.
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Reviewed-by: Jerin Jacob <jerinj@marvell.com>


> ---
>  drivers/event/opdl/opdl_ring.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
> index e8b29e2..f446fa3 100644
> --- a/drivers/event/opdl/opdl_ring.c
> +++ b/drivers/event/opdl/opdl_ring.c
> @@ -17,6 +17,7 @@
>  #include <rte_memory.h>
>  #include <rte_memzone.h>
>  #include <rte_eal_memconfig.h>
> +#include <rte_atomic.h>
>
>  #include "opdl_ring.h"
>  #include "opdl_log.h"
> @@ -475,9 +476,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
>         /* If another thread started inputting before this one, but hasn't
>          * finished, we need to wait for it to complete to update the tail.
>          */
> -       while (unlikely(__atomic_load_n(&s->shared.tail, __ATOMIC_ACQUIRE) !=
> -                       old_head))
> -               rte_pause();
> +       rte_wait_until_equal_acquire_32(&s->shared.tail, old_head);
>
>         __atomic_store_n(&s->shared.tail, old_head + num_entries,
>                         __ATOMIC_RELEASE);
> --
> 2.7.4
>


* Re: [dpdk-dev] [PATCH v5 1/8] config: add WFE config entry for aarch64
  2019-09-12 15:48   ` Jerin Jacob
@ 2019-09-13 16:01     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-09-13 16:01 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, nd, thomas, stephen, hemant.agrawal, jerinj,
	Pavan Nikhilesh, Honnappa Nagarahalli

Hi Jerin,

Thanks for reviewing the series; my comments are inline.

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, September 12, 2019 11:49 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; hemant.agrawal@nxp.com;
> jerinj@marvell.com; Pavan Nikhilesh <pbhagavatula@marvell.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v5 1/8] config: add WFE config entry for
> aarch64
> 
> On Thu, Sep 12, 2019 at 5:05 PM Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
> 
> s/RTE_USE_WFE/RTE_ARM_USE_WFE
Will fix in next version.
> 
> > It can be enabled selectively based on the performance benchmarking.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> 
> Does it make sense to add the Reviewed-by without CCing the people?
I will cc all the internal reviewers for the future patches.
> 
> I understand, There may be an internal review before sending out to
> mailing list, IMO, it better to give Reviewed-By in the mailing list.
> Not sure about general practice and other people view.
If there is a guideline on this, we will follow it strictly. Are both ways currently acceptable?
> 
> 
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  config/arm/meson.build | 1 +
> >  config/common_base     | 6 ++++++
> >  2 files changed, 7 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 979018e..18ecd53 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa,
> machine_args_generic]
> >  impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
> >
> >  dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
> > +dpdk_conf.set('RTE_ARM_USE_WFE', 0)
> >
> >  if not dpdk_conf.get('RTE_ARCH_64')
> >         dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> > diff --git a/config/common_base b/config/common_base
> > index 8ef75c2..d4cf974 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -570,6 +570,12 @@ CONFIG_RTE_CRYPTO_MAX_DEVS=64
> >  CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
> >  CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
> >
> > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > +# calling these APIs put the cores in low power state while waiting
> > +# for the memory address to become equal to the expected value.
> > +# This is supported only by aarch64.
> > +CONFIG_RTE_ARM_USE_WFE=n
> 
> Since this comes as EAL config, IMO, it is better to move this to end
> of EAL section.
> i.e move after CONFIG_RTE_USE_LIBBSD
Will fix it in next version.
> And I think, we should squash this patch to  "eal: add the APIs to
> wait until equall" as it single logical change.
Will fix it in next version.
> 
> > +
> >  #
> >  # Compile NXP CAAM JR crypto Driver
> >  #
> > --
> > 2.7.4
> >


* Re: [dpdk-dev] [PATCH v5 3/8] eal: add the APIs to wait until equal
  2019-09-12 16:11   ` Jerin Jacob
@ 2019-09-13 17:05     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-09-13 17:05 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, nd, thomas, stephen, hemant.agrawal, jerinj,
	Pavan Nikhilesh, Honnappa Nagarahalli

Hi Jerin,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Friday, September 13, 2019 12:12 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; hemant.agrawal@nxp.com;
> jerinj@marvell.com; Pavan Nikhilesh <pbhagavatula@marvell.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v5 3/8] eal: add the APIs to wait until equal
> 
> On Thu, Sep 12, 2019 at 4:56 PM Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > The rte_wait_until_equalxx APIs abstract the functionality of 'polling
> > for a memory location to become equal to a given value'.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  .../common/include/arch/arm/rte_pause_64.h         | 30 +++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  | 98
> +++++++++++++++++++++-
> >  2 files changed, 127 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..dabde17 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_ARM64_H_
> > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> >         asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_ARM_USE_WFE
> > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > +static __rte_always_inline void \
> > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > +{ \
> > +       type tmp; \
> > +       asm volatile( \
> > +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> > +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> > +               "b.eq   2f\n" \
> > +               "sevl\n" \
> > +               "1:     wfe\n" \
> > +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> > +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> > +               "bne    1b\n" \
> > +               "2:\n" \
> > +               : [tmp] "=&r" (tmp) \
> > +               : [addr] "Q"(*addr), [expected] "r"(expected) \
> > +               : "cc", "memory"); \
> > +}
> > +/* Wait for *addr to be updated with expected value */
> > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> > +#endif
> 
> Looks good.
> 
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h
> b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..dfa6a53 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -1,10 +1,10 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_H_
> >  #define _RTE_PAUSE_H_
> > -
> 
> Unwanted change.
Will fix it.
> 
> >  /**
> >   * @file
> >   *
> > @@ -12,6 +12,10 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +24,96 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +/**
> > + * Wait for *addr to be updated with expected value
> 
> IMO, We need to mention relaxed attribute also in the comment.
Will fix it.
> 
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  An expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t
> expected);
> 
> > +
> > +/**
> > + * Wait for *addr to be updated with expected value
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  An expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t
> expected);
> > +
> > +#if !defined(RTE_ARM_USE_WFE)
> 
> Looks like there is a side effect as meson's build/rte_build_config.h
> comes as below
> #define RTE_ARM_USE_WFE 0
> 
> So actually it is defined.
Good catch, thanks, will fix it.
> 
> > +#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
> > +__rte_always_inline \
> > +static void    \
> > +rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
> > +       type expected) \
> > +{ \
> > +       while (__atomic_load_n(addr, memorder) != expected) \
> > +               rte_pause(); \
> > +}
> > +
> > +/* Wait for *addr to be updated with expected value */
> > +__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
> > +__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
> > +__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
> > +#endif /* RTE_ARM_USE_WFE */
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4
> >


* [dpdk-dev] [PATCH v6 0/7] use WFE for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (33 preceding siblings ...)
  2019-09-12 11:24 ` [dpdk-dev] [PATCH v5 8/8] event/opdl: " Gavin Hu
@ 2019-09-14 14:59 ` Gavin Hu
  2019-09-26 13:41   ` Jerin Jacob
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 1/7] bus/fslmc: fix the conflicting dmb function Gavin Hu
                   ` (46 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-09-14 14:59 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

V6:
- squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
- move the new configuration to the end of EAL
- add doxygen comments to reflect the relaxed and acquire semantics
- correct the meson configuration
V5:
- add doxygen comments for the new APIs
- spinlock: exit early without wfe if the spinlock is not taken by others
- add two patches on top for opdl and thunderx
V4:
- rename the config to CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for the assembly skeleton to reduce code duplication
- add one patch for nxp fslmc to address a compilation error
V3:
- Convert RFCs to patches
V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline
V1:
- Add the new APIs and use them for ring and locks

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
allows the CPU core to enter a low-power state until it is woken up by an
update to the memory location being polled, thereby reducing the cache and
memory transactions.
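
As an illustration of the paragraph above, the WFE-based polling loop could
look roughly like this for the 32-bit relaxed case. This is only a sketch: the
function name is hypothetical, the non-aarch64 branch is a plain busy-poll
added so that the example builds everywhere, and GCC or Clang inline asm and
__atomic builtins are assumed.

```c
#include <stdint.h>

/* Hypothetical name; the series generates the real functions from a macro. */
static inline void
sketch_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected)
{
#if defined(__aarch64__)
	uint32_t tmp;

	asm volatile(
		/* Load-exclusive arms the monitor on *addr. */
		"ldxr	%w[tmp], %[addr]\n"
		"cmp	%w[tmp], %w[expected]\n"
		"b.eq	2f\n"		/* already equal: skip the wait */
		"sevl\n"		/* make the first wfe fall through */
		"1:	wfe\n"		/* sleep until an event arrives */
		"ldxr	%w[tmp], %[addr]\n"
		"cmp	%w[tmp], %w[expected]\n"
		"b.ne	1b\n"
		"2:\n"
		: [tmp] "=&r" (tmp)
		: [addr] "Q" (*addr), [expected] "r" (expected)
		: "cc", "memory");
#else
	/* Fallback for this example only: plain polling. */
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		;
#endif
}
```

The load-exclusive is what makes the later wfe useful: a write to the
monitored location by another core clears the monitor, which generates the
wake-up event.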

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a polling loop
with rte_pause, so there are no performance differences.
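
A minimal sketch of such a wrapper is shown here, assuming GCC or Clang
__atomic builtins. The sketch_ names are hypothetical, sketch_pause() only
stands in for rte_pause(), and the returned spin count exists purely to make
the behaviour observable in the example.

```c
#include <stdint.h>

/* Stand-in for rte_pause(); here it is only a compiler barrier. */
static inline void sketch_pause(void)
{
	__asm__ volatile("" ::: "memory");
}

/*
 * Poll *addr with acquire ordering until it equals expected.
 * Returns how many times the loop had to pause (illustration only).
 */
static inline uint64_t
sketch_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected)
{
	uint64_t spins = 0;

	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected) {
		sketch_pause();
		spins++;
	}
	return spins;
}
```

When the location already holds the expected value, the function returns
without pausing at all, which matches the behaviour of the open-coded loops
that the series replaces.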

For Arm platforms, use of WFE can be configured using the
CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (SEV) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_ARM_USE_WFE should be enabled depending on the performance
benchmarking on the target platforms. Power saving should be a bonus,
but currently we don't have ways to characterize that.

Gavin Hu (7):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  spinlock: use wfe to reduce contention on aarch64
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |  10 +-
 drivers/bus/fslmc/mc/mc_sys.c                      |   3 +-
 drivers/event/opdl/opdl_ring.c                     |   5 +-
 drivers/net/thunderx/nicvf_rxtx.c                  |   3 +-
 .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
 .../common/include/arch/arm/rte_spinlock.h         |  26 ++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 103 +++++++++++++++++++++
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 lib/librte_ring/rte_ring_c11_mem.h                 |   4 +-
 lib/librte_ring/rte_ring_generic.h                 |   3 +-
 12 files changed, 180 insertions(+), 16 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH v6 1/7] bus/fslmc: fix the conflicting dmb function
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (34 preceding siblings ...)
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 0/7] use WFE for aarch64 Gavin Hu
@ 2019-09-14 14:59 ` Gavin Hu
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 2/7] eal: add the APIs to wait until equal Gavin Hu
                   ` (45 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-14 14:59 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper,
	stable

Two definitions of dmb() conflict with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to include rte_spinlock.h before the other header files,
which is in line with the coding style [2] on header includes.
The fix also changes the dmb macro to take an option argument, so that it
emits a real barrier on Arm.
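
To show what the option argument does, here is a small sketch of the
stringification used by the patched macro. SKETCH_DMB_TEXT and the two helper
functions are hypothetical and exist only for this example; they build the
instruction text without executing any barrier.

```c
#include <string.h>

/*
 * Hypothetical helper: the same "dmb " #opt concatenation that the
 * patched dmb(opt) macro hands to asm volatile on aarch64.
 */
#define SKETCH_DMB_TEXT(opt) "dmb " #opt

/* Text behind __iormb(), which the patch maps to dmb(ld). */
static inline const char *sketch_iormb_text(void)
{
	return SKETCH_DMB_TEXT(ld);
}

/* Text behind __iowmb(), which the patch maps to dmb(st). */
static inline const char *sketch_iowmb_text(void)
{
	return SKETCH_DMB_TEXT(st);
}
```

The old macro expanded to an empty asm statement, that is, a compiler barrier
only. With the argument, the read and write barriers become distinct "dmb ld"
and "dmb st" instructions on aarch64.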

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@i
VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u
[2] https://doc.dpdk.org/guides/contributing/coding_style.html

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 10 +++++++---
 drivers/bus/fslmc/mc/mc_sys.c     |  3 +--
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..fe9dc95 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -33,10 +33,14 @@ struct fsl_mc_io {
 #include <linux/byteorder/little_endian.h>
 
 #ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
+#ifdef RTE_ARCH_ARM64
+#define dmb(opt) {asm volatile("dmb " #opt : : : "memory"); }
+#else
+#define dmb(opt)
 #endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#endif
+#define __iormb()	dmb(ld)
+#define __iowmb()	dmb(st)
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
diff --git a/drivers/bus/fslmc/mc/mc_sys.c b/drivers/bus/fslmc/mc/mc_sys.c
index efafdc3..22143ef 100644
--- a/drivers/bus/fslmc/mc/mc_sys.c
+++ b/drivers/bus/fslmc/mc/mc_sys.c
@@ -4,11 +4,10 @@
  * Copyright 2017 NXP
  *
  */
+#include <rte_spinlock.h>
 #include <fsl_mc_sys.h>
 #include <fsl_mc_cmd.h>
 
-#include <rte_spinlock.h>
-
 /** User space framework uses MC Portal in shared mode. Following change
  * introduces lock in MC FLIB
  */
-- 
2.7.4



* [dpdk-dev] [PATCH v6 2/7] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (35 preceding siblings ...)
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 1/7] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-09-14 14:59 ` Gavin Hu
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 3/7] spinlock: use wfe to reduce contention on aarch64 Gavin Hu
                   ` (44 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-14 14:59 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

The rte_wait_until_equal_xx APIs abstract the functionality of
'polling for a memory location to become equal to a given value'.

Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
by default. When it is enabled, the above APIs use the WFE instruction
to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 103 +++++++++++++++++++++
 4 files changed, 139 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..b4b4cac 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -26,6 +26,7 @@ flags_common_default = [
 	['RTE_LIBRTE_AVP_PMD', false],
 
 	['RTE_SCHED_VECTOR', false],
+	['RTE_ARM_USE_WFE', false],
 ]
 
 flags_generic = [
diff --git a/config/common_base b/config/common_base
index 8ef75c2..8861713 100644
--- a/config/common_base
+++ b/config/common_base
@@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs.
+# Calling these APIs puts the cores in a low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
 
 #
 # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..dabde17 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -17,6 +18,35 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_ARM_USE_WFE
+#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
+static __rte_always_inline void \
+rte_wait_until_equal_##name(volatile type * addr, type expected) \
+{ \
+	type tmp; \
+	asm volatile( \
+		#asm_op " %" #wide "[tmp], %[addr]\n" \
+		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
+		"b.eq	2f\n" \
+		"sevl\n" \
+		"1:	wfe\n" \
+		#asm_op " %" #wide "[tmp], %[addr]\n" \
+		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
+		"bne	1b\n" \
+		"2:\n" \
+		: [tmp] "=&r" (tmp) \
+		: [addr] "Q"(*addr), [expected] "r"(expected) \
+		: "cc", "memory"); \
+}
+/* Wait for *addr to be updated with expected value */
+__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
+__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
+__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
+__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
+__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
+__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..772a2ab 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
@@ -12,6 +13,10 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +25,102 @@
  */
 static inline void rte_pause(void);
 
+/**
+ * Wait for *addr to be updated with a 16-bit expected value, with a relaxed memory
+ * ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected);
+
+/**
+ * Wait for *addr to be updated with a 32-bit expected value, with a relaxed memory
+ * ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected);
+
+/**
+ * Wait for *addr to be updated with a 64-bit expected value, with a relaxed memory
+ * ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t expected);
+
+/**
+ * Wait for *addr to be updated with a 16-bit expected value, with an acquire memory
+ * ordering model meaning the loads after this API can't be observed before this API.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t expected);
+
+/**
+ * Wait for *addr to be updated with a 32-bit expected value, with an acquire memory
+ * ordering model meaning the loads after this API can't be observed before this API.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected);
+
+/**
+ * Wait for *addr to be updated with a 64-bit expected value, with an acquire memory
+ * ordering model meaning the loads after this API can't be observed before this API.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t expected);
+
+#if !defined(RTE_ARM_USE_WFE)
+#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
+__rte_always_inline \
+static void	\
+rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
+	type expected) \
+{ \
+	while (__atomic_load_n(addr, memorder) != expected) \
+		rte_pause(); \
+}
+
+/* Wait for *addr to be updated with expected value */
+__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
+__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
+__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
+#endif /* RTE_ARM_USE_WFE */
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4
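
For readers comparing the relaxed and acquire variants documented in this
patch, the following sketch shows the typical case where acquire is needed: a
consumer waits on a flag and then reads data that the producer wrote before
setting the flag. All names are hypothetical, and GCC or Clang __atomic
builtins are assumed.

```c
#include <stdint.h>

struct sketch_mailbox {
	uint32_t payload;		/* plain data, written before the flag */
	volatile uint32_t ready;	/* flag polled by the consumer */
};

/* Producer: write the data, then publish it with a release store. */
static inline void
sketch_publish(struct sketch_mailbox *mb, uint32_t value)
{
	mb->payload = value;
	__atomic_store_n(&mb->ready, 1, __ATOMIC_RELEASE);
}

/*
 * Consumer: wait with acquire ordering so that the payload load cannot
 * be observed before the flag load that saw ready == 1.
 */
static inline uint32_t
sketch_consume(struct sketch_mailbox *mb)
{
	while (__atomic_load_n(&mb->ready, __ATOMIC_ACQUIRE) != 1)
		;
	return mb->payload;
}
```

If the wait used relaxed ordering instead, the payload load could be reordered
ahead of the flag check, so the relaxed variants fit best where the caller
adds its own ordering afterwards or only needs the polled value itself.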



* [dpdk-dev] [PATCH v6 3/7] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (36 preceding siblings ...)
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 2/7] eal: add the APIs to wait until equal Gavin Hu
@ 2019-09-14 14:59 ` Gavin Hu
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 4/7] ticketlock: use new API " Gavin Hu
                   ` (43 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-14 14:59 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

In acquiring a spinlock, cores repeatedly poll the lock variable.
This is replaced by rte_wait_until_equal API.
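
As a rough, portable sketch of the locking pattern being described (not the
aarch64 assembly in this patch), the lock first tries to take the lock and
only falls back to waiting when it is held by someone else. The sketch_ names
are hypothetical, and GCC or Clang __atomic builtins are assumed.

```c
typedef struct {
	volatile int locked;	/* 0 = free, 1 = taken */
} sketch_spinlock_t;

static inline void
sketch_spinlock_lock(sketch_spinlock_t *sl)
{
	int exp = 0;

	/* Early exit: if the lock is free, take it without waiting. */
	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
		/* Wait until the lock looks free; a wait-until-equal
		 * helper (WFE on aarch64) would replace this loop.
		 */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED) != 0)
			;
		exp = 0;
	}
}

static inline void
sketch_spinlock_unlock(sketch_spinlock_t *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```

The inner wait only needs relaxed ordering because the acquire semantics come
from the successful compare-exchange that follows it.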

Running the micro-benchmarks and the testpmd and l3fwd traffic tests
on ThunderX2, Ampere eMAG80 and Arm N1SDP showed no notable
performance gain or degradation.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 26 ++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..b61c055 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,32 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#ifndef RTE_FORCE_INTRINSICS
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"1:	ldaxr %w[tmp], %w[locked]\n"
+		"cbnz   %w[tmp], 2f\n"
+		"stxr   %w[tmp], %w[one], %w[locked]\n"
+		"cbnz   %w[tmp], 1b\n"
+		"ret\n"
+		"2:	sevl\n"
+		"wfe\n"
+		"jmp	1b\n"
+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
+		: [one] "r" (1)
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v6 4/7] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (37 preceding siblings ...)
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 3/7] spinlock: use wfe to reduce contention on aarch64 Gavin Hu
@ 2019-09-14 14:59 ` " Gavin Hu
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 5/7] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (42 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-14 14:59 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

While using ticket lock, cores repeatedly poll the lock variable.
This is replaced by rte_wait_until_equal API.

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1],
there were variances between runs, but no notable performance gain or
degradation was seen with and without this patch.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/\
docs/440/neoverse-n1-sdp
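
As a rough sketch of the idea (not the DPDK implementation), a ticket
lock waits until the "current" counter equals the caller's ticket, which
is exactly the shape of a wait-until-equal call. The sketch_* names and
the struct layout below are assumptions made for this example; it relies
on the GCC/Clang __atomic builtins.

```c
#include <stdint.h>

/* Hypothetical ticket lock layout for this sketch. */
typedef struct {
	volatile uint16_t current; /* ticket now being served */
	volatile uint16_t next;    /* next ticket to hand out */
} sketch_ticketlock_t;

/* Stand-in for the acquire variant of a 16-bit wait-until-equal helper. */
static inline void
sketch_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		; /* a CPU relax hint (or WFE) would go here */
}

static inline void
sketch_ticketlock_lock(sketch_ticketlock_t *tl)
{
	/* Take a ticket, then wait for it to be served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	sketch_wait_until_equal_acquire_16(&tl->current, me);
}

static inline void
sketch_ticketlock_unlock(sketch_ticketlock_t *tl)
{
	uint16_t i = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	/* Serve the next ticket; release pairs with the waiter's acquire. */
	__atomic_store_n(&tl->current, (uint16_t)(i + 1), __ATOMIC_RELEASE);
}
```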

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..232bbe9 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_acquire_16(&tl->s.current, me);
 }
 
 /**
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v6 5/7] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (38 preceding siblings ...)
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 4/7] ticketlock: use new API " Gavin Hu
@ 2019-09-14 14:59 ` " Gavin Hu
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 6/7] net/thunderx: use new API to save cycles " Gavin Hu
                   ` (41 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-14 14:59 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

Instead of polling for tail to be updated, use wfe instruction.
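
A simplified sketch of the tail update, for illustration: in the
multi-producer/consumer case a thread waits until the tail reaches the
value it reserved from, then publishes its own new tail. The struct and
the sketch_* names are hypothetical, and the helper is a plain spin loop
rather than the WFE-based one; GCC/Clang __atomic builtins are assumed.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down head/tail pair for this sketch. */
struct sketch_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

/* Stand-in for a relaxed 32-bit wait-until-equal helper. */
static inline void
sketch_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		; /* a CPU relax hint (or WFE) would go here */
}

static inline void
sketch_update_tail(struct sketch_headtail *ht, uint32_t old_val,
		uint32_t new_val, int single)
{
	/* If earlier enqueues/dequeues are still in flight, wait for them. */
	if (!single)
		sketch_wait_until_equal_relaxed_32(&ht->tail, old_val);

	/* Publish the new tail; release orders the preceding ring accesses. */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```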

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 4 ++--
 lib/librte_ring/rte_ring_generic.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..764d8f1 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal_relaxed_32(&ht->tail, old_val);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..6828527 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal_relaxed_32(&ht->tail, old_val);
 
 	ht->tail = new_val;
 }
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v6 6/7] net/thunderx: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (39 preceding siblings ...)
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 5/7] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-09-14 14:59 ` " Gavin Hu
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 7/7] event/opdl: " Gavin Hu
                   ` (40 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-14 14:59 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/net/thunderx/nicvf_rxtx.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
index 1c42874..16f34c6 100644
--- a/drivers/net/thunderx/nicvf_rxtx.c
+++ b/drivers/net/thunderx/nicvf_rxtx.c
@@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
 		ltail++;
 	}
 
-	while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
-		rte_pause();
+	rte_wait_until_equal_relaxed_32(&rbdr->tail, next_tail);
 
 	__atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
 	nicvf_addr_write(door, to_fill);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v6 7/7] event/opdl: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (40 preceding siblings ...)
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 6/7] net/thunderx: use new API to save cycles " Gavin Hu
@ 2019-09-14 14:59 ` " Gavin Hu
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 0/7] use WFE for aarch64 Gavin Hu
                   ` (39 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-14 14:59 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/event/opdl/opdl_ring.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
index e8b29e2..f446fa3 100644
--- a/drivers/event/opdl/opdl_ring.c
+++ b/drivers/event/opdl/opdl_ring.c
@@ -17,6 +17,7 @@
 #include <rte_memory.h>
 #include <rte_memzone.h>
 #include <rte_eal_memconfig.h>
+#include <rte_atomic.h>
 
 #include "opdl_ring.h"
 #include "opdl_log.h"
@@ -475,9 +476,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
 	/* If another thread started inputting before this one, but hasn't
 	 * finished, we need to wait for it to complete to update the tail.
 	 */
-	while (unlikely(__atomic_load_n(&s->shared.tail, __ATOMIC_ACQUIRE) !=
-			old_head))
-		rte_pause();
+	rte_wait_until_equal_acquire_32(&s->shared.tail, old_head);
 
 	__atomic_store_n(&s->shared.tail, old_head + num_entries,
 			__ATOMIC_RELEASE);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [EXT] [PATCH v4 5/6] spinlock: use wfe to reduce contention on aarch64
       [not found]     ` <VI1PR08MB5376BEBCC1FD1E03F0B8A8848FB00@VI1PR08MB5376.eurprd08.prod.outlook.com>
@ 2019-09-14 15:21       ` " Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-09-14 15:21 UTC (permalink / raw)
  To: Pavan Nikhilesh Bhagavatula, jerinj
  Cc: dev, Honnappa Nagarahalli, Phil Yang (Arm Technology China),
	Ruifeng Wang (Arm Technology China)

Hi Jerin,

Adding the off-list discussion with Pavan to facilitate the review of the spinlock patch (currently in v6). Thanks to Pavan and Jerin for the review.

Best Regards,
Gavin

> -----Original Message-----
> From: Gavin Hu (Arm Technology China)
> Sent: Thursday, September 12, 2019 5:22 PM
> To: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>
> Subject: RE: [EXT] [PATCH v4 5/6] spinlock: use wfe to reduce contention on
> aarch64
>
> Hi Pavan,
>
> Thanks for pointing this out; the early exit is already implemented in
> the API. The spinlock does not use the API, in order to save a comparison
> (loading 0 into a register and comparing against it).
>
> Anyway it is also a good idea to add it into this asm code.
>
> Best Regards,
> Gavin
>
> > -----Original Message-----
> > From: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>
> > Sent: Thursday, September 12, 2019 4:45 PM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> > Subject: RE: [EXT] [PATCH v4 5/6] spinlock: use wfe to reduce contention on
> > aarch64
> >
> > Hi Gavin, (Offlist)
> >
> > Is there a reason why the below asm doesn't use early exit as discussed in
> > http://patches.dpdk.org/patch/55669/
> >
> > Regards,
> > Pavan.
> >
> > >+#ifndef RTE_FORCE_INTRINSICS
> > >+static inline void
> > >+rte_spinlock_lock(rte_spinlock_t *sl)
> > >+{
> > >+  unsigned int tmp;
> > >+  /* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
> > >+   * faqs/ka16809.html
> > >+   */
> > >+  asm volatile(
> > >+          "sevl\n"
> > >+          "1:     wfe\n"
> > >+          "2:     ldaxr %w[tmp], %w[locked]\n"
> > >+          "cbnz   %w[tmp], 1b\n"
> > >+          "stxr   %w[tmp], %w[one], %w[locked]\n"
> > >+          "cbnz   %w[tmp], 2b\n"
> > >+          : [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
> > >+          : [one] "r" (1)
> > >+          : "cc", "memory");
> > >+}
> > >+#endif
> > >+
> > > static inline int rte_tm_supported(void)
> > > {
> > >   return 0;
> > >--
> > >2.7.4

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v6 0/7] use WFE for aarch64
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 0/7] use WFE for aarch64 Gavin Hu
@ 2019-09-26 13:41   ` Jerin Jacob
  2019-09-27  5:45     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Jerin Jacob @ 2019-09-26 13:41 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dpdk-dev, nd, Thomas Monjalon, Stephen Hemminger, Hemant Agrawal,
	Jerin Jacob, Pavan Nikhilesh, Honnappa.Nagarahalli, ruifeng.wang,
	phil.yang, steve.capper

On Sat, Sep 14, 2019 at 8:30 PM Gavin Hu <gavin.hu@arm.com> wrote:
>
> V6:
> - squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
> - move the new configuration to the end of EAL
> - add doxygen comments to reflect the relaxed and acquire semantics
> - correct the meson configuration
> V5:
> - add doxygen comments for the new APIs
> - spinlock early exit without wfe if the spinlock not taken by others.
> - add two patches on top for opdl and thunderx
> V4:
> - rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
> - introduce a macro for assembly skeleton to reduce the duplication of code
> - add one patch for nxp fslmc to address a compiling error
> V3:
> - Convert RFCs to patches
> V2:
> - Use inline functions instead of macros
> - Add load and compare in the beginning of the APIs
> - Fix some style errors in asm inline
> V1:
> - Add the new APIs and use it for ring and locks
>
> DPDK has multiple use cases where the core repeatedly polls a location in
> memory. This polling results in many cache and memory transactions.
>
> Arm architecture provides WFE (Wait For Event) instruction, which allows
> the cpu core to enter a low power state until woken up by the update to the
> memory location being polled, thus reducing the cache and memory
> transactions.
>
> x86 has the PAUSE hint instruction to reduce such overhead.
>
> The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
> for a memory location to become equal to a given value'.
>
> For non-Arm platforms, these APIs are just wrappers around do-while loop
> with rte_pause, so there are no performance differences.
>
> For Arm platforms, use of WFE can be configured using the
> CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.
>
> Currently, use of WFE is supported only for aarch64 platforms. armv7
> platforms do support the WFE instruction, but they require explicit wake up
> events (SEV) and are less performant.
>
> Testing shows that performance varies across different platforms, with
> some showing degradation.
>
> CONFIG_RTE_ARM_USE_WFE should be enabled depending on the performance
> benchmarking on the target platforms. Power saving should be a bonus,
> but currently we don't have ways to characterize that.
>
> Gavin Hu (7):
>   bus/fslmc: fix the conflicting dmb function
>   eal: add the APIs to wait until equal
>   spinlock: use wfe to reduce contention on aarch64
>   ticketlock: use new API to reduce contention on aarch64
>   ring: use wfe to wait for ring tail update on aarch64
>   net/thunderx: use new API to save cycles on aarch64
>   event/opdl: use new API to save cycles on aarch64



There is checkpatch failure.
### eal: add the APIs to wait until equal

WARNING:LONG_LINE_COMMENT: line over 80 characters
#123: FILE: lib/librte_eal/common/include/generic/rte_pause.h:29:
+ * Wait for *addr to be updated with a 16-bit expected value, with a
relaxed memory

With checkpatch fixes:

Acked-by: Jerin Jacob <jerinj@marvell.com>

^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v7 0/7] use WFE for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (41 preceding siblings ...)
  2019-09-14 14:59 ` [dpdk-dev] [PATCH v6 7/7] event/opdl: " Gavin Hu
@ 2019-09-27  5:41 ` Gavin Hu
  2019-10-17 18:37   ` David Marchand
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 1/7] bus/fslmc: fix the conflicting dmb function Gavin Hu
                   ` (38 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-09-27  5:41 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

V7:
- fix the checkpatch LONG_LINE_COMMENT issue
V6:
- squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
- move the new configuration to the end of EAL
- add doxygen comments to reflect the relaxed and acquire semantics
- correct the meson configuration 
V5:
- add doxygen comments for the new APIs
- spinlock early exit without wfe if the spinlock not taken by others.
- add two patches on top for opdl and thunderx
V4:
- rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for assembly skeleton to reduce the duplication of code
- add one patch for nxp fslmc to address a compiling error
V3:
- Convert RFCs to patches
V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline
V1:
- Add the new APIs and use it for ring and locks

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

Arm architecture provides WFE (Wait For Event) instruction, which allows
the cpu core to enter a low power state until woken up by the update to the
memory location being polled, thus reducing the cache and memory
transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around do-while loop
with rte_pause, so there are no performance differences.
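
To make the fallback concrete, here is a minimal sketch of such a
wrapper, assuming GCC/Clang inline asm and __atomic builtins. The
sketch_* names are hypothetical; sketch_pause() is only modelled on the
idea of rte_pause() and is not the EAL implementation.

```c
#include <stdint.h>

/* Hypothetical per-architecture relax hint for this sketch. */
static inline void
sketch_pause(void)
{
#if defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
	__asm__ volatile("pause" ::: "memory");
#else
	__asm__ volatile("" ::: "memory"); /* compiler barrier only */
#endif
}

/* Poll *addr with acquire loads until it equals expected. */
static inline void
sketch_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		sketch_pause();
}
```

A caller that previously open-coded the load/compare/pause loop would
replace it with one call to the helper; behaviour on these platforms is
intended to be the same as before.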

For Arm platforms, use of WFE can be configured using the
CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake up
events (SEV) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_ARM_USE_WFE should be enabled depending on the performance
benchmarking on the target platforms. Power saving should be a bonus,
but currently we don't have ways to characterize that.

Gavin Hu (7):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  spinlock: use wfe to reduce contention on aarch64
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |  10 +-
 drivers/bus/fslmc/mc/mc_sys.c                      |   3 +-
 drivers/event/opdl/opdl_ring.c                     |   5 +-
 drivers/net/thunderx/nicvf_rxtx.c                  |   3 +-
 .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
 .../common/include/arch/arm/rte_spinlock.h         |  26 +++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 106 +++++++++++++++++++++
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 lib/librte_ring/rte_ring_c11_mem.h                 |   4 +-
 lib/librte_ring/rte_ring_generic.h                 |   3 +-
 12 files changed, 183 insertions(+), 16 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v7 1/7] bus/fslmc: fix the conflicting dmb function
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (42 preceding siblings ...)
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 0/7] use WFE for aarch64 Gavin Hu
@ 2019-09-27  5:41 ` Gavin Hu
  2019-09-27  8:24   ` Hemant Agrawal
  2019-10-17 15:06   ` David Marchand
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal Gavin Hu
                   ` (37 subsequent siblings)
  81 siblings, 2 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-27  5:41 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper,
	stable

There are two definitions that conflict with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to include rte_spinlock.h before the other header files, which
is in line with the coding style [2] for header includes. The fix also
changes the dmb macro to take an argument so that it is meaningful on
Arm.

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@i
VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u
[2] https://doc.dpdk.org/guides/contributing/coding_style.html
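
A small sketch of why include order matters with an #ifndef guard (the
first definition below only imitates what an earlier-included header
might provide; it is an assumption for this example, as are the sketch_*
names, and the non-Arm branch uses a compiler barrier purely so that the
example builds everywhere):

```c
#include <stdint.h>

/* First definition, as if it came from an earlier-included EAL header. */
#if defined(__aarch64__)
#define dmb(opt) __asm__ volatile("dmb " #opt : : : "memory")
#else
#define dmb(opt) __asm__ volatile("" : : : "memory") /* compiler barrier */
#endif

/* Driver header: this guard only avoids a redefinition if the macro
 * above is already visible when the driver header is parsed.
 */
#ifndef dmb
#define dmb(opt) do { } while (0)
#endif

#define sketch_iormb()	dmb(ld)
#define sketch_iowmb()	dmb(st)

/* Read a 64-bit location, then order later accesses after the load. */
static inline uint64_t
sketch_read64(const volatile uint64_t *reg)
{
	uint64_t v = *reg;

	sketch_iormb();
	return v;
}

/* Order earlier stores before writing the 64-bit location. */
static inline void
sketch_write64(volatile uint64_t *reg, uint64_t v)
{
	sketch_iowmb();
	*reg = v;
}
```

If the two definitions were seen in the opposite order, the unguarded
one would trigger the "redefined" error quoted above.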

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 10 +++++++---
 drivers/bus/fslmc/mc/mc_sys.c     |  3 +--
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..fe9dc95 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -33,10 +33,14 @@ struct fsl_mc_io {
 #include <linux/byteorder/little_endian.h>
 
 #ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
+#ifdef RTE_ARCH_ARM64
+#define dmb(opt) {asm volatile("dmb " #opt : : : "memory"); }
+#else
+#define dmb(opt)
 #endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#endif
+#define __iormb()	dmb(ld)
+#define __iowmb()	dmb(st)
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
diff --git a/drivers/bus/fslmc/mc/mc_sys.c b/drivers/bus/fslmc/mc/mc_sys.c
index efafdc3..22143ef 100644
--- a/drivers/bus/fslmc/mc/mc_sys.c
+++ b/drivers/bus/fslmc/mc/mc_sys.c
@@ -4,11 +4,10 @@
  * Copyright 2017 NXP
  *
  */
+#include <rte_spinlock.h>
 #include <fsl_mc_sys.h>
 #include <fsl_mc_cmd.h>
 
-#include <rte_spinlock.h>
-
 /** User space framework uses MC Portal in shared mode. Following change
  * introduces lock in MC FLIB
  */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (43 preceding siblings ...)
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 1/7] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-09-27  5:41 ` Gavin Hu
  2019-09-27 11:03   ` Jerin Jacob
                     ` (3 more replies)
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 3/7] spinlock: use wfe to reduce contention on aarch64 Gavin Hu
                   ` (36 subsequent siblings)
  81 siblings, 4 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-27  5:41 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

The rte_wait_until_equal_xx APIs abstract the functionality of
'polling for a memory location to become equal to a given value'.

Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
by default. When it is enabled, the above APIs use the WFE instruction
to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 106 +++++++++++++++++++++
 4 files changed, 142 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..b4b4cac 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -26,6 +26,7 @@ flags_common_default = [
 	['RTE_LIBRTE_AVP_PMD', false],
 
 	['RTE_SCHED_VECTOR', false],
+	['RTE_ARM_USE_WFE', false],
 ]
 
 flags_generic = [
diff --git a/config/common_base b/config/common_base
index 8ef75c2..8861713 100644
--- a/config/common_base
+++ b/config/common_base
@@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs.
+# Calling these APIs puts the cores in a low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
 
 #
 # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..dabde17 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -17,6 +18,35 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_ARM_USE_WFE
+#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
+static __rte_always_inline void \
+rte_wait_until_equal_##name(volatile type * addr, type expected) \
+{ \
+	type tmp; \
+	asm volatile( \
+		#asm_op " %" #wide "[tmp], %[addr]\n" \
+		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
+		"b.eq	2f\n" \
+		"sevl\n" \
+		"1:	wfe\n" \
+		#asm_op " %" #wide "[tmp], %[addr]\n" \
+		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
+		"bne	1b\n" \
+		"2:\n" \
+		: [tmp] "=&r" (tmp) \
+		: [addr] "Q"(*addr), [expected] "r"(expected) \
+		: "cc", "memory"); \
+}
+/* Wait for *addr to be updated with expected value */
+__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
+__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
+__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
+__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
+__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
+__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..8906473 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
@@ -12,6 +13,10 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +25,105 @@
  */
 static inline void rte_pause(void);
 
+/**
+ * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected);
+
+/**
+ * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected);
+
+/**
+ * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t expected);
+
+/**
+ * Wait for *addr to be updated with a 16-bit expected value, with an acquire
+ * memory ordering model meaning the loads after this API can't be observed
+ * before this API.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t expected);
+
+/**
+ * Wait for *addr to be updated with a 32-bit expected value, with an acquire
+ * memory ordering model meaning the loads after this API can't be observed
+ * before this API.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected);
+
+/**
+ * Wait for *addr to be updated with a 64-bit expected value, with an acquire
+ * memory ordering model meaning the loads after this API can't be observed
+ * before this API.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ */
+__rte_always_inline
+static void
+rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t expected);
+
+#if !defined(RTE_ARM_USE_WFE)
+#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
+__rte_always_inline \
+static void	\
+rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
+	type expected) \
+{ \
+	while (__atomic_load_n(addr, memorder) != expected) \
+		rte_pause(); \
+}
+
+/* Wait for *addr to be updated with expected value */
+__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
+__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
+__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
+__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
+#endif /* RTE_ARM_USE_WFE */
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v7 3/7] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (44 preceding siblings ...)
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal Gavin Hu
@ 2019-09-27  5:41 ` Gavin Hu
  2019-10-17 18:27   ` David Marchand
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 4/7] ticketlock: use new API " Gavin Hu
                   ` (35 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-09-27  5:41 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

When acquiring a spinlock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

Running the micro-benchmarks and the testpmd and l3fwd traffic tests
on ThunderX2, Ampere eMAG80 and Arm N1SDP, everything went well and no
notable performance gain or degradation was measured.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 26 ++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..b61c055 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,32 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#ifndef RTE_FORCE_INTRINSICS
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"1:	ldaxr %w[tmp], %w[locked]\n"
+		"cbnz   %w[tmp], 2f\n"
+		"stxr   %w[tmp], %w[one], %w[locked]\n"
+		"cbnz   %w[tmp], 1b\n"
+		"ret\n"
+		"2:	sevl\n"
+		"wfe\n"
+		"jmp	1b\n"
+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
+		: [one] "r" (1)
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
-- 
2.7.4
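
Note that jmp is not an A64 mnemonic and the asm statement in the hunk
above is never closed, so the hunk as posted is unlikely to build. The
following is a hedged sketch of what a WFE-based lock loop could look
like, using the common SEVL/WFE/LDAXR/STXR pattern. It is an assumption
about the intent, not the author's code: the sketch_* names are
hypothetical, and non-aarch64 builds take a plain spinning fallback.

```c
#include <assert.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
typedef struct {
	volatile int locked;
} sketch_spinlock_t;

static inline void
sketch_spinlock_lock(sketch_spinlock_t *sl)
{
#if defined(__aarch64__)
	unsigned int tmp;

	__asm__ volatile(
		"sevl\n"			/* let the first wfe fall through */
		"1:	wfe\n"			/* wait for an event */
		"2:	ldaxr	%w[tmp], %[locked]\n"	/* load-acquire, arm monitor */
		"cbnz	%w[tmp], 1b\n"		/* still held: wait again */
		"stxr	%w[tmp], %w[one], %[locked]\n"	/* try to take the lock */
		"cbnz	%w[tmp], 2b\n"		/* exclusive store failed: retry */
		: [tmp] "=&r" (tmp), [locked] "+Q" (sl->locked)
		: [one] "r" (1)
		: "cc", "memory");
#else
	/* Portable fallback: test-and-test-and-set with a plain spin. */
	while (__atomic_exchange_n(&sl->locked, 1, __ATOMIC_ACQUIRE) != 0)
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED) != 0)
			;
#endif
}

static inline void
sketch_spinlock_unlock(sketch_spinlock_t *sl)
{
	/* A store-release to the lock word is what wakes WFE waiters. */
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}

/* Single-threaded walk-through of one lock/unlock round. */
static inline void sketch_spinlock_demo(void)
{
	sketch_spinlock_t sl = { 0 };

	sketch_spinlock_lock(&sl);
	assert(sl.locked == 1);
	sketch_spinlock_unlock(&sl);
	assert(sl.locked == 0);
}
```

Unlike the posted hunk, this form needs no ret or backward jump out of
the asm statement: control simply falls out of the loop once the
exclusive store succeeds.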


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v7 4/7] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (45 preceding siblings ...)
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 3/7] spinlock: use wfe to reduce contention on aarch64 Gavin Hu
@ 2019-09-27  5:41 ` " Gavin Hu
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 5/7] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (34 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-27  5:41 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

While taking a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1]
showed variance between runs, but no notable performance gain or
degradation with or without this patch.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/\
docs/440/neoverse-n1-sdp

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..232bbe9 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_acquire_16(&tl->s.current, me);
 }
 
 /**
-- 
2.7.4
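
As a rough, self-contained illustration of the change above, a ticket
lock built on an acquire "wait until equal" helper might look like the
sketch below. The sketch_* types and functions are hypothetical
stand-ins for rte_ticketlock_t and the new API, and the helper is the
plain polling fallback rather than the WFE variant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ticket lock: 'next' hands out tickets, 'current' is served. */
typedef struct {
	uint16_t current;
	uint16_t next;
} sketch_ticketlock_t;

/* Polling fallback with acquire ordering on the load that succeeds. */
static inline void
sketch_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		__asm__ volatile("" ::: "memory");
}

static inline void
sketch_ticketlock_lock(sketch_ticketlock_t *tl)
{
	/* Take a ticket, then wait until that ticket is being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	sketch_wait_until_equal_acquire_16(&tl->current, me);
}

static inline void
sketch_ticketlock_unlock(sketch_ticketlock_t *tl)
{
	uint16_t i = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	/* Serve the next ticket; release pairs with the waiter's acquire. */
	__atomic_store_n(&tl->current, (uint16_t)(i + 1), __ATOMIC_RELEASE);
}

/* Single-threaded walk-through: two rounds hand out tickets 0 and 1. */
static inline void sketch_ticketlock_demo(void)
{
	sketch_ticketlock_t tl = { 0, 0 };

	sketch_ticketlock_lock(&tl);
	assert(tl.next == 1 && tl.current == 0);
	sketch_ticketlock_unlock(&tl);
	sketch_ticketlock_lock(&tl);
	assert(tl.next == 2 && tl.current == 1);
	sketch_ticketlock_unlock(&tl);
	assert(tl.current == 2);
}
```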


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v7 5/7] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (46 preceding siblings ...)
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 4/7] ticketlock: use new API " Gavin Hu
@ 2019-09-27  5:41 ` " Gavin Hu
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 6/7] net/thunderx: use new API to save cycles " Gavin Hu
                   ` (33 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-27  5:41 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

Instead of polling for the tail to be updated, use the WFE instruction.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 4 ++--
 lib/librte_ring/rte_ring_generic.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..764d8f1 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal_relaxed_32(&ht->tail, old_val);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..6828527 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal_relaxed_32(&ht->tail, old_val);
 
 	ht->tail = new_val;
 }
-- 
2.7.4
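
The ordering that update_tail() enforces can be sketched in isolation as
follows. This is an illustrative model only: struct sketch_headtail and
the sketch_* functions are hypothetical, simplified stand-ins for
struct rte_ring_headtail and the relaxed 32-bit wait API, shown here
with the polling fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct rte_ring_headtail. */
struct sketch_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

static inline void
sketch_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		__asm__ volatile("" ::: "memory");
}

/*
 * Publish [old_val, new_val) only after every earlier reservation has
 * been published, i.e. once tail has caught up with old_val.
 */
static inline void
sketch_update_tail(struct sketch_headtail *ht, uint32_t old_val,
	uint32_t new_val, int single)
{
	if (!single)
		sketch_wait_until_equal_relaxed_32(&ht->tail, old_val);

	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}

/* Single-threaded walk-through of two producers finishing in order. */
static inline void sketch_ring_demo(void)
{
	struct sketch_headtail prod = { 6, 0 };

	/* Producer A reserved [0, 4), producer B reserved [4, 6). */
	sketch_update_tail(&prod, 0, 4, 0);	/* A publishes first */
	assert(prod.tail == 4);
	sketch_update_tail(&prod, 4, 6, 0);	/* B's wait is now satisfied */
	assert(prod.tail == 6);
}
```

The relaxed wait is enough here because the release store that follows
is what publishes the new tail to the other side of the ring.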


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v7 6/7] net/thunderx: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (47 preceding siblings ...)
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 5/7] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-09-27  5:41 ` " Gavin Hu
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 7/7] event/opdl: " Gavin Hu
                   ` (32 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-27  5:41 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/net/thunderx/nicvf_rxtx.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
index 1c42874..16f34c6 100644
--- a/drivers/net/thunderx/nicvf_rxtx.c
+++ b/drivers/net/thunderx/nicvf_rxtx.c
@@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
 		ltail++;
 	}
 
-	while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
-		rte_pause();
+	rte_wait_until_equal_relaxed_32(&rbdr->tail, next_tail);
 
 	__atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
 	nicvf_addr_write(door, to_fill);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v7 7/7] event/opdl: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (48 preceding siblings ...)
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 6/7] net/thunderx: use new API to save cycles " Gavin Hu
@ 2019-09-27  5:41 ` " Gavin Hu
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 0/6] use WFE for aarch64 Gavin Hu
                   ` (31 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-09-27  5:41 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/event/opdl/opdl_ring.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
index e8b29e2..f446fa3 100644
--- a/drivers/event/opdl/opdl_ring.c
+++ b/drivers/event/opdl/opdl_ring.c
@@ -17,6 +17,7 @@
 #include <rte_memory.h>
 #include <rte_memzone.h>
 #include <rte_eal_memconfig.h>
+#include <rte_atomic.h>
 
 #include "opdl_ring.h"
 #include "opdl_log.h"
@@ -475,9 +476,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
 	/* If another thread started inputting before this one, but hasn't
 	 * finished, we need to wait for it to complete to update the tail.
 	 */
-	while (unlikely(__atomic_load_n(&s->shared.tail, __ATOMIC_ACQUIRE) !=
-			old_head))
-		rte_pause();
+	rte_wait_until_equal_acquire_32(&s->shared.tail, old_head);
 
 	__atomic_store_n(&s->shared.tail, old_head + num_entries,
 			__ATOMIC_RELEASE);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v6 0/7] use WFE for aarch64
  2019-09-26 13:41   ` Jerin Jacob
@ 2019-09-27  5:45     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-09-27  5:45 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, nd, thomas, Stephen Hemminger, hemant.agrawal, jerinj,
	Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd

> 
> There is checkpatch failure.
> ### eal: add the APIs to wait until equal
> 
> WARNING:LONG_LINE_COMMENT: line over 80 characters
> #123: FILE: lib/librte_eal/common/include/generic/rte_pause.h:29:
> + * Wait for *addr to be updated with a 16-bit expected value, with a
> relaxed memory
> 
> With checkpatch fixes:
> 
> Acked-by: Jerin Jacob <jerinj@marvell.com>
Thanks, Jerin, for the review. Sorry for this oversight; it is fixed in the new version (v7, already posted).

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/7] bus/fslmc: fix the conflicting dmb function
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 1/7] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-09-27  8:24   ` Hemant Agrawal
  2019-10-17 15:06   ` David Marchand
  1 sibling, 0 replies; 163+ messages in thread
From: Hemant Agrawal @ 2019-09-27  8:24 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper, stable

Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal Gavin Hu
@ 2019-09-27 11:03   ` Jerin Jacob
  2019-10-17 13:14   ` Ananyev, Konstantin
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 163+ messages in thread
From: Jerin Jacob @ 2019-09-27 11:03 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dpdk-dev, nd, Thomas Monjalon, Stephen Hemminger, Hemant Agrawal,
	Jerin Jacob, Pavan Nikhilesh, Honnappa.Nagarahalli, ruifeng.wang,
	phil.yang, steve.capper

On Fri, Sep 27, 2019 at 11:12 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> The rte_wait_until_equal_xx APIs abstract the functionality of
> 'polling for a memory location to become equal to a given value'.
>
> Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> by default. When it is enabled, the above APIs will call WFE instruction
> to save CPU cycles and power.
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>

Acked-by: Jerin Jacob <jerinj@marvell.com>

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v4 0/6] use WFE for locks and ring on aarch64
  2019-08-22  6:12 ` [dpdk-dev] [PATCH v4 0/6] use WFE for locks and ring on aarch64 Gavin Hu
@ 2019-10-16  8:08   ` David Marchand
  2019-10-24 20:26     ` David Christensen
  0 siblings, 1 reply; 163+ messages in thread
From: David Marchand @ 2019-10-16  8:08 UTC (permalink / raw)
  To: Bruce Richardson, Ananyev, Konstantin, David Christensen
  Cc: dev, Gavin Hu, nd, Thomas Monjalon, Stephen Hemminger,
	Jerin Jacob Kollanukkaran, Pavan Nikhilesh, Honnappa Nagarahalli

Hello guys,

This series got a lot of attention from ARM people and it seems ready
for integration.
But I did not see any comments from maintainers of other architectures;
could you have a look, please?


Thanks.
-- 
David Marchand


On Thu, Aug 22, 2019 at 8:13 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> DPDK has multiple use cases where the core repeatedly polls a location in
> memory. This polling results in many cache and memory transactions.
>
> Arm architecture provides the WFE (Wait For Event) instruction, which allows
> the cpu core to enter a low power state until woken up by the update to the
> memory location being polled, thus reducing the cache and memory
> transactions.
>
> x86 has the PAUSE hint instruction to reduce such overhead.
>
> The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
> for a memory location to become equal to a given value'.
>
> For non-Arm platforms, these APIs are just wrappers around a do-while loop
> with rte_pause, so there are no performance differences.
>
> For Arm platforms, use of WFE can be configured using the
> CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.
>
> Currently, use of WFE is supported only for aarch64 platforms. armv7
> platforms do support the WFE instruction, but they require explicit wake-up
> events (sev) and are less performant.
>
> Testing shows that performance varies across different platforms, with
> some showing degradation.
>
> CONFIG_RTE_ARM_USE_WFE should be enabled depending on the performance
> benchmarking on the target platforms. Power saving should be a bonus,
> but currently we don't have ways to characterize that.
>
> V4:
> - rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
> - introduce a macro for the assembly skeleton to reduce the duplication of code
> - add one patch for nxp fslmc to address a compiling error
> V3:
> - Convert RFCs to patches
> V2:
> - Use inline functions instead of macros
> - Add load and compare in the beginning of the APIs
> - Fix some style errors in asm inline
> V1:
> - Add the new APIs and use it for ring and locks
>
> Gavin Hu (6):
>   bus/fslmc: fix the conflicting dmb function
>   eal: add the APIs to wait until equal
>   ticketlock: use new API to reduce contention on aarch64
>   ring: use wfe to wait for ring tail update on aarch64
>   spinlock: use wfe to reduce contention on aarch64
>   config: add WFE config entry for aarch64
>
>  config/arm/meson.build                             |  1 +
>  config/common_base                                 |  6 +++++
>  drivers/bus/fslmc/mc/fsl_mc_sys.h                  | 10 +++++---
>  drivers/bus/fslmc/mc/mc_sys.c                      |  3 +--
>  .../common/include/arch/arm/rte_pause_64.h         | 30 ++++++++++++++++++++++
>  .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  | 26 ++++++++++++++++++-
>  .../common/include/generic/rte_ticketlock.h        |  3 +--
>  lib/librte_ring/rte_ring_c11_mem.h                 |  4 +--
>  lib/librte_ring/rte_ring_generic.h                 |  3 +--
>  10 files changed, 99 insertions(+), 12 deletions(-)
>
> --
> 2.7.4
>


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal Gavin Hu
  2019-09-27 11:03   ` Jerin Jacob
@ 2019-10-17 13:14   ` Ananyev, Konstantin
  2019-10-21  7:21     ` Gavin Hu (Arm Technology China)
  2019-10-17 15:45   ` David Marchand
  2019-10-17 16:44   ` Ananyev, Konstantin
  3 siblings, 1 reply; 163+ messages in thread
From: Ananyev, Konstantin @ 2019-10-17 13:14 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper

Hi Gavin,

> 
> The rte_wait_until_equal_xx APIs abstract the functionality of
> 'polling for a memory location to become equal to a given value'.
> 
> Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> by default. When it is enabled, the above APIs will call WFE instruction
> to save CPU cycles and power.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  config/arm/meson.build                             |   1 +
>  config/common_base                                 |   5 +
>  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  | 106 +++++++++++++++++++++
>  4 files changed, 142 insertions(+)
> 
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..b4b4cac 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -26,6 +26,7 @@ flags_common_default = [
>  	['RTE_LIBRTE_AVP_PMD', false],
> 
>  	['RTE_SCHED_VECTOR', false],
> +	['RTE_ARM_USE_WFE', false],
>  ]
> 
>  flags_generic = [
> diff --git a/config/common_base b/config/common_base
> index 8ef75c2..8861713 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
>  CONFIG_RTE_MALLOC_DEBUG=n
>  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
>  CONFIG_RTE_USE_LIBBSD=n
> +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
> +# calling these APIs puts the cores in low power state while waiting
> +# for the memory address to become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_ARM_USE_WFE=n
> 
>  #
>  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..dabde17 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
> 
>  #ifndef _RTE_PAUSE_ARM64_H_
> @@ -17,6 +18,35 @@ static inline void rte_pause(void)
>  	asm volatile("yield" ::: "memory");
>  }
> 
> +#ifdef RTE_ARM_USE_WFE
> +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> +static __rte_always_inline void \
> +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> +{ \
> +	type tmp; \
> +	asm volatile( \
> +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> +		"b.eq	2f\n" \
> +		"sevl\n" \
> +		"1:	wfe\n" \
> +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> +		"bne	1b\n" \
> +		"2:\n" \
> +		: [tmp] "=&r" (tmp) \
> +		: [addr] "Q"(*addr), [expected] "r"(expected) \
> +		: "cc", "memory"); \
> +}
> +/* Wait for *addr to be updated with expected value */
> +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> +#endif
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> index 52bd4db..8906473 100644
> --- a/lib/librte_eal/common/include/generic/rte_pause.h
> +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
> 
>  #ifndef _RTE_PAUSE_H_
> @@ -12,6 +13,10 @@
>   *
>   */
> 
> +#include <stdint.h>
> +#include <rte_common.h>
> +#include <rte_atomic.h>
> +
>  /**
>   * Pause CPU execution for a short while
>   *
> @@ -20,4 +25,105 @@
>   */
>  static inline void rte_pause(void);
> 
> +/**
> + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 16-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected);
> +
> +/**
> + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 32-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected);
> +
> +/**
> + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 64-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t expected);
> +
> +/**
> + * Wait for *addr to be updated with a 16-bit expected value, with an acquire
> + * memory ordering model meaning the loads after this API can't be observed
> + * before this API.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 16-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t expected);
> +
> +/**
> + * Wait for *addr to be updated with a 32-bit expected value, with an acquire
> + * memory ordering model meaning the loads after this API can't be observed
> + * before this API.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 32-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected);

LGTM in general.
One stylistic thing: wouldn't it be better to have an API like this:
rte_wait_until_equal_acquire_X(addr, expected, memory_order)
?

I.e., pass memorder as a parameter rather than incorporating it into the
function name? Fewer functions, plus the user can specify the order himself.
Plus it looks similar to the C11 atomic intrinsics.
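
A minimal sketch of what such a signature could look like, for the
32-bit case only. This is an assumption made for illustration, not code
from the series: sketch_wait_until_equal_32 is a hypothetical name, and
restricting memorder to acquire or relaxed is a choice made here
because those are the two orderings the patch provides.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One function per width; the memory order is an argument, in the style
 * of the __atomic builtins. Hypothetical name and contract.
 */
static inline void
sketch_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
	int memorder)
{
	/* Only load orderings make sense for a polling read. */
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);

	while (__atomic_load_n(addr, memorder) != expected)
		__asm__ volatile("" ::: "memory");
}
```

A caller would then write, for example,
sketch_wait_until_equal_32(&tail, old_val, __ATOMIC_RELAXED) where the
patch uses the _relaxed_32 variant.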


> +
> +/**
> + * Wait for *addr to be updated with a 64-bit expected value, with an acquire
> + * memory ordering model meaning the loads after this API can't be observed
> + * before this API.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 64-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t expected);
> +
> +#if !defined(RTE_ARM_USE_WFE)
> +#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
> +__rte_always_inline \
> +static void	\
> +rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
> +	type expected) \
> +{ \
> +	while (__atomic_load_n(addr, memorder) != expected) \
> +		rte_pause(); \
> +}
> +
> +/* Wait for *addr to be updated with expected value */
> +__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
> +__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
> +__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
> +__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
> +__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
> +__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
> +#endif /* RTE_ARM_USE_WFE */
> +
>  #endif /* _RTE_PAUSE_H_ */
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/7] bus/fslmc: fix the conflicting dmb function
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 1/7] bus/fslmc: fix the conflicting dmb function Gavin Hu
  2019-09-27  8:24   ` Hemant Agrawal
@ 2019-10-17 15:06   ` David Marchand
  1 sibling, 0 replies; 163+ messages in thread
From: David Marchand @ 2019-10-17 15:06 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, Stephen Hemminger, Hemant Agrawal,
	Jerin Jacob Kollanukkaran, Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang, Steve Capper, dpdk stable

On Fri, Sep 27, 2019 at 7:42 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> There are two definitions conflicting with each other; for more
> details, refer to [1].
>
> include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
> drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
> previous definition
>  #define dmb() {__asm__ __volatile__("" : : : "memory"); }

Interesting, this means that we have a leftover macro from the arm
header that can pollute the namespace.

Anyway, this driver is not using the memory barrier api from the EAL.

How about simply fixing it with:
#define __iormb()      rte_io_rmb()
#define __iowmb()      rte_io_wmb()

>
> The fix is to include the spinlock.h file before the other header files;
> this is in line with the coding style[2] on "header includes".
> The fix also changes the function to take the argument, so that it is
> meaningful on arm.
>
> [1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@i
> VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u

This url is broken, there is a trailing i on the first line.

> [2] https://doc.dpdk.org/guides/contributing/coding_style.html
>
> Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>



-- 
David Marchand


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal Gavin Hu
  2019-09-27 11:03   ` Jerin Jacob
  2019-10-17 13:14   ` Ananyev, Konstantin
@ 2019-10-17 15:45   ` David Marchand
  2019-10-21  7:38     ` Gavin Hu (Arm Technology China)
  2019-10-17 16:44   ` Ananyev, Konstantin
  3 siblings, 1 reply; 163+ messages in thread
From: David Marchand @ 2019-10-17 15:45 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, Stephen Hemminger, Hemant Agrawal,
	Jerin Jacob Kollanukkaran, Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang, Steve Capper

On Fri, Sep 27, 2019 at 7:42 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> The rte_wait_until_equal_xx APIs abstract the functionality of
> 'polling for a memory location to become equal to a given value'.
>
> Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> by default. When it is enabled, the above APIs will call WFE instruction
> to save CPU cycles and power.

- As discussed on irc, I would prefer we have two stages for this:
* first stage, an architecture announces it has its own implementation
of this api, in such a case it defines RTE_ARCH_HAS_WFE in its
arch/xxx/rte_pause.h header before including generic/rte_pause.h
  The default implementation with C11 is then skipped in the generic header.
* second stage, in the arm64 header, if RTE_ARM_USE_WFE is set in the
configuration, then define RTE_ARCH_HAS_WFE

- Can you add a little description on the limitation of using WFE
instruction in the commit log?

- This is a new api, should be marked experimental, even if inlined.
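
The two-stage selection described in the first point could be sketched
in a single file as below. This is only an illustration of the
preprocessor flow: SKETCH_ARM_USE_WFE and SKETCH_ARCH_HAS_WFE stand in
for RTE_ARM_USE_WFE and RTE_ARCH_HAS_WFE, the sketch_* functions are
hypothetical, and the "arch" body is a placeholder loop rather than a
real WFE sequence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Second stage (arm64 arch header): the build option opts in to the
 * arch-specific implementation.
 */
#ifdef SKETCH_ARM_USE_WFE
#define SKETCH_ARCH_HAS_WFE
#endif

/*
 * First stage (any arch header): an architecture that announced its own
 * implementation provides it before the generic header is included.
 */
#ifdef SKETCH_ARCH_HAS_WFE
static inline const char *sketch_impl_name(void) { return "arch"; }

static inline void
sketch_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected)
{
	/* An arch-specific wait (e.g. WFE-based) would replace this loop. */
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		;
}
#endif

/* Generic header: the default is skipped when an arch opted in. */
#ifndef SKETCH_ARCH_HAS_WFE
static inline const char *sketch_impl_name(void) { return "generic"; }

static inline void
sketch_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		__asm__ volatile("" ::: "memory");
}
#endif

/* Exactly one implementation is visible, whichever way the build is set. */
static inline void sketch_two_stage_demo(void)
{
	volatile uint32_t v = 5;

	sketch_wait_until_equal_relaxed_32(&v, 5);
#ifdef SKETCH_ARM_USE_WFE
	assert(strcmp(sketch_impl_name(), "arch") == 0);
#else
	assert(strcmp(sketch_impl_name(), "generic") == 0);
#endif
}
```

Building with -DSKETCH_ARM_USE_WFE selects the "arch" branch; without
it, only the generic definition is compiled.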

Small comments inline.


>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  config/arm/meson.build                             |   1 +
>  config/common_base                                 |   5 +
>  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  | 106 +++++++++++++++++++++
>  4 files changed, 142 insertions(+)
>
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..b4b4cac 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -26,6 +26,7 @@ flags_common_default = [
>         ['RTE_LIBRTE_AVP_PMD', false],
>
>         ['RTE_SCHED_VECTOR', false],
> +       ['RTE_ARM_USE_WFE', false],
>  ]
>
>  flags_generic = [
> diff --git a/config/common_base b/config/common_base
> index 8ef75c2..8861713 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
>  CONFIG_RTE_MALLOC_DEBUG=n
>  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
>  CONFIG_RTE_USE_LIBBSD=n
> +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
> +# calling these APIs puts the cores in low power state while waiting
> +# for the memory address to become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_ARM_USE_WFE=n
>
>  #
>  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..dabde17 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
>
>  #ifndef _RTE_PAUSE_ARM64_H_
> @@ -17,6 +18,35 @@ static inline void rte_pause(void)
>         asm volatile("yield" ::: "memory");
>  }
>
> +#ifdef RTE_ARM_USE_WFE
> +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> +static __rte_always_inline void \
> +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> +{ \
> +       type tmp; \
> +       asm volatile( \
> +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> +               "b.eq   2f\n" \
> +               "sevl\n" \
> +               "1:     wfe\n" \
> +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> +               "bne    1b\n" \
> +               "2:\n" \
> +               : [tmp] "=&r" (tmp) \
> +               : [addr] "Q"(*addr), [expected] "r"(expected) \
> +               : "cc", "memory"); \
> +}
> +/* Wait for *addr to be updated with expected value */
> +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)

Missing #undef __WAIT_UNTIL_EQUAL
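
For illustration, a minimal sketch of why the #undef matters: both
rte_pause_64.h and the generic rte_pause.h define a helper macro named
__WAIT_UNTIL_EQUAL, each with a different parameter list, so a definition
left behind by one header can clash with the other or leak into
applications. The names below (GEN_WAIT, wait_until_equal_u16/u32,
gen_wait_macro_leaked) are hypothetical stand-ins, not DPDK API.

```c
#include <stdint.h>

/* Hypothetical generator macro, standing in for __WAIT_UNTIL_EQUAL. */
#define GEN_WAIT(size, type) \
static inline void \
wait_until_equal_u##size(volatile type *addr, type expected) \
{ \
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected) \
		; /* busy-wait; a real implementation would pause here */ \
}

GEN_WAIT(16, uint16_t)
GEN_WAIT(32, uint32_t)

/* Drop the helper once the functions are generated, so that another
 * header (or the application) is free to reuse the name.
 */
#undef GEN_WAIT

/* Returns 1 if the helper macro is still visible after this point. */
static inline int
gen_wait_macro_leaked(void)
{
#ifdef GEN_WAIT
	return 1;
#else
	return 0;
#endif
}
```

With the #undef in place the generated functions remain usable, while the
generator name itself is no longer defined.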

> +#endif
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> index 52bd4db..8906473 100644
> --- a/lib/librte_eal/common/include/generic/rte_pause.h
> +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
>
>  #ifndef _RTE_PAUSE_H_
> @@ -12,6 +13,10 @@
>   *
>   */
>
> +#include <stdint.h>
> +#include <rte_common.h>
> +#include <rte_atomic.h>
> +
>  /**
>   * Pause CPU execution for a short while
>   *
> @@ -20,4 +25,105 @@
>   */
>  static inline void rte_pause(void);
>
> +/**

Missing warning on experimental api.

> + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 16-bit expected value to be in the memory location.
> + */

Missing experimental tag.
Saying this only once, please update declarations below, plus doxygen header.
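
As a sketch of what is being asked for, assuming the usual DPDK convention
of an @warning line in the doxygen block plus the __rte_experimental tag on
the declaration. The two macro definitions at the top are stand-ins so the
snippet compiles outside a DPDK tree (in DPDK they come from rte_compat.h
and rte_common.h), and the function body is only the generic fallback.

```c
#include <stdint.h>

/* Stand-ins for building outside DPDK; not the real definitions. */
#ifndef __rte_experimental
#define __rte_experimental
#endif
#ifndef __rte_always_inline
#define __rte_always_inline inline __attribute__((always_inline))
#endif

/**
 * @warning
 * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
 *
 * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
 * memory ordering model meaning the loads around this API can be reordered.
 *
 * @param addr
 *  A pointer to the memory location.
 * @param expected
 *  A 16-bit expected value to be in the memory location.
 */
__rte_experimental
static __rte_always_inline void
rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected);

/* Generic fallback body, polling with a relaxed load. */
static __rte_always_inline void
rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		;
}
```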


> +__rte_always_inline
> +static void
> +rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected);
> +
> +/**
> + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 32-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected);
> +
> +/**
> + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 64-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t expected);
> +
> +/**
> + * Wait for *addr to be updated with a 16-bit expected value, with an acquire
> + * memory ordering model meaning the loads after this API can't be observed
> + * before this API.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 16-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t expected);
> +
> +/**
> + * Wait for *addr to be updated with a 32-bit expected value, with an acquire
> + * memory ordering model meaning the loads after this API can't be observed
> + * before this API.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 32-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected);
> +
> +/**
> + * Wait for *addr to be updated with a 64-bit expected value, with an acquire
> + * memory ordering model meaning the loads after this API can't be observed
> + * before this API.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 64-bit expected value to be in the memory location.
> + */
> +__rte_always_inline
> +static void
> +rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t expected);
> +
> +#if !defined(RTE_ARM_USE_WFE)
> +#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
> +__rte_always_inline \
> +static void    \
> +rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
> +       type expected) \
> +{ \
> +       while (__atomic_load_n(addr, memorder) != expected) \
> +               rte_pause(); \
> +}
> +
> +/* Wait for *addr to be updated with expected value */
> +__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
> +__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
> +__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
> +__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
> +__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
> +__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)

#undef __WAIT_UNTIL_EQUAL

> +#endif /* RTE_ARM_USE_WFE */
> +
>  #endif /* _RTE_PAUSE_H_ */
> --
> 2.7.4
>


-- 
David Marchand



* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal Gavin Hu
                     ` (2 preceding siblings ...)
  2019-10-17 15:45   ` David Marchand
@ 2019-10-17 16:44   ` Ananyev, Konstantin
  2019-10-23 16:20     ` Gavin Hu (Arm Technology China)
  3 siblings, 1 reply; 163+ messages in thread
From: Ananyev, Konstantin @ 2019-10-17 16:44 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, steve.capper


> 
> Hi Gavin,
> 
> >
> > The rte_wait_until_equal_xx APIs abstract the functionality of
> > 'polling for a memory location to become equal to a given value'.
> >
> > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > by default. When it is enabled, the above APIs will call WFE instruction
> > to save CPU cycles and power.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  config/arm/meson.build                             |   1 +
> >  config/common_base                                 |   5 +
> >  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  | 106 +++++++++++++++++++++
> >  4 files changed, 142 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 979018e..b4b4cac 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -26,6 +26,7 @@ flags_common_default = [
> >  	['RTE_LIBRTE_AVP_PMD', false],
> >
> >  	['RTE_SCHED_VECTOR', false],
> > +	['RTE_ARM_USE_WFE', false],
> >  ]
> >
> >  flags_generic = [
> > diff --git a/config/common_base b/config/common_base
> > index 8ef75c2..8861713 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> >  CONFIG_RTE_MALLOC_DEBUG=n
> >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> >  CONFIG_RTE_USE_LIBBSD=n
> > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > +# calling these APIs put the cores in low power state while waiting
> > +# for the memory address to become equal to the expected value.
> > +# This is supported only by aarch64.
> > +CONFIG_RTE_ARM_USE_WFE=n
> >
> >  #
> >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..dabde17 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_ARM64_H_
> > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> >  	asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_ARM_USE_WFE
> > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > +static __rte_always_inline void \
> > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > +{ \
> > +	type tmp; \
> > +	asm volatile( \
> > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > +		"b.eq	2f\n" \
> > +		"sevl\n" \
> > +		"1:	wfe\n" \
> > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > +		"bne	1b\n" \
> > +		"2:\n" \
> > +		: [tmp] "=&r" (tmp) \
> > +		: [addr] "Q"(*addr), [expected] "r"(expected) \
> > +		: "cc", "memory"); \
> > +}

One more thought:
Why do you need to write asm code for the whole procedure?
Why not do it like the Linux kernel does:
define wfe() and sev() macros and use them inside normal C code?
 
#define sev()		asm volatile("sev" : : : "memory")
#define wfe()		asm volatile("wfe" : : : "memory")

Then:
static inline void
rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
     if (__atomic_load_n(addr, memorder) != expected) {
         sev();
         do {
             wfe();
         } while (__atomic_load_n(addr, memorder) != expected);
     }
}

?
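
A minimal sketch of the suggested shape, limited to the portable fallback
path and taking the memory order as a parameter. The names
wait_until_equal_32 and cpu_relax are hypothetical; in DPDK the pause hint
would be rte_pause(). One caveat worth checking against the Arm
architecture manual: on aarch64 a store by another core generates a WFE
wake-up event only when the waiting core has armed its exclusive monitor
with a load-exclusive (ldxr/ldaxr), which is presumably why the patch keeps
the load inside the asm block. A plain __atomic_load_n() typically compiles
to ldr/ldar and does not arm the monitor, so the sketch below does not use
wfe at all.

```c
#include <stdint.h>

/* Hypothetical pause hint; DPDK would call rte_pause() here. */
static inline void
cpu_relax(void)
{
#if defined(__aarch64__)
	__asm__ __volatile__("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#else
	__asm__ __volatile__("" ::: "memory");	/* compiler barrier only */
#endif
}

/* Poll *addr until it equals expected, using the caller's memory order
 * (__ATOMIC_RELAXED or __ATOMIC_ACQUIRE are the expected choices).
 */
static inline void
wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		cpu_relax();
}
```

Passing the order as an argument keeps one function per width. When the
argument is a compile-time constant and the function is inlined, the
compiler can still select the matching load instruction.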

> > +/* Wait for *addr to be updated with expected value */
> > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> > +#endif
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..8906473 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_H_
> > @@ -12,6 +13,10 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +25,105 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +/**
> > + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 16-bit expected value, with an acquire
> > + * memory ordering model meaning the loads after this API can't be observed
> > + * before this API.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 32-bit expected value, with an acquire
> > + * memory ordering model meaning the loads after this API can't be observed
> > + * before this API.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t expected);
> 
> LGTM in general.
> One stylistic thing: wouldn't it be better to have an API like that:
> rte_wait_until_equal_acquire_X(addr, expected, memory_order)
> ?
> 
> I.e. pass memorder as a parameter rather than incorporating it into the
> function name?
> Fewer functions, plus the user can specify the order himself.
> Plus it looks similar to the C11 atomic intrinsics.
> 
> 
> > +
> > +/**
> > + * Wait for *addr to be updated with a 64-bit expected value, with an acquire
> > + * memory ordering model meaning the loads after this API can't be observed
> > + * before this API.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t expected);
> > +
> > +#if !defined(RTE_ARM_USE_WFE)
> > +#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
> > +__rte_always_inline \
> > +static void	\
> > +rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
> > +	type expected) \
> > +{ \
> > +	while (__atomic_load_n(addr, memorder) != expected) \
> > +		rte_pause(); \
> > +}
> > +
> > +/* Wait for *addr to be updated with expected value */
> > +__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
> > +__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
> > +__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
> > +#endif /* RTE_ARM_USE_WFE */
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v7 3/7] spinlock: use wfe to reduce contention on aarch64
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 3/7] spinlock: use wfe to reduce contention on aarch64 Gavin Hu
@ 2019-10-17 18:27   ` David Marchand
  2019-10-18  5:45     ` Gavin Hu (Arm Technology China)
  2019-10-21  7:27     ` Gavin Hu (Arm Technology China)
  0 siblings, 2 replies; 163+ messages in thread
From: David Marchand @ 2019-10-17 18:27 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, Stephen Hemminger, Hemant Agrawal,
	Jerin Jacob Kollanukkaran, Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang, Steve Capper

On Fri, Sep 27, 2019 at 7:43 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> In acquiring a spinlock, cores repeatedly poll the lock variable.
> This is replaced by rte_wait_until_equal API.
>
> Running the micro benchmarking and the testpmd and l3fwd traffic tests
> on ThunderX2, Ampere eMAG80 and Arm N1SDP, everything went well and no
> notable performance gain nor degradation was measured.
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  .../common/include/arch/arm/rte_spinlock.h         | 26 ++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> index 1a6916b..b61c055 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> @@ -16,6 +16,32 @@ extern "C" {
>  #include <rte_common.h>
>  #include "generic/rte_spinlock.h"
>
> +/* armv7a does support WFE, but an explicit wake-up signal using SEV is
> + * required (must be preceded by DSB to drain the store buffer) and
> + * this is less performant, so keep armv7a implementation unchanged.
> + */
> +#ifndef RTE_FORCE_INTRINSICS

Earlier, in the same file, I can see:
https://git.dpdk.org/dpdk/tree/lib/librte_eal/common/include/arch/arm/rte_spinlock.h?h=v19.08#n8

#ifndef RTE_FORCE_INTRINSICS
#  error Platform must be built with CONFIG_RTE_FORCE_INTRINSICS
#endif

IIUC, this is dead code.
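
A small sketch of why the new block can never be compiled, under the
assumption stated above that the header starts with the #error guard. The
#define on the first line stands in for the build configuration, and
ARCH_SPINLOCK_COMPILED / arch_spinlock_compiled are hypothetical markers
used only to make the outcome observable.

```c
/* Stand-in for the build configuration, which always sets this on Arm. */
#define RTE_FORCE_INTRINSICS 1

/* Guard at the top of the header, as in v19.08. */
#ifndef RTE_FORCE_INTRINSICS
#  error Platform must be built with CONFIG_RTE_FORCE_INTRINSICS
#endif

/* A later block under the same negative condition is unreachable:
 * either the #error above stops the build, or the macro is defined
 * and the preprocessor skips the block.
 */
#ifndef RTE_FORCE_INTRINSICS
#define ARCH_SPINLOCK_COMPILED 1	/* hypothetical marker, never taken */
#else
#define ARCH_SPINLOCK_COMPILED 0
#endif

static inline int
arch_spinlock_compiled(void)
{
	return ARCH_SPINLOCK_COMPILED;
}
```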

> +static inline void
> +rte_spinlock_lock(rte_spinlock_t *sl)
> +{
> +       unsigned int tmp;
> +       /* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
> +        * faqs/ka16809.html
> +        */
> +       asm volatile(
> +               "1:     ldaxr %w[tmp], %w[locked]\n"
> +               "cbnz   %w[tmp], 2f\n"
> +               "stxr   %w[tmp], %w[one], %w[locked]\n"
> +               "cbnz   %w[tmp], 1b\n"
> +               "ret\n"
> +               "2:     sevl\n"
> +               "wfe\n"
> +               "jmp    1b\n"
> +               : [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
> +               : [one] "r" (1)
> +}
> +#endif
> +
>  static inline int rte_tm_supported(void)
>  {
>         return 0;
> --
> 2.7.4
>


-- 
David Marchand



* Re: [dpdk-dev] [PATCH v7 0/7] use WFE for aarch64
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 0/7] use WFE for aarch64 Gavin Hu
@ 2019-10-17 18:37   ` David Marchand
  0 siblings, 0 replies; 163+ messages in thread
From: David Marchand @ 2019-10-17 18:37 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Thomas Monjalon, Stephen Hemminger, Hemant Agrawal,
	Jerin Jacob Kollanukkaran, Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang, Steve Capper

On Fri, Sep 27, 2019 at 7:42 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> V7:
> - fix the checkpatch LONG_LINE_COMMENT issue
> V6:
> - squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
> - move the new configuration to the end of EAL
> - add doxygen comments to reflect the relaxed and acquire semantics
> - correct the meson configuration
> V5:
> - add doxygen comments for the new APIs
> - spinlock early exit without wfe if the spinlock is not taken by others.
> - add two patches on top for opdl and thunderx
> V4:
> - rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
> - introduce a macro for the assembly skeleton to reduce the duplication of code
> - add one patch for nxp fslmc to address a compiling error
> V3:
> - Convert RFCs to patches
> V2:
> - Use inline functions instead of macros
> - Add load and compare in the beginning of the APIs
> - Fix some style errors in asm inline
> V1:
> - Add the new APIs and use it for ring and locks
>
> DPDK has multiple use cases where the core repeatedly polls a location in
> memory. This polling results in many cache and memory transactions.
>
> The Arm architecture provides the WFE (Wait For Event) instruction, which
> allows the CPU core to enter a low-power state until woken up by an update
> to the memory location being polled, thus reducing the cache and memory
> transactions.
>
> x86 has the PAUSE hint instruction to reduce such overhead.
>
> The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
> for a memory location to become equal to a given value'.
>
> For non-Arm platforms, these APIs are just wrappers around a do-while loop
> with rte_pause, so there are no performance differences.
>
> For Arm platforms, use of WFE can be configured using the
> CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.
>
> Currently, use of WFE is supported only for aarch64 platforms. armv7
> platforms do support the WFE instruction, but they require explicit wake-up
> events (sev) and are less performant.
>
> Testing shows that performance varies across different platforms, with
> some showing degradation.
>
> CONFIG_RTE_ARM_USE_WFE should be enabled depending on the performance
> benchmarking on the target platforms. Power saving should be a bonus,
> but currently we don't have a way to characterize that.
>
> Gavin Hu (7):
>   bus/fslmc: fix the conflicting dmb function
>   eal: add the APIs to wait until equal
>   spinlock: use wfe to reduce contention on aarch64

I sent comments on the first three patches; the rest looks good to me.

>   ticketlock: use new API to reduce contention on aarch64
>   ring: use wfe to wait for ring tail update on aarch64
>   net/thunderx: use new API to save cycles on aarch64
>   event/opdl: use new API to save cycles on aarch64


Thanks.

-- 
David Marchand



* Re: [dpdk-dev] [PATCH v7 3/7] spinlock: use wfe to reduce contention on aarch64
  2019-10-17 18:27   ` David Marchand
@ 2019-10-18  5:45     ` Gavin Hu (Arm Technology China)
  2019-10-21  7:27     ` Gavin Hu (Arm Technology China)
  1 sibling, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-18  5:45 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, nd, thomas, Stephen Hemminger, hemant.agrawal, jerinj,
	Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd

Hi David,

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Friday, October 18, 2019 2:28 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev <dev@dpdk.org>; nd <nd@arm.com>; thomas@monjalon.net;
> Stephen Hemminger <stephen@networkplumber.org>;
> hemant.agrawal@nxp.com; jerinj@marvell.com; Pavan Nikhilesh
> <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v7 3/7] spinlock: use wfe to reduce
> contention on aarch64
> 
> On Fri, Sep 27, 2019 at 7:43 AM Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > In acquiring a spinlock, cores repeatedly poll the lock variable.
> > This is replaced by rte_wait_until_equal API.
> >
> > Running the micro benchmarking and the testpmd and l3fwd traffic tests
> > on ThunderX2, Ampere eMAG80 and Arm N1SDP, everything went well
> and no
> > notable performance gain nor degradation was measured.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  .../common/include/arch/arm/rte_spinlock.h         | 26
> ++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> > index 1a6916b..b61c055 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> > @@ -16,6 +16,32 @@ extern "C" {
> >  #include <rte_common.h>
> >  #include "generic/rte_spinlock.h"
> >
> > +/* armv7a does support WFE, but an explicit wake-up signal using SEV is
> > + * required (must be preceded by DSB to drain the store buffer) and
> > + * this is less performant, so keep armv7a implementation unchanged.
> > + */
> > +#ifndef RTE_FORCE_INTRINSICS
> 
> Earlier, in the same file, I can see:
> https://git.dpdk.org/dpdk/tree/lib/librte_eal/common/include/arch/arm/rte
> _spinlock.h?h=v19.08#n8
> 
> #ifndef RTE_FORCE_INTRINSICS
> #  error Platform must be built with CONFIG_RTE_FORCE_INTRINSICS
> #endif
> 
> IIUC, this is dead code.
Yes, will remove in next version.

> 
> > +static inline void
> > +rte_spinlock_lock(rte_spinlock_t *sl)
> > +{
> > +       unsigned int tmp;
> > +       /* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
> > +        * faqs/ka16809.html
> > +        */
> > +       asm volatile(
> > +               "1:     ldaxr %w[tmp], %w[locked]\n"
> > +               "cbnz   %w[tmp], 2f\n"
> > +               "stxr   %w[tmp], %w[one], %w[locked]\n"
> > +               "cbnz   %w[tmp], 1b\n"
> > +               "ret\n"
> > +               "2:     sevl\n"
> > +               "wfe\n"
> > +               "jmp    1b\n"
> > +               : [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
> > +               : [one] "r" (1)
> > +}
> > +#endif
> > +
> >  static inline int rte_tm_supported(void)
> >  {
> >         return 0;
> > --
> > 2.7.4
> >
> 
> 
> --
> David Marchand



* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-10-17 13:14   ` Ananyev, Konstantin
@ 2019-10-21  7:21     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-21  7:21 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd


> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Thursday, October 17, 2019 9:15 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; hemant.agrawal@nxp.com;
> jerinj@marvell.com; pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
> 
> Hi Gavin,
> 
> >
> > The rte_wait_until_equal_xx APIs abstract the functionality of
> > 'polling for a memory location to become equal to a given value'.
> >
> > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > by default. When it is enabled, the above APIs will call WFE instruction
> > to save CPU cycles and power.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  config/arm/meson.build                             |   1 +
> >  config/common_base                                 |   5 +
> >  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  | 106
> +++++++++++++++++++++
> >  4 files changed, 142 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 979018e..b4b4cac 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -26,6 +26,7 @@ flags_common_default = [
> >  	['RTE_LIBRTE_AVP_PMD', false],
> >
> >  	['RTE_SCHED_VECTOR', false],
> > +	['RTE_ARM_USE_WFE', false],
> >  ]
> >
> >  flags_generic = [
> > diff --git a/config/common_base b/config/common_base
> > index 8ef75c2..8861713 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> >  CONFIG_RTE_MALLOC_DEBUG=n
> >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> >  CONFIG_RTE_USE_LIBBSD=n
> > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > +# calling these APIs put the cores in low power state while waiting
> > +# for the memory address to become equal to the expected value.
> > +# This is supported only by aarch64.
> > +CONFIG_RTE_ARM_USE_WFE=n
> >
> >  #
> >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power
> testing.
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..dabde17 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_ARM64_H_
> > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> >  	asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_ARM_USE_WFE
> > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > +static __rte_always_inline void \
> > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > +{ \
> > +	type tmp; \
> > +	asm volatile( \
> > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > +		"b.eq	2f\n" \
> > +		"sevl\n" \
> > +		"1:	wfe\n" \
> > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > +		"bne	1b\n" \
> > +		"2:\n" \
> > +		: [tmp] "=&r" (tmp) \
> > +		: [addr] "Q"(*addr), [expected] "r"(expected) \
> > +		: "cc", "memory"); \
> > +}
> > +/* Wait for *addr to be updated with expected value */
> > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> > +#endif
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h
> b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..8906473 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_H_
> > @@ -12,6 +13,10 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +25,105 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +/**
> > + * Wait for *addr to be updated with a 16-bit expected value, with a
> relaxed
> > + * memory ordering model meaning the loads around this API can be
> reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t
> expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 32-bit expected value, with a
> relaxed
> > + * memory ordering model meaning the loads around this API can be
> reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t
> expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 64-bit expected value, with a
> relaxed
> > + * memory ordering model meaning the loads around this API can be
> reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t
> expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 16-bit expected value, with an
> acquire
> > + * memory ordering model meaning the loads after this API can't be
> observed
> > + * before this API.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t
> expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 32-bit expected value, with an
> acquire
> > + * memory ordering model meaning the loads after this API can't be
> observed
> > + * before this API.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t
> expected);
> 
> LGTM in general.
> One stylistic thing: wouldn't it be better to have an API like this:
> rte_wait_until_equal_acquire_X(addr, expected, memory_order)
> ?
> 
> I.e. pass memorder as a parameter rather than incorporating it into the function name?
> Fewer functions, plus the user can specify the order themselves.
> Plus it looks similar to the C11 atomic intrinsics.
> 
Thanks for your comment, will fix this in v8.
> > +
> > +/**
> > + * Wait for *addr to be updated with a 64-bit expected value, with an
> acquire
> > + * memory ordering model meaning the loads after this API can't be
> observed
> > + * before this API.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t
> expected);
> > +
> > +#if !defined(RTE_ARM_USE_WFE)
> > +#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
> > +__rte_always_inline \
> > +static void	\
> > +rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
> > +	type expected) \
> > +{ \
> > +	while (__atomic_load_n(addr, memorder) != expected) \
> > +		rte_pause(); \
> > +}
> > +
> > +/* Wait for *addr to be updated with expected value */
> > +__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
> > +__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
> > +__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
> > +#endif /* RTE_ARM_USE_WFE */
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 3/7] spinlock: use wfe to reduce contention on aarch64
  2019-10-17 18:27   ` David Marchand
  2019-10-18  5:45     ` Gavin Hu (Arm Technology China)
@ 2019-10-21  7:27     ` Gavin Hu (Arm Technology China)
  1 sibling, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-21  7:27 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, nd, thomas, Stephen Hemminger, hemant.agrawal, jerinj,
	Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd

Hi David,

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Friday, October 18, 2019 2:28 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev <dev@dpdk.org>; nd <nd@arm.com>; thomas@monjalon.net;
> Stephen Hemminger <stephen@networkplumber.org>;
> hemant.agrawal@nxp.com; jerinj@marvell.com; Pavan Nikhilesh
> <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v7 3/7] spinlock: use wfe to reduce
> contention on aarch64
> 
> On Fri, Sep 27, 2019 at 7:43 AM Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > In acquiring a spinlock, cores repeatedly poll the lock variable.
> > This is replaced by rte_wait_until_equal API.
> >
> > Running the micro benchmarking and the testpmd and l3fwd traffic tests
> > on ThunderX2, Ampere eMAG80 and Arm N1SDP, everything went well
> and no
> > notable performance gain nor degradation was measured.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  .../common/include/arch/arm/rte_spinlock.h         | 26
> ++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> > index 1a6916b..b61c055 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> > @@ -16,6 +16,32 @@ extern "C" {
> >  #include <rte_common.h>
> >  #include "generic/rte_spinlock.h"
> >
> > +/* armv7a does support WFE, but an explicit wake-up signal using SEV
> is
> > + * required (must be preceded by DSB to drain the store buffer) and
> > + * this is less performant, so keep armv7a implementation unchanged.
> > + */
> > +#ifndef RTE_FORCE_INTRINSICS
> 
> Earlier, in the same file, I can see:
> https://git.dpdk.org/dpdk/tree/lib/librte_eal/common/include/arch/arm/r
> te_spinlock.h?h=v19.08#n8
> 
> #ifndef RTE_FORCE_INTRINSICS
> #  error Platform must be built with CONFIG_RTE_FORCE_INTRINSICS
> #endif
> 
> IIUC, this is dead code.

This is not dead code: RTE_FORCE_INTRINSICS is still mandatory for aarch64, and RTE_ARM_USE_WFE is currently optional.
Yes, as Jerin pointed out, we may also make it optional, like x86, but it is still too early for that while WFE is not yet mandatory; anyway, it is in our plan.
I will tweak the two macros a little to reflect this logic in v8.
/Gavin
> 
> > +static inline void
> > +rte_spinlock_lock(rte_spinlock_t *sl)
> > +{
> > +       unsigned int tmp;
> > +       /* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
> > +        * faqs/ka16809.html
> > +        */
> > +       asm volatile(
> > +               "1:     ldaxr %w[tmp], %w[locked]\n"
> > +               "cbnz   %w[tmp], 2f\n"
> > +               "stxr   %w[tmp], %w[one], %w[locked]\n"
> > +               "cbnz   %w[tmp], 1b\n"
> > +               "ret\n"
> > +               "2:     sevl\n"
> > +               "wfe\n"
> > +               "jmp    1b\n"
> > +               : [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
> > +               : [one] "r" (1)
> > +}
> > +#endif
> > +
> >  static inline int rte_tm_supported(void)
> >  {
> >         return 0;
> > --
> > 2.7.4
> >
> 
> 
> --
> David Marchand


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-10-17 15:45   ` David Marchand
@ 2019-10-21  7:38     ` Gavin Hu (Arm Technology China)
  2019-10-21 19:17       ` David Marchand
  0 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-21  7:38 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, nd, thomas, Stephen Hemminger, hemant.agrawal, jerinj,
	Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd

Hi David, 

One comment inline about the experimental tag for the API.
/Gavin
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Thursday, October 17, 2019 11:45 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev <dev@dpdk.org>; nd <nd@arm.com>; thomas@monjalon.net;
> Stephen Hemminger <stephen@networkplumber.org>;
> hemant.agrawal@nxp.com; jerinj@marvell.com; Pavan Nikhilesh
> <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
> 
> On Fri, Sep 27, 2019 at 7:42 AM Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > The rte_wait_until_equal_xx APIs abstract the functionality of
> > 'polling for a memory location to become equal to a given value'.
> >
> > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > by default. When it is enabled, the above APIs will call WFE instruction
> > to save CPU cycles and power.
> 
> - As discussed on irc, I would prefer we have two stages for this:
> * first stage, an architecture announces it has its own implementation
> of this api, in such a case it defines RTE_ARCH_HAS_WFE in its
> arch/xxx/rte_pause.h header before including generic/rte_pause.h
>   The default implementation with C11 is then skipped in the generic
> header.
> * second stage, in the arm64 header, if RTE_ARM_USE_WFE is set in the
> configuration, then define RTE_ARCH_HAS_WFE
Will fix in v8
> - Can you add a little description on the limitation of using WFE
> instruction in the commit log?
Will fix in v8
> 
> - This is a new api, should be marked experimental, even if inlined.
It is OK to add the tag for all the patches except the rte_ring one, which calls the API in the .h file rather than the .c file.
The .h file is included by a lot of components, each of which would require adding 'allow_experimental_apis = true' to its meson.build and the corresponding change to its Makefile.
I am worried that adding so many of these changes would be confusing. I may leave this patch out of the series if there is no clean solution.
/Gavin
> Small comments inline.
> 
> 
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  config/arm/meson.build                             |   1 +
> >  config/common_base                                 |   5 +
> >  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  | 106
> +++++++++++++++++++++
> >  4 files changed, 142 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 979018e..b4b4cac 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -26,6 +26,7 @@ flags_common_default = [
> >         ['RTE_LIBRTE_AVP_PMD', false],
> >
> >         ['RTE_SCHED_VECTOR', false],
> > +       ['RTE_ARM_USE_WFE', false],
> >  ]
> >
> >  flags_generic = [
> > diff --git a/config/common_base b/config/common_base
> > index 8ef75c2..8861713 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> >  CONFIG_RTE_MALLOC_DEBUG=n
> >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> >  CONFIG_RTE_USE_LIBBSD=n
> > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > +# calling these APIs put the cores in low power state while waiting
> > +# for the memory address to become equal to the expected value.
> > +# This is supported only by aarch64.
> > +CONFIG_RTE_ARM_USE_WFE=n
> >
> >  #
> >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power
> testing.
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..dabde17 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_ARM64_H_
> > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> >         asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_ARM_USE_WFE
> > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > +static __rte_always_inline void \
> > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > +{ \
> > +       type tmp; \
> > +       asm volatile( \
> > +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> > +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> > +               "b.eq   2f\n" \
> > +               "sevl\n" \
> > +               "1:     wfe\n" \
> > +               #asm_op " %" #wide "[tmp], %[addr]\n" \
> > +               "cmp    %" #wide "[tmp], %" #wide "[expected]\n" \
> > +               "bne    1b\n" \
> > +               "2:\n" \
> > +               : [tmp] "=&r" (tmp) \
> > +               : [addr] "Q"(*addr), [expected] "r"(expected) \
> > +               : "cc", "memory"); \
> > +}
> > +/* Wait for *addr to be updated with expected value */
> > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> 
> Missing #undef __WAIT_UNTIL_EQUAL
Will fix in v8
> 
> > +#endif
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h
> b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..8906473 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_H_
> > @@ -12,6 +13,10 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +25,105 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +/**
> 
> Missing warning on experimental api.
Will add it in v8.
> 
> > + * Wait for *addr to be updated with a 16-bit expected value, with a
> relaxed
> > + * memory ordering model meaning the loads around this API can be
> reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + */
> 
> Missing experimental tag.
> Saying this only once, please update declarations below, plus doxygen
> header.
Will fix all in v8.
> 
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t
> expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 32-bit expected value, with a
> relaxed
> > + * memory ordering model meaning the loads around this API can be
> reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t
> expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 64-bit expected value, with a
> relaxed
> > + * memory ordering model meaning the loads around this API can be
> reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t
> expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 16-bit expected value, with an
> acquire
> > + * memory ordering model meaning the loads after this API can't be
> observed
> > + * before this API.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t
> expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 32-bit expected value, with an
> acquire
> > + * memory ordering model meaning the loads after this API can't be
> observed
> > + * before this API.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t
> expected);
> > +
> > +/**
> > + * Wait for *addr to be updated with a 64-bit expected value, with an
> acquire
> > + * memory ordering model meaning the loads after this API can't be
> observed
> > + * before this API.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + */
> > +__rte_always_inline
> > +static void
> > +rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t
> expected);
> > +
> > +#if !defined(RTE_ARM_USE_WFE)
> > +#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
> > +__rte_always_inline \
> > +static void    \
> > +rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
> > +       type expected) \
> > +{ \
> > +       while (__atomic_load_n(addr, memorder) != expected) \
> > +               rte_pause(); \
> > +}
> > +
> > +/* Wait for *addr to be updated with expected value */
> > +__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
> > +__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
> > +__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
> > +__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
> 
> #undef __WAIT_UNTIL_EQUAL
> 
> > +#endif /* RTE_ARM_USE_WFE */
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4
> >
> 
> 
> --
> David Marchand


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v8 0/6] use WFE for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (49 preceding siblings ...)
  2019-09-27  5:41 ` [dpdk-dev] [PATCH v7 7/7] event/opdl: " Gavin Hu
@ 2019-10-21  9:47 ` Gavin Hu
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 1/6] bus/fslmc: fix the conflicting dmb function Gavin Hu
                   ` (30 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-21  9:47 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

V8:
- define wfe() and sev() macros and use them inside normal C code (Ananyev Konstantin)
- pass memorder as parameter, not to incorporate it into function name, less functions, similar to C11 atomic intrinsics (Ananyev Konstantin)
- remove mandating RTE_FORCE_INTRINSICS in arm spinlock implementation (David Marchand)
- undef __WAIT_UNTIL_EQUAL after use (David Marchand)
- add experimental tag and warning (David Marchand)
- add the limitation of using WFE instruction in the commit log (David Marchand)
- tweak the use of RTE_FORCE_INTRINSICS (still mandatory for aarch64) and RTE_ARM_USE_WFE for spinlock (David Marchand)
- drop the rte_ring patch from this series: rte_ring.h calls this API and is included by many components, each of which would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the future, after the experimental tag is removed.
V7:
- fix the checkpatch LONG_LINE_COMMENT issue
V6:
- squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
- move the new configuration to the end of EAL
- add doxygen comments to reflect the relaxed and acquire semantics
- correct the meson configuration 
V5:
- add doxygen comments for the new APIs
- spinlock early exit without wfe if the spinlock not taken by others.
- add two patches on top for opdl and thunderx
V4:
- rename the config to CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for the assembly skeleton to reduce the duplication of code
- add one patch for nxp fslmc to address a compiling error
V3:
- Convert RFCs to patches
V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline 
V1:
- Add the new APIs and use it for ring and locks

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
allows the CPU core to enter a low power state until woken up by an update
to the memory location being polled, thus reducing cache and memory
transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a polling loop
with rte_pause, so there are no performance differences.
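
As a rough illustration of that generic path, here is a minimal sketch. It
assumes a GCC-compatible compiler for the __atomic builtins; the names
sketch_pause and sketch_wait_until_equal_32 are hypothetical stand-ins used
only for this example, not the functions added by the series.

```c
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(); the real one is arch specific
 * (PAUSE on x86, YIELD on aarch64). Here it is only a compiler barrier.
 */
static inline void
sketch_pause(void)
{
	__asm__ volatile("" ::: "memory");
}

/* Poll *addr until it equals expected. memorder is expected to be
 * __ATOMIC_RELAXED or __ATOMIC_ACQUIRE, as in the v8 proposal.
 */
static inline void
sketch_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		sketch_pause();
}
```

A caller that needs to read data published before the flag was set would
pass __ATOMIC_ACQUIRE; a caller that only needs to observe the value itself
can pass __ATOMIC_RELAXED.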

For Arm platforms, use of WFE can be configured using the
CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (sev) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_ARM_USE_WFE should be enabled depending on the performance
benchmarking on the target platforms. Power saving should be a bonus,
but currently we don't have ways to characterize that.

Gavin Hu (6):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  spinlock: use wfe to reduce contention on aarch64
  ticketlock: use new API to reduce contention on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |  1 +
 config/common_base                                 |  5 ++
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |  8 +-
 drivers/event/opdl/Makefile                        |  1 +
 drivers/event/opdl/meson.build                     |  1 +
 drivers/event/opdl/opdl_ring.c                     |  5 +-
 drivers/net/thunderx/Makefile                      |  1 +
 drivers/net/thunderx/meson.build                   |  1 +
 drivers/net/thunderx/nicvf_rxtx.c                  |  3 +-
 .../common/include/arch/arm/rte_pause_64.h         | 26 ++++++
 .../common/include/arch/arm/rte_spinlock.h         | 26 ++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 93 ++++++++++++++++++++++
 .../common/include/generic/rte_spinlock.h          |  2 +-
 .../common/include/generic/rte_ticketlock.h        |  3 +-
 14 files changed, 163 insertions(+), 13 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v8 1/6] bus/fslmc: fix the conflicting dmb function
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (50 preceding siblings ...)
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 0/6] use WFE for aarch64 Gavin Hu
@ 2019-10-21  9:47 ` Gavin Hu
  2019-10-21 19:00   ` David Marchand
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal Gavin Hu
                   ` (29 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-10-21  9:47 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper, stable

There are two definitions of dmb conflicting with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to reuse the EAL definition to avoid conflicts.

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@i
VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..68ce38b 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -31,12 +31,10 @@ struct fsl_mc_io {
 #include <errno.h>
 #include <sys/uio.h>
 #include <linux/byteorder/little_endian.h>
+#include <rte_atomic.h>
 
-#ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
-#endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#define __iormb()	rte_io_rmb()
+#define __iowmb()	rte_io_wmb()
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (51 preceding siblings ...)
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 1/6] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-10-21  9:47 ` Gavin Hu
  2019-10-21 19:19   ` David Marchand
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 3/6] spinlock: use wfe to reduce contention on aarch64 Gavin Hu
                   ` (28 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-10-21  9:47 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

The rte_wait_until_equal_xx APIs abstract the functionality of
'polling for a memory location to become equal to a given value'.

Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
by default. When it is enabled, the above APIs will call WFE instruction
to save CPU cycles and power.

When this API is called from a VM on aarch64, WFE may trap in and out to
release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
adaptive trapping mechanism has been introduced to balance the latency and
workload.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 config/arm/meson.build                             |  1 +
 config/common_base                                 |  5 ++
 .../common/include/arch/arm/rte_pause_64.h         | 26 ++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 93 ++++++++++++++++++++++
 4 files changed, 125 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..b4b4cac 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -26,6 +26,7 @@ flags_common_default = [
 	['RTE_LIBRTE_AVP_PMD', false],
 
 	['RTE_SCHED_VECTOR', false],
+	['RTE_ARM_USE_WFE', false],
 ]
 
 flags_generic = [
diff --git a/config/common_base b/config/common_base
index e843a21..c812156 100644
--- a/config/common_base
+++ b/config/common_base
@@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
+# calling these APIs puts the cores in low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
 
 #
 # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..eb8f73e 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -17,6 +18,31 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_ARM_USE_WFE
+#define sev()	{ asm volatile("sev" : : : "memory"); }
+#define wfe()	{ asm volatile("wfe" : : : "memory"); }
+
+#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder) \
+__rte_experimental						\
+static __rte_always_inline void					\
+rte_wait_until_equal_##size(volatile type * addr, type expected,\
+int memorder)							\
+{								\
+	if (__atomic_load_n(addr, memorder) != expected) {	\
+		sev();							\
+		do {							\
+			wfe();						\
+		} while (__atomic_load_n(addr, memorder) != expected);	\
+	 }								\
+}
+__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
+__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
+__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
+
+#undef __WAIT_UNTIL_EQUAL
+
+#endif /* RTE_ARM_USE_WFE */
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..80597a9 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
@@ -12,6 +13,11 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+#include <rte_compat.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +26,91 @@
  */
 static inline void rte_pause(void);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder);
+
+#ifndef RTE_ARM_USE_WFE
+#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder) \
+__rte_experimental \
+static __rte_always_inline void					\
+rte_wait_until_equal_##size(volatile type * addr, type expected,\
+int memorder)							\
+{								\
+	if (__atomic_load_n(addr, memorder) != expected) {	\
+		do {							\
+			rte_pause();					\
+		} while (__atomic_load_n(addr, memorder) != expected);	\
+	}								\
+}
+__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
+__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
+__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
+
+#undef __WAIT_UNTIL_EQUAL
+
+#endif /* RTE_ARM_USE_WFE */
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4
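
For readers who want to try the fallback behaviour outside of DPDK, here is
a minimal sketch of the "poll until equal" loop described in this patch. It
is not the DPDK implementation: demo_pause() and demo_wait_until_equal_32()
are hypothetical stand-ins for rte_pause() and rte_wait_until_equal_32(),
and only the GCC/Clang __atomic builtins are assumed.

```c
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): a CPU relax hint where one exists. */
static inline void demo_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");	/* compiler barrier only */
#endif
}

/*
 * Hypothetical stand-in for rte_wait_until_equal_32(): spin until *addr
 * equals expected, loading with the caller-supplied memory order
 * (__ATOMIC_RELAXED or __ATOMIC_ACQUIRE).
 */
static inline void
demo_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		demo_pause();
}
```

If the value already matches, the call returns without pausing, which is
the same fast path the "if ... do ... while" form in the patch provides.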



* [dpdk-dev] [PATCH v8 3/6] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (52 preceding siblings ...)
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal Gavin Hu
@ 2019-10-21  9:47 ` Gavin Hu
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 4/6] ticketlock: use new API " Gavin Hu
                   ` (27 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-21  9:47 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

When acquiring a spinlock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

The micro benchmarks and the testpmd and l3fwd traffic tests were run
on ThunderX2, Ampere eMAG80 and Arm N1SDP; no notable performance gain
or degradation was measured.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 26 ++++++++++++++++++++++
 .../common/include/generic/rte_spinlock.h          |  2 +-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..c69bed1 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,32 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#ifdef RTE_ARM_USE_WFE
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"sevl\n"
+		"1:	wfe\n"
+		"2:	ldaxr %w[tmp], %w[locked]\n"
+		"cbnz   %w[tmp], 1b\n"
+		"stxr   %w[tmp], %w[one], %w[locked]\n"
+		"cbnz   %w[tmp], 2b\n"
+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
+		: [one] "r" (1)
+		: "cc", "memory"
+	);
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 87ae7a4..cf57c72 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl);
 
-#ifdef RTE_FORCE_INTRINSICS
+#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_ARM_USE_WFE)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-- 
2.7.4
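
As a portable point of comparison, the sketch below shows the same locking
idea written with GCC/Clang __atomic builtins instead of aarch64 assembly:
try to take the lock, and on failure wait with plain loads until it looks
free before retrying. The type and function names (demo_spinlock_t,
demo_spinlock_lock, and so on) are hypothetical and are not part of DPDK;
on aarch64 the inner wait loop is the place where a WFE-based wait would
slot in.

```c
#include <stdint.h>

typedef struct {
	volatile uint32_t locked;	/* 0 = free, 1 = taken */
} demo_spinlock_t;

static inline void demo_spinlock_lock(demo_spinlock_t *sl)
{
	uint32_t exp = 0;

	/* Strong CAS: on failure, exp is overwritten with the observed value. */
	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
		/* Wait with read-only accesses until the lock looks free. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED) != 0)
			__asm__ volatile("" ::: "memory");
		exp = 0;
	}
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
static inline int demo_spinlock_trylock(demo_spinlock_t *sl)
{
	uint32_t exp = 0;

	return __atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static inline void demo_spinlock_unlock(demo_spinlock_t *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```

Waiting with loads rather than retrying the compare-and-swap keeps the
cache line in a shared state while the lock is held, which is the cost the
cover letter is trying to reduce.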



* [dpdk-dev] [PATCH v8 4/6] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (53 preceding siblings ...)
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 3/6] spinlock: use wfe to reduce contention on aarch64 Gavin Hu
@ 2019-10-21  9:47 ` " Gavin Hu
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 5/6] net/thunderx: use new API to save cycles " Gavin Hu
                   ` (26 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-21  9:47 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

While using a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1]
showed variance between runs, but no notable performance gain or
degradation with or without this patch.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/\
docs/440/neoverse-n1-sdp

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..c295ae7 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.7.4
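
To make the polling pattern in this patch concrete, here is a small
standalone ticket lock sketch. It assumes only the GCC/Clang __atomic
builtins; demo_ticketlock_t and its functions are hypothetical names, not
the DPDK structures. The wait loop in demo_ticketlock_lock() is the spot
that the patch replaces with rte_wait_until_equal_16().

```c
#include <stdint.h>

typedef struct {
	volatile uint16_t current;	/* ticket currently being served */
	volatile uint16_t next;		/* next ticket to hand out */
} demo_ticketlock_t;

static inline void demo_ticketlock_lock(demo_ticketlock_t *tl)
{
	/* Take a ticket, then wait until it is being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	while (__atomic_load_n(&tl->current, __ATOMIC_ACQUIRE) != me)
		__asm__ volatile("" ::: "memory");
}

static inline void demo_ticketlock_unlock(demo_ticketlock_t *tl)
{
	/* Only the lock holder writes current, so a relaxed read is enough. */
	uint16_t i = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(i + 1), __ATOMIC_RELEASE);
}
```

Each waiter polls for one specific value of current, which is why a "wait
until equal" primitive fits this lock without further changes.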



* [dpdk-dev] [PATCH v8 5/6] net/thunderx: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (54 preceding siblings ...)
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 4/6] ticketlock: use new API " Gavin Hu
@ 2019-10-21  9:47 ` " Gavin Hu
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 6/6] event/opdl: " Gavin Hu
                   ` (25 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-21  9:47 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/net/thunderx/Makefile     | 1 +
 drivers/net/thunderx/meson.build  | 1 +
 drivers/net/thunderx/nicvf_rxtx.c | 3 +--
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/thunderx/Makefile b/drivers/net/thunderx/Makefile
index e6bf497..9e0de10 100644
--- a/drivers/net/thunderx/Makefile
+++ b/drivers/net/thunderx/Makefile
@@ -10,6 +10,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_pmd_thunderx_nicvf.a
 
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 
 LDLIBS += -lm
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
diff --git a/drivers/net/thunderx/meson.build b/drivers/net/thunderx/meson.build
index 69819a9..23d9458 100644
--- a/drivers/net/thunderx/meson.build
+++ b/drivers/net/thunderx/meson.build
@@ -4,6 +4,7 @@
 subdir('base')
 objs = [base_objs]
 
+allow_experimental_apis = true
 sources = files('nicvf_rxtx.c',
 		'nicvf_ethdev.c',
 		'nicvf_svf.c'
diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
index 1c42874..90a6098 100644
--- a/drivers/net/thunderx/nicvf_rxtx.c
+++ b/drivers/net/thunderx/nicvf_rxtx.c
@@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
 		ltail++;
 	}
 
-	while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
-		rte_pause();
+	rte_wait_until_equal_32(&rbdr->tail, next_tail, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
 	nicvf_addr_write(door, to_fill);
-- 
2.7.4



* [dpdk-dev] [PATCH v8 6/6] event/opdl: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (55 preceding siblings ...)
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 5/6] net/thunderx: use new API to save cycles " Gavin Hu
@ 2019-10-21  9:47 ` " Gavin Hu
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 0/5] use WFE for aarch64 Gavin Hu
                   ` (24 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-21  9:47 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/event/opdl/Makefile    | 1 +
 drivers/event/opdl/meson.build | 1 +
 drivers/event/opdl/opdl_ring.c | 5 ++---
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/event/opdl/Makefile b/drivers/event/opdl/Makefile
index bf50a60..72ef07d 100644
--- a/drivers/event/opdl/Makefile
+++ b/drivers/event/opdl/Makefile
@@ -9,6 +9,7 @@ LIB = librte_pmd_opdl_event.a
 # build flags
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 # for older GCC versions, allow us to initialize an event using
 # designated initializers.
 ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
diff --git a/drivers/event/opdl/meson.build b/drivers/event/opdl/meson.build
index 1fe034e..e67b164 100644
--- a/drivers/event/opdl/meson.build
+++ b/drivers/event/opdl/meson.build
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2018 Luca Boccassi <bluca@debian.org>
 
+allow_experimental_apis = true
 sources = files(
 	'opdl_evdev.c',
 	'opdl_evdev_init.c',
diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
index 06fb5b3..c8d19fe 100644
--- a/drivers/event/opdl/opdl_ring.c
+++ b/drivers/event/opdl/opdl_ring.c
@@ -16,6 +16,7 @@
 #include <rte_memcpy.h>
 #include <rte_memory.h>
 #include <rte_memzone.h>
+#include <rte_atomic.h>
 
 #include "opdl_ring.h"
 #include "opdl_log.h"
@@ -474,9 +475,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
 	/* If another thread started inputting before this one, but hasn't
 	 * finished, we need to wait for it to complete to update the tail.
 	 */
-	while (unlikely(__atomic_load_n(&s->shared.tail, __ATOMIC_ACQUIRE) !=
-			old_head))
-		rte_pause();
+	rte_wait_until_equal_32(&s->shared.tail, old_head, __ATOMIC_ACQUIRE);
 
 	__atomic_store_n(&s->shared.tail, old_head + num_entries,
 			__ATOMIC_RELEASE);
-- 
2.7.4
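
The tail hand-off that this patch and the thunderx patch touch can be
sketched in isolation as follows. This is an illustration under stated
assumptions, not the opdl or nicvf code: struct demo_ring, demo_claim() and
demo_publish() are hypothetical names, and only the GCC/Clang __atomic
builtins are used. Each producer claims a range by advancing head, then
waits until tail reaches the start of its range before publishing.

```c
#include <stdint.h>

struct demo_ring {
	volatile uint32_t head;	/* next slot to be claimed */
	volatile uint32_t tail;	/* slots below this index are published */
};

/* Claim n slots and return the index of the first one. */
static inline uint32_t demo_claim(struct demo_ring *r, uint32_t n)
{
	return __atomic_fetch_add(&r->head, n, __ATOMIC_RELAXED);
}

/*
 * Publish n slots starting at old_head. Producers that claimed earlier
 * ranges must publish first, so wait until tail equals old_head.
 */
static inline void
demo_publish(struct demo_ring *r, uint32_t old_head, uint32_t n)
{
	while (__atomic_load_n(&r->tail, __ATOMIC_ACQUIRE) != old_head)
		__asm__ volatile("" ::: "memory");

	__atomic_store_n(&r->tail, old_head + n, __ATOMIC_RELEASE);
}
```

The wait in demo_publish() is the loop that the patch replaces with
rte_wait_until_equal_32(); the release store that follows is unchanged.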



* Re: [dpdk-dev] [PATCH v8 1/6] bus/fslmc: fix the conflicting dmb function
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 1/6] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-10-21 19:00   ` David Marchand
  0 siblings, 0 replies; 163+ messages in thread
From: David Marchand @ 2019-10-21 19:00 UTC (permalink / raw)
  To: Gavin Hu, Hemant Agrawal
  Cc: dev, nd, Ananyev, Konstantin, Thomas Monjalon, Stephen Hemminger,
	Jerin Jacob Kollanukkaran, Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang, Steve Capper, dpdk stable

On Mon, Oct 21, 2019 at 11:48 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> There are two definitions conflicting with each other; for more
> details, refer to [1].
>
> include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
> drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
> previous definition
>  #define dmb() {__asm__ __volatile__("" : : : "memory"); }
>
> The fix is to reuse the EAL definition to avoid conflicts.
>
> [1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@i
> VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u

This url is broken as reported previously.
I can fix when applying, but please take the time to fix basic issues
like this when reported.

>
> Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>

Was the last change on using EAL memory barrier properly seen by Hemant ?
Just want to be sure he is ok.



> ---
>  drivers/bus/fslmc/mc/fsl_mc_sys.h | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
> index d0c7b39..68ce38b 100644
> --- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
> +++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
> @@ -31,12 +31,10 @@ struct fsl_mc_io {
>  #include <errno.h>
>  #include <sys/uio.h>
>  #include <linux/byteorder/little_endian.h>
> +#include <rte_atomic.h>
>
> -#ifndef dmb
> -#define dmb() {__asm__ __volatile__("" : : : "memory"); }
> -#endif
> -#define __iormb()      dmb()
> -#define __iowmb()      dmb()
> +#define __iormb()      rte_io_rmb()
> +#define __iowmb()      rte_io_wmb()
>  #define __arch_getq(a)         (*(volatile uint64_t *)(a))
>  #define __arch_putq(v, a)      (*(volatile uint64_t *)(a) = (v))
>  #define __arch_putq32(v, a)    (*(volatile uint32_t *)(a) = (v))
> --
> 2.7.4
>


--
David Marchand



* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-10-21  7:38     ` Gavin Hu (Arm Technology China)
@ 2019-10-21 19:17       ` David Marchand
  0 siblings, 0 replies; 163+ messages in thread
From: David Marchand @ 2019-10-21 19:17 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China)
  Cc: dev, nd, thomas, Stephen Hemminger, hemant.agrawal, jerinj,
	Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper

On Mon, Oct 21, 2019 at 9:39 AM Gavin Hu (Arm Technology China)
<Gavin.Hu@arm.com> wrote:
> > -----Original Message-----
> > From: David Marchand <david.marchand@redhat.com>
> > - This is a new api, should be marked experimental, even if inlined.
> It is ok to add it for the patches except rte_ring, which calls the API in the .h file rather than the .c file.
> The .h file is included by a lot of components, each of which would require adding 'allow_experimental_apis = true' to its meson.build and Makefile.
> I am worried that adding too many of these changes is confusing. I may leave this patch out of the series if there is no clean solution.

You can still keep the current code in the ring headers under a
#ifndef ALLOW_EXPERIMENTAL_API banner and put the call to your new
experimental api in the #else part of it.
Something like:

#ifndef ALLOW_EXPERIMENTAL_API
                while (unlikely(ht->tail != old_val))
                        rte_pause();
#else
                rte_wait_until_equal_relaxed_32(&ht->tail, old_val);

#endif

This way, if the application enables the experimental api, then the
ring code will benefit from it, else it will rely on the current
stable code.
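
A self-contained version of that guard, for anyone who wants to compile it
outside the tree, could look like the sketch below. The DEMO_ prefixed
macro and the demo_ functions are hypothetical stand-ins for
ALLOW_EXPERIMENTAL_API, rte_wait_until_equal_relaxed_32() and the ring
helper; the experimental path here is only a placeholder loop.

```c
#include <stdint.h>

/* Placeholder for the experimental wait API (hypothetical name). */
static inline void
demo_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		__asm__ volatile("" ::: "memory");
}

/*
 * Header-level helper: build with -DDEMO_ALLOW_EXPERIMENTAL_API to use
 * the new wait API, otherwise keep the existing stable polling loop.
 */
static inline void
demo_wait_for_tail(volatile uint32_t *tail, uint32_t old_val)
{
#ifndef DEMO_ALLOW_EXPERIMENTAL_API
	while (*tail != old_val)
		__asm__ volatile("" ::: "memory");
#else
	demo_wait_until_equal_relaxed_32(tail, old_val);
#endif
}
```

Both branches have the same observable behaviour, so components that do not
opt in are unaffected by the header change.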

-- 
David Marchand



* Re: [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal Gavin Hu
@ 2019-10-21 19:19   ` David Marchand
  2019-10-22  9:36     ` Ananyev, Konstantin
  0 siblings, 1 reply; 163+ messages in thread
From: David Marchand @ 2019-10-21 19:19 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Ananyev, Konstantin, Thomas Monjalon, Stephen Hemminger,
	Hemant Agrawal, Jerin Jacob Kollanukkaran, Pavan Nikhilesh,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang, Steve Capper

On Mon, Oct 21, 2019 at 11:48 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> The rte_wait_until_equal_xx APIs abstract the functionality of
> 'polling for a memory location to become equal to a given value'.
>
> Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> by default. When it is enabled, the above APIs will call WFE instruction
> to save CPU cycles and power.
>
> When called from a VM on aarch64, this API may trap in and out to
> release vCPUs, which causes high exit latency. Kernel 4.18.20
> introduced an adaptive trapping mechanism to balance the latency and
> workload.
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
>  config/arm/meson.build                             |  1 +
>  config/common_base                                 |  5 ++
>  .../common/include/arch/arm/rte_pause_64.h         | 26 ++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  | 93 ++++++++++++++++++++++
>  4 files changed, 125 insertions(+)
>
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..b4b4cac 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -26,6 +26,7 @@ flags_common_default = [
>         ['RTE_LIBRTE_AVP_PMD', false],
>
>         ['RTE_SCHED_VECTOR', false],
> +       ['RTE_ARM_USE_WFE', false],
>  ]
>
>  flags_generic = [
> diff --git a/config/common_base b/config/common_base
> index e843a21..c812156 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
>  CONFIG_RTE_MALLOC_DEBUG=n
>  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
>  CONFIG_RTE_USE_LIBBSD=n
> +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs;
> +# calling these APIs puts the cores in a low power state while waiting
> +# for the memory address to become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_ARM_USE_WFE=n
>
>  #
>  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..eb8f73e 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */

Before including generic/rte_pause.h, put a check like:

#ifdef RTE_ARM_USE_WFE
#define RTE_ARCH_HAS_WFE
#endif

#include "generic/rte_pause.h"

>
>  #ifndef _RTE_PAUSE_ARM64_H_
> @@ -17,6 +18,31 @@ static inline void rte_pause(void)
>         asm volatile("yield" ::: "memory");
>  }
>
> +#ifdef RTE_ARM_USE_WFE
> +#define sev()  { asm volatile("sev" : : : "memory"); }
> +#define wfe()  { asm volatile("wfe" : : : "memory"); }
> +
> +#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder) \
> +__rte_experimental                                             \

The experimental tag is unnecessary here.
We only need it in the function prototype (in the generic header).

> +static __rte_always_inline void                                        \
> +rte_wait_until_equal_##size(volatile type * addr, type expected,\
> +int memorder)                                                  \
> +{                                                              \
> +       if (__atomic_load_n(addr, memorder) != expected) {      \
> +               sev();                                                  \
> +               do {                                                    \
> +                       wfe();                                          \
> +               } while (__atomic_load_n(addr, memorder) != expected);  \
> +        }                                                              \
> +}
> +__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
> +__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
> +__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
> +
> +#undef __WAIT_UNTIL_EQUAL

Missing #undef on sev and wfe macros.


> +
> +#endif /* RTE_ARM_USE_WFE */
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> index 52bd4db..80597a9 100644
> --- a/lib/librte_eal/common/include/generic/rte_pause.h
> +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
>
>  #ifndef _RTE_PAUSE_H_
> @@ -12,6 +13,11 @@
>   *
>   */
>
> +#include <stdint.h>
> +#include <rte_common.h>
> +#include <rte_atomic.h>
> +#include <rte_compat.h>
> +
>  /**
>   * Pause CPU execution for a short while
>   *
> @@ -20,4 +26,91 @@
>   */
>  static inline void rte_pause(void);
>
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 16-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 32-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 64-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> +int memorder);
> +

Here, change this check to:

#ifndef RTE_ARCH_HAS_WFE


> +#ifndef RTE_ARM_USE_WFE
> +#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder) \
> +__rte_experimental \
> +static __rte_always_inline void                                        \
> +rte_wait_until_equal_##size(volatile type * addr, type expected,\
> +int memorder)                                                  \
> +{                                                              \
> +       if (__atomic_load_n(addr, memorder) != expected) {      \
> +               do {                                                    \
> +                       rte_pause();                                    \
> +               } while (__atomic_load_n(addr, memorder) != expected);  \
> +       }                                                               \
> +}
> +__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
> +__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
> +__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
> +
> +#undef __WAIT_UNTIL_EQUAL
> +
> +#endif /* RTE_ARM_USE_WFE */
> +
>  #endif /* _RTE_PAUSE_H_ */
> --
> 2.7.4
>

Thanks.

--
David Marchand
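
The header layering suggested in this review can be sketched in a single
file as follows. This is only an illustration of the pattern: the DEMO_
macros and demo_wait_until_equal_32() are hypothetical names standing in
for RTE_ARM_USE_WFE, RTE_ARCH_HAS_WFE and the real API, and the
"arch" implementation is a placeholder loop rather than WFE code.

```c
#include <stdint.h>

/* "Arch header", before including the generic one: opt in if configured. */
#ifdef DEMO_ARM_USE_WFE
#define DEMO_ARCH_HAS_WFE
#endif

/* "Generic header": the prototype is always visible. */
static inline void
demo_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder);

/* "Generic header": fallback body unless the arch provides its own. */
#ifndef DEMO_ARCH_HAS_WFE
#define DEMO_WAIT_IS_GENERIC 1
static inline void
demo_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		__asm__ volatile("" ::: "memory");
}
#endif

/* "Arch header", after the include: arch-specific body when opted in. */
#ifdef DEMO_ARCH_HAS_WFE
#define DEMO_WAIT_IS_GENERIC 0
static inline void
demo_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	/* Placeholder: a real arch header would wait with WFE here. */
	while (__atomic_load_n(addr, memorder) != expected)
		__asm__ volatile("" ::: "memory");
}
#endif
```

With this split the generic header only tests an architecture-neutral
capability macro, so it does not need to know about Arm-specific options.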



* Re: [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal
  2019-10-21 19:19   ` David Marchand
@ 2019-10-22  9:36     ` Ananyev, Konstantin
  2019-10-22 10:17       ` David Marchand
  2019-10-22 16:03       ` Gavin Hu (Arm Technology China)
  0 siblings, 2 replies; 163+ messages in thread
From: Ananyev, Konstantin @ 2019-10-22  9:36 UTC (permalink / raw)
  To: David Marchand, Gavin Hu
  Cc: dev, nd, Thomas Monjalon, Stephen Hemminger, Hemant Agrawal,
	Jerin Jacob Kollanukkaran, Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang, Steve Capper

> > The rte_wait_until_equal_xx APIs abstract the functionality of
> > 'polling for a memory location to become equal to a given value'.
> >
> > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > by default. When it is enabled, the above APIs will call WFE instruction
> > to save CPU cycles and power.
> >
> > When called from a VM on aarch64, this API may trap in and out to
> > release vCPUs, which causes high exit latency. Kernel 4.18.20
> > introduced an adaptive trapping mechanism to balance the latency and
> > workload.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > Acked-by: Jerin Jacob <jerinj@marvell.com>
> > ---
> >  config/arm/meson.build                             |  1 +
> >  config/common_base                                 |  5 ++
> >  .../common/include/arch/arm/rte_pause_64.h         | 26 ++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  | 93 ++++++++++++++++++++++
> >  4 files changed, 125 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 979018e..b4b4cac 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -26,6 +26,7 @@ flags_common_default = [
> >         ['RTE_LIBRTE_AVP_PMD', false],
> >
> >         ['RTE_SCHED_VECTOR', false],
> > +       ['RTE_ARM_USE_WFE', false],
> >  ]
> >
> >  flags_generic = [
> > diff --git a/config/common_base b/config/common_base
> > index e843a21..c812156 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> >  CONFIG_RTE_MALLOC_DEBUG=n
> >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> >  CONFIG_RTE_USE_LIBBSD=n
> > +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs;
> > +# calling these APIs puts the cores in a low power state while waiting
> > +# for the memory address to become equal to the expected value.
> > +# This is supported only by aarch64.
> > +CONFIG_RTE_ARM_USE_WFE=n
> >
> >  #
> >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..eb8f73e 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> 
> Before including generic/rte_pause.h, put a check like:
> 
> #ifdef RTE_ARM_USE_WFE
> #define RTE_ARCH_HAS_WFE
> #endif
> 
> #include "generic/rte_pause.h"
> 
> >
> >  #ifndef _RTE_PAUSE_ARM64_H_
> > @@ -17,6 +18,31 @@ static inline void rte_pause(void)
> >         asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_ARM_USE_WFE
> > +#define sev()  { asm volatile("sev" : : : "memory") }
> > +#define wfe()  { asm volatile("wfe" : : : "memory") }
> > +
> > +#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder) \
> > +__rte_experimental                                             \
> 
> The experimental tag is unnecessary here.
> We only need it in the function prototype (in the generic header).
> 
> > +static __rte_always_inline void                                        \
> > +rte_wait_until_equal_##size(volatile type * addr, type expected,\
> > +int memorder)                                                  \
> > +{                                                              \
> > +       if (__atomic_load_n(addr, memorder) != expected) {      \
> > +               sev();                                                  \
> > +               do {                                                    \
> > +                       wfe();                                          \
> > +               } while (__atomic_load_n(addr, memorder) != expected);  \
> > +        }                                                              \
> > +}
> > +__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
> > +__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
> > +__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
> > +
> > +#undef __WAIT_UNTIL_EQUAL

Instead of defining and undefining these macros, maybe just
define these 3 functions explicitly?
They are really small now, so I think that would be the easier and cleaner way.
Yes, it means a bit of code duplication, but as I said, they are really small now.
The same applies to the generic version.
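
A minimal sketch of what the explicit generic variants could look like. The sketch_ names are hypothetical and sketch_pause() only stands in for rte_pause() so that the snippet builds outside DPDK; this is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for rte_pause(); an assumption so the sketch builds without DPDK. */
static inline void sketch_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#endif
}

/* The patch documents only these two memory orders as valid. */
static inline void sketch_check_memorder(int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	(void)memorder; /* keeps NDEBUG builds free of unused warnings */
}

static inline void
sketch_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
		int memorder)
{
	sketch_check_memorder(memorder);
	while (__atomic_load_n(addr, memorder) != expected)
		sketch_pause();
}

static inline void
sketch_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	sketch_check_memorder(memorder);
	while (__atomic_load_n(addr, memorder) != expected)
		sketch_pause();
}

static inline void
sketch_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
		int memorder)
{
	sketch_check_memorder(memorder);
	while (__atomic_load_n(addr, memorder) != expected)
		sketch_pause();
}
```

The three bodies are identical apart from the integer width, which is the duplication being traded against the macro.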

> 
> Missing #undef on sev and wfe macros.

Actually, should we undefine them at all?
Or should we add the rte_ prefix (which is needed anyway, I suppose)
and have them always defined?
Maybe you can reuse them in other Arm-specific places too (spinlock, rwlock, etc.).
It is probably possible to make them either emit the proper instruction or act as a NOP;
then you'll need RTE_ARM_USE_WFE only around these macros.

I.e.:

#ifdef RTE_ARM_USE_WFE
#define rte_sev()  do { asm volatile("sev" : : : "memory"); } while (0)
#define rte_wfe()  do { asm volatile("wfe" : : : "memory"); } while (0)
#else
static inline void rte_sev(void)
{
}
static inline void rte_wfe(void)
{
	rte_pause();
}
#endif

And then just one common version of the _wait_ functions:

static __rte_always_inline void
rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	if (__atomic_load_n(addr, memorder) != expected) {
		rte_sev();
		do {
			rte_wfe();
		} while (__atomic_load_n(addr, memorder) != expected);
	}
}
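
For reference, a self-contained sketch of that idea with both wrappers written as inline functions, so the only conditional code is the pair of helpers. The sketch_ names are hypothetical, and the fallback uses a plain compiler barrier where the proposal calls rte_pause(). As far as I know, on aarch64 a WFE wake-up caused by a store normally relies on the waiter having armed the exclusive monitor with a load-exclusive; the sketch does not cover that and only mirrors the structure proposed above.

```c
#include <assert.h>
#include <stdint.h>

#if defined(RTE_ARM_USE_WFE) && defined(__aarch64__)
/* Real instructions only when the option is on and the target is aarch64. */
static inline void sketch_sev(void) { __asm__ volatile("sev" ::: "memory"); }
static inline void sketch_wfe(void) { __asm__ volatile("wfe" ::: "memory"); }
#else
/* Fallback: nothing to signal, and waiting degrades to a plain spin. */
static inline void sketch_sev(void) { }
static inline void sketch_wfe(void)
{
	/* Compiler barrier only; rte_pause() would be called here in DPDK. */
	__asm__ volatile("" ::: "memory");
}
#endif

/* One common wait loop; no RTE_ARM_USE_WFE check is needed in here. */
static inline void
sketch_wfe_wait_32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	if (__atomic_load_n(addr, memorder) != expected) {
		sketch_sev();
		do {
			sketch_wfe();
		} while (__atomic_load_n(addr, memorder) != expected);
	}
}
```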
 

> 
> 
> > +
> > +#endif /* RTE_ARM_USE_WFE */
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..80597a9 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_H_
> > @@ -12,6 +13,11 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +#include <rte_compat.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +26,91 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > +int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> > +int memorder);
> > +
> 
> Here, change this check to:
> 
> #ifndef RTE_ARCH_HAS_WFE
> 
> 
> > +#ifndef RTE_ARM_USE_WFE

Maybe use an arch-neutral name here?
RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED or so?
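
A rough single-file sketch of how such a guard could work, assuming the arch header defines RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED before the generic part is seen. The function name sketch_guarded_wait_32 is hypothetical, and the arch branch is only a placeholder for whatever the aarch64 header would really provide.

```c
#include <assert.h>
#include <stdint.h>

/* ---- arch header part: opts in only when it has its own version ---- */
#if defined(RTE_ARM_USE_WFE) && defined(__aarch64__)
#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
static inline void
sketch_guarded_wait_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	while (__atomic_load_n(addr, memorder) != expected)
		__asm__ volatile("wfe" ::: "memory"); /* placeholder wait */
}
#endif

/* ---- generic header part: knows nothing about Arm or WFE ---- */
#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
static inline void
sketch_guarded_wait_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	while (__atomic_load_n(addr, memorder) != expected)
		; /* rte_pause() would go here in DPDK */
}
#endif
```

With this layout the generic header tests only the neutral macro, so another architecture could supply its own version later without touching it.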

> > +#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder) \
> > +__rte_experimental \
> > +static __rte_always_inline void                                        \
> > +rte_wait_until_equal_##size(volatile type * addr, type expected,\
> > +int memorder)                                                  \
> > +{                                                              \
> > +       if (__atomic_load_n(addr, memorder) != expected) {      \
> > +               do {                                                    \
> > +                       rte_pause();                                    \
> > +               } while (__atomic_load_n(addr, memorder) != expected);  \
> > +       }                                                               \
> > +}
> > +__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
> > +__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
> > +__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
> > +
> > +#undef __WAIT_UNTIL_EQUAL
> > +
> > +#endif /* RTE_ARM_USE_WFE */
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4
> >
> 
> Thanks.
> 
> --
> David Marchand


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal
  2019-10-22  9:36     ` Ananyev, Konstantin
@ 2019-10-22 10:17       ` David Marchand
  2019-10-22 16:05         ` Gavin Hu (Arm Technology China)
  2019-10-22 16:03       ` Gavin Hu (Arm Technology China)
  1 sibling, 1 reply; 163+ messages in thread
From: David Marchand @ 2019-10-22 10:17 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Gavin Hu, dev, nd, Thomas Monjalon, Stephen Hemminger,
	Hemant Agrawal, Jerin Jacob Kollanukkaran, Pavan Nikhilesh,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang, Steve Capper

On Tue, Oct 22, 2019 at 11:37 AM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
> > > The rte_wait_until_equal_xx APIs abstract the functionality of
> > > 'polling for a memory location to become equal to a given value'.
> > >
> > > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > > by default. When it is enabled, the above APIs will call WFE instruction
> > > to save CPU cycles and power.
> > >
> > > From a VM, calling this API on aarch64 may trap in and out to
> > > release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
> > > adaptive trapping mechanism has been introduced to balance the latency and
> > > workload.
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > > Acked-by: Jerin Jacob <jerinj@marvell.com>
> > > ---
> > >  config/arm/meson.build                             |  1 +
> > >  config/common_base                                 |  5 ++
> > >  .../common/include/arch/arm/rte_pause_64.h         | 26 ++++++
> > >  lib/librte_eal/common/include/generic/rte_pause.h  | 93 ++++++++++++++++++++++
> > >  4 files changed, 125 insertions(+)
> > >
> > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > index 979018e..b4b4cac 100644
> > > --- a/config/arm/meson.build
> > > +++ b/config/arm/meson.build
> > > @@ -26,6 +26,7 @@ flags_common_default = [
> > >         ['RTE_LIBRTE_AVP_PMD', false],
> > >
> > >         ['RTE_SCHED_VECTOR', false],
> > > +       ['RTE_ARM_USE_WFE', false],
> > >  ]
> > >
> > >  flags_generic = [
> > > diff --git a/config/common_base b/config/common_base
> > > index e843a21..c812156 100644
> > > --- a/config/common_base
> > > +++ b/config/common_base
> > > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> > >  CONFIG_RTE_MALLOC_DEBUG=n
> > >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> > >  CONFIG_RTE_USE_LIBBSD=n
> > > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > > +# calling these APIs put the cores in low power state while waiting
> > > +# for the memory address to become equal to the expected value.
> > > +# This is supported only by aarch64.
> > > +CONFIG_RTE_ARM_USE_WFE=n
> > >
> > >  #
> > >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > index 93895d3..eb8f73e 100644
> > > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > @@ -1,5 +1,6 @@
> > >  /* SPDX-License-Identifier: BSD-3-Clause
> > >   * Copyright(c) 2017 Cavium, Inc
> > > + * Copyright(c) 2019 Arm Limited
> > >   */
> >
> > Before including generic/rte_pause.h, put a check like:
> >
> > #ifdef RTE_ARM_USE_WFE
> > #define RTE_ARCH_HAS_WFE
> > #endif
> >
> > #include "generic/rte_pause.h"
> >
> > >
> > >  #ifndef _RTE_PAUSE_ARM64_H_
> > > @@ -17,6 +18,31 @@ static inline void rte_pause(void)
> > >         asm volatile("yield" ::: "memory");
> > >  }
> > >
> > > +#ifdef RTE_ARM_USE_WFE
> > > +#define sev()  { asm volatile("sev" : : : "memory") }
> > > +#define wfe()  { asm volatile("wfe" : : : "memory") }
> > > +
> > > +#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder) \
> > > +__rte_experimental                                             \
> >
> > The experimental tag is unnecessary here.
> > We only need it in the function prototype (in the generic header).
> >
> > > +static __rte_always_inline void                                        \
> > > +rte_wait_until_equal_##size(volatile type * addr, type expected,\
> > > +int memorder)                                                  \
> > > +{                                                              \
> > > +       if (__atomic_load_n(addr, memorder) != expected) {      \
> > > +               sev();                                                  \
> > > +               do {                                                    \
> > > +                       wfe();                                          \
> > > +               } while (__atomic_load_n(addr, memorder) != expected);  \
> > > +        }                                                              \
> > > +}
> > > +__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
> > > +__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
> > > +__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
> > > +
> > > +#undef __WAIT_UNTIL_EQUAL
>
> Instead of defining and undefining these macros, maybe just
> define these 3 functions explicitly?
> They are really small now, so I think that would be the easier and cleaner way.
> Yes, it means a bit of code duplication, but as I said, they are really small now.
> The same applies to the generic version.

I don't really like those macros defining inlines either.
I am fine with this little duplication, so +1 from me.


>
> >
> > Missing #undef on sev and wfe macros.
>
> Actually, should we undefine them at all?
> Or should we add the rte_ prefix (which is needed anyway, I suppose)
> and have them always defined?
> Maybe you can reuse them in other Arm-specific places too (spinlock, rwlock, etc.).
> It is probably possible to make them either emit the proper instruction or act as a NOP;
> then you'll need RTE_ARM_USE_WFE only around these macros.

Interesting idea, but only if it gets used in this series.
I don't want to see stuff that will not be used later.

>
> I.e.:
>
> #ifdef RTE_ARM_USE_WFE
> #define rte_sev()  do { asm volatile("sev" : : : "memory"); } while (0)
> #define rte_wfe()  do { asm volatile("wfe" : : : "memory"); } while (0)
> #else
> static inline void rte_sev(void)
> {
> }
> static inline void rte_wfe(void)
> {
>         rte_pause();
> }
> #endif
>
> And then just one common version of the _wait_ functions:
>
> static __rte_always_inline void
> rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
> {
>         if (__atomic_load_n(addr, memorder) != expected) {
>                 rte_sev();
>                 do {
>                         rte_wfe();
>                 } while (__atomic_load_n(addr, memorder) != expected);
>         }
> }
>
>

[snip]

> > Here, change this check to:
> >
> > #ifndef RTE_ARCH_HAS_WFE
> >
> >
> > > +#ifndef RTE_ARM_USE_WFE
>
> Maybe use an arch-neutral name here?
> RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED or so?

Yes, better than my suggestion.


Gavin, I noticed that you added an #ifndef ARM in the generic
rte_spinlock.h header later in this series.
Please, can you think of a way to avoid this?
Maybe by applying the same pattern to the spinlock?
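
One possible shape for this, as a sketch only: if the generic lock waits through the wait-until-equal helper, the arch difference stays inside that helper and the spinlock header needs no Arm check. All sketch_ names and the uint32_t lock word are assumptions made for illustration, not the layout of rte_spinlock_t or what the series actually does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lock type for the sketch: 0 = free, 1 = taken. */
typedef struct {
	volatile uint32_t locked;
} sketch_spinlock_t;

/* Stand-in for rte_wait_until_equal_32(); arch specifics would live here. */
static inline void
sketch_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	while (__atomic_load_n(addr, memorder) != expected)
		; /* rte_pause() or WFE, depending on the arch helper */
}

static inline void
sketch_spinlock_lock(sketch_spinlock_t *sl)
{
	uint32_t exp = 0;

	/* Try to take the lock; on failure wait until it looks free again. */
	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
		sketch_wait_until_equal_32(&sl->locked, 0, __ATOMIC_RELAXED);
		exp = 0; /* the failed exchange overwrote exp */
	}
}

static inline void
sketch_spinlock_unlock(sketch_spinlock_t *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```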


Thanks.

--
David Marchand


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal
  2019-10-22  9:36     ` Ananyev, Konstantin
  2019-10-22 10:17       ` David Marchand
@ 2019-10-22 16:03       ` Gavin Hu (Arm Technology China)
  1 sibling, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-22 16:03 UTC (permalink / raw)
  To: Ananyev, Konstantin, David Marchand
  Cc: dev, nd, thomas, Stephen Hemminger, hemant.agrawal, jerinj,
	Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd


> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Tuesday, October 22, 2019 5:37 PM
> To: David Marchand <david.marchand@redhat.com>; Gavin Hu (Arm
> Technology China) <Gavin.Hu@arm.com>
> Cc: dev <dev@dpdk.org>; nd <nd@arm.com>; thomas@monjalon.net;
> Stephen Hemminger <stephen@networkplumber.org>;
> hemant.agrawal@nxp.com; jerinj@marvell.com; Pavan Nikhilesh
> <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal
> 
> > > The rte_wait_until_equal_xx APIs abstract the functionality of
> > > 'polling for a memory location to become equal to a given value'.
> > >
> > > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > > by default. When it is enabled, the above APIs will call WFE instruction
> > > to save CPU cycles and power.
> > >
> > > From a VM, calling this API on aarch64 may trap in and out to
> > > release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
> > > adaptive trapping mechanism has been introduced to balance the latency and
> > > workload.
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > > Acked-by: Jerin Jacob <jerinj@marvell.com>
> > > ---
> > >  config/arm/meson.build                             |  1 +
> > >  config/common_base                                 |  5 ++
> > >  .../common/include/arch/arm/rte_pause_64.h         | 26 ++++++
> > >  lib/librte_eal/common/include/generic/rte_pause.h  | 93
> ++++++++++++++++++++++
> > >  4 files changed, 125 insertions(+)
> > >
> > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > index 979018e..b4b4cac 100644
> > > --- a/config/arm/meson.build
> > > +++ b/config/arm/meson.build
> > > @@ -26,6 +26,7 @@ flags_common_default = [
> > >         ['RTE_LIBRTE_AVP_PMD', false],
> > >
> > >         ['RTE_SCHED_VECTOR', false],
> > > +       ['RTE_ARM_USE_WFE', false],
> > >  ]
> > >
> > >  flags_generic = [
> > > diff --git a/config/common_base b/config/common_base
> > > index e843a21..c812156 100644
> > > --- a/config/common_base
> > > +++ b/config/common_base
> > > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> > >  CONFIG_RTE_MALLOC_DEBUG=n
> > >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> > >  CONFIG_RTE_USE_LIBBSD=n
> > > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > > +# calling these APIs put the cores in low power state while waiting
> > > +# for the memory address to become equal to the expected value.
> > > +# This is supported only by aarch64.
> > > +CONFIG_RTE_ARM_USE_WFE=n
> > >
> > >  #
> > >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power
> testing.
> > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > index 93895d3..eb8f73e 100644
> > > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > @@ -1,5 +1,6 @@
> > >  /* SPDX-License-Identifier: BSD-3-Clause
> > >   * Copyright(c) 2017 Cavium, Inc
> > > + * Copyright(c) 2019 Arm Limited
> > >   */
> >
> > Before including generic/rte_pause.h, put a check like:
> >
> > #ifdef RTE_ARM_USE_WFE
> > #define RTE_ARCH_HAS_WFE
> > #endif
> >
> > #include "generic/rte_pause.h"
> >
> > >
> > >  #ifndef _RTE_PAUSE_ARM64_H_
> > > @@ -17,6 +18,31 @@ static inline void rte_pause(void)
> > >         asm volatile("yield" ::: "memory");
> > >  }
> > >
> > > +#ifdef RTE_ARM_USE_WFE
> > > +#define sev()  { asm volatile("sev" : : : "memory") }
> > > +#define wfe()  { asm volatile("wfe" : : : "memory") }
> > > +
> > > +#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder) \
> > > +__rte_experimental                                             \
> >
> > The experimental tag is unnecessary here.
> > We only need it in the function prototype (in the generic header).
> >
> > > +static __rte_always_inline void                                        \
> > > +rte_wait_until_equal_##size(volatile type * addr, type expected,\
> > > +int memorder)                                                  \
> > > +{                                                              \
> > > +       if (__atomic_load_n(addr, memorder) != expected) {      \
> > > +               sev();                                                  \
> > > +               do {                                                    \
> > > +                       wfe();                                          \
> > > +               } while (__atomic_load_n(addr, memorder) != expected);  \
> > > +        }                                                              \
> > > +}
> > > +__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
> > > +__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
> > > +__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
> > > +
> > > +#undef __WAIT_UNTIL_EQUAL
> 
> Instead of defining and undefining these macros, maybe just
> define these 3 functions explicitly?
> They are really small now, so I think that would be the easier and cleaner way.
> Yes, it means a bit of code duplication, but as I said, they are really small now.
> The same applies to the generic version.
OK, no problem. An alternative way to reduce the duplication is to use _Generic (supported since gcc-4.9).
As you said, the functions are small now, so I will take your suggestion.
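
For the record, a minimal sketch of what the _Generic alternative might look like. The macro and helper names are hypothetical; both plain and volatile pointer types are listed because _Generic matches the argument's type exactly.

```c
#include <assert.h>
#include <stdint.h>

static inline void
sketch_gen_wait_16(volatile uint16_t *addr, uint16_t expected, int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	while (__atomic_load_n(addr, memorder) != expected)
		; /* rte_pause() would go here in DPDK */
}

static inline void
sketch_gen_wait_32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	while (__atomic_load_n(addr, memorder) != expected)
		;
}

static inline void
sketch_gen_wait_64(volatile uint64_t *addr, uint64_t expected, int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	while (__atomic_load_n(addr, memorder) != expected)
		;
}

/* C11 type dispatch: picks the helper matching the pointer's width. */
#define sketch_wait_until_equal(addr, expected, memorder)		\
	_Generic((addr),						\
		uint16_t *: sketch_gen_wait_16,				\
		volatile uint16_t *: sketch_gen_wait_16,		\
		uint32_t *: sketch_gen_wait_32,				\
		volatile uint32_t *: sketch_gen_wait_32,		\
		uint64_t *: sketch_gen_wait_64,				\
		volatile uint64_t *: sketch_gen_wait_64			\
	)((addr), (expected), (memorder))
```

_Generic only selects among existing functions, so the per-width helpers still have to be written; what it saves is the size suffix at the call site.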
> >
> > Missing #undef on sev and wfe macros.
> 
> Actually, should we undefine them at all?
> Or should we add the rte_ prefix (which is needed anyway, I suppose)
> and have them always defined?
> Maybe you can reuse them in other Arm-specific places too (spinlock,
> rwlock, etc.).
> It is probably possible to make them either emit the proper
> instruction or act as a NOP;
> then you'll need RTE_ARM_USE_WFE only around these macros.
> 
> I.e.:
> 
> #ifdef RTE_ARM_USE_WFE
> #define rte_sev()  do { asm volatile("sev" : : : "memory"); } while (0)
> #define rte_wfe()  do { asm volatile("wfe" : : : "memory"); } while (0)
> #else
> static inline void rte_sev(void)
> {
> }
> static inline void rte_wfe(void)
> {
> 	rte_pause();
> }
> #endif
> 
> And then just one common version of the _wait_ functions:
Good suggestion, will fix in v9.
> 
> static __rte_always_inline void
> rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
> {
> 	if (__atomic_load_n(addr, memorder) != expected) {
> 		rte_sev();
> 		do {
> 			rte_wfe();
> 		} while (__atomic_load_n(addr, memorder) != expected);
> 	}
> }
> 
> 
> >
> >
> > > +
> > > +#endif /* RTE_ARM_USE_WFE */
> > > +
> > >  #ifdef __cplusplus
> > >  }
> > >  #endif
> > > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h
> b/lib/librte_eal/common/include/generic/rte_pause.h
> > > index 52bd4db..80597a9 100644
> > > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > > @@ -1,5 +1,6 @@
> > >  /* SPDX-License-Identifier: BSD-3-Clause
> > >   * Copyright(c) 2017 Cavium, Inc
> > > + * Copyright(c) 2019 Arm Limited
> > >   */
> > >
> > >  #ifndef _RTE_PAUSE_H_
> > > @@ -12,6 +13,11 @@
> > >   *
> > >   */
> > >
> > > +#include <stdint.h>
> > > +#include <rte_common.h>
> > > +#include <rte_atomic.h>
> > > +#include <rte_compat.h>
> > > +
> > >  /**
> > >   * Pause CPU execution for a short while
> > >   *
> > > @@ -20,4 +26,91 @@
> > >   */
> > >  static inline void rte_pause(void);
> > >
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change, or be removed, without
> prior notice
> > > + *
> > > + * Wait for *addr to be updated with a 16-bit expected value, with a
> relaxed
> > > + * memory ordering model meaning the loads around this API can be
> reordered.
> > > + *
> > > + * @param addr
> > > + *  A pointer to the memory location.
> > > + * @param expected
> > > + *  A 16-bit expected value to be in the memory location.
> > > + * @param memorder
> > > + *  Two different memory orders that can be specified:
> > > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > > + *  C++11 memory orders with the same names, see the C++11 standard
> or
> > > + *  the GCC wiki on atomic synchronization for detailed definition.
> > > + */
> > > +__rte_experimental
> > > +static __rte_always_inline void
> > > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > > +int memorder);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change, or be removed, without
> prior notice
> > > + *
> > > + * Wait for *addr to be updated with a 32-bit expected value, with a
> relaxed
> > > + * memory ordering model meaning the loads around this API can be
> reordered.
> > > + *
> > > + * @param addr
> > > + *  A pointer to the memory location.
> > > + * @param expected
> > > + *  A 32-bit expected value to be in the memory location.
> > > + * @param memorder
> > > + *  Two different memory orders that can be specified:
> > > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > > + *  C++11 memory orders with the same names, see the C++11 standard
> or
> > > + *  the GCC wiki on atomic synchronization for detailed definition.
> > > + */
> > > +__rte_experimental
> > > +static __rte_always_inline void
> > > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > > +int memorder);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change, or be removed, without
> prior notice
> > > + *
> > > + * Wait for *addr to be updated with a 64-bit expected value, with a
> relaxed
> > > + * memory ordering model meaning the loads around this API can be
> reordered.
> > > + *
> > > + * @param addr
> > > + *  A pointer to the memory location.
> > > + * @param expected
> > > + *  A 64-bit expected value to be in the memory location.
> > > + * @param memorder
> > > + *  Two different memory orders that can be specified:
> > > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > > + *  C++11 memory orders with the same names, see the C++11 standard
> or
> > > + *  the GCC wiki on atomic synchronization for detailed definition.
> > > + */
> > > +__rte_experimental
> > > +static __rte_always_inline void
> > > +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> > > +int memorder);
> > > +
> >
> > Here, change this check to:
> >
> > #ifndef RTE_ARCH_HAS_WFE
> >
> >
> > > +#ifndef RTE_ARM_USE_WFE
> 
> Maybe use an arch-neutral name here?
> RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED or so?
> 
> > > +#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder) \
> > > +__rte_experimental \
> > > +static __rte_always_inline void                                        \
> > > +rte_wait_until_equal_##size(volatile type * addr, type expected,\
> > > +int memorder)                                                  \
> > > +{                                                              \
> > > +       if (__atomic_load_n(addr, memorder) != expected) {      \
> > > +               do {                                                    \
> > > +                       rte_pause();                                    \
> > > +               } while (__atomic_load_n(addr, memorder) != expected);  \
> > > +       }                                                               \
> > > +}
> > > +__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
> > > +__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
> > > +__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
> > > +
> > > +#undef __WAIT_UNTIL_EQUAL
> > > +
> > > +#endif /* RTE_ARM_USE_WFE */
> > > +
> > >  #endif /* _RTE_PAUSE_H_ */
> > > --
> > > 2.7.4
> > >
> >
> > Thanks.
> >
> > --
> > David Marchand


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal
  2019-10-22 10:17       ` David Marchand
@ 2019-10-22 16:05         ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-22 16:05 UTC (permalink / raw)
  To: David Marchand, Ananyev, Konstantin
  Cc: dev, nd, thomas, Stephen Hemminger, hemant.agrawal, jerinj,
	Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Tuesday, October 22, 2019 6:17 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev
> <dev@dpdk.org>; nd <nd@arm.com>; thomas@monjalon.net; Stephen
> Hemminger <stephen@networkplumber.org>; hemant.agrawal@nxp.com;
> jerinj@marvell.com; Pavan Nikhilesh <pbhagavatula@marvell.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> (Arm Technology China) <Ruifeng.Wang@arm.com>; Phil Yang (Arm
> Technology China) <Phil.Yang@arm.com>; Steve Capper
> <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v8 2/6] eal: add the APIs to wait until equal
> 
> On Tue, Oct 22, 2019 at 11:37 AM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> > > > The rte_wait_until_equal_xx APIs abstract the functionality of
> > > > 'polling for a memory location to become equal to a given value'.
> > > >
> > > > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > > > by default. When it is enabled, the above APIs will call WFE instruction
> > > > to save CPU cycles and power.
> > > >
> > > > From a VM, calling this API on aarch64 may trap in and out to
> > > > release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
> > > > adaptive trapping mechanism has been introduced to balance the latency and
> > > > workload.
> > > >
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > > > Acked-by: Jerin Jacob <jerinj@marvell.com>
> > > > ---
> > > >  config/arm/meson.build                             |  1 +
> > > >  config/common_base                                 |  5 ++
> > > >  .../common/include/arch/arm/rte_pause_64.h         | 26 ++++++
> > > >  lib/librte_eal/common/include/generic/rte_pause.h  | 93
> ++++++++++++++++++++++
> > > >  4 files changed, 125 insertions(+)
> > > >
> > > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > > index 979018e..b4b4cac 100644
> > > > --- a/config/arm/meson.build
> > > > +++ b/config/arm/meson.build
> > > > @@ -26,6 +26,7 @@ flags_common_default = [
> > > >         ['RTE_LIBRTE_AVP_PMD', false],
> > > >
> > > >         ['RTE_SCHED_VECTOR', false],
> > > > +       ['RTE_ARM_USE_WFE', false],
> > > >  ]
> > > >
> > > >  flags_generic = [
> > > > diff --git a/config/common_base b/config/common_base
> > > > index e843a21..c812156 100644
> > > > --- a/config/common_base
> > > > +++ b/config/common_base
> > > > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> > > >  CONFIG_RTE_MALLOC_DEBUG=n
> > > >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> > > >  CONFIG_RTE_USE_LIBBSD=n
> > > > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > > > +# calling these APIs put the cores in low power state while waiting
> > > > +# for the memory address to become equal to the expected value.
> > > > +# This is supported only by aarch64.
> > > > +CONFIG_RTE_ARM_USE_WFE=n
> > > >
> > > >  #
> > > >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power
> testing.
> > > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > index 93895d3..eb8f73e 100644
> > > > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > @@ -1,5 +1,6 @@
> > > >  /* SPDX-License-Identifier: BSD-3-Clause
> > > >   * Copyright(c) 2017 Cavium, Inc
> > > > + * Copyright(c) 2019 Arm Limited
> > > >   */
> > >
> > > Before including generic/rte_pause.h, put a check like:
> > >
> > > #ifdef RTE_ARM_USE_WFE
> > > #define RTE_ARCH_HAS_WFE
> > > #endif
> > >
> > > #include "generic/rte_pause.h"
> > >
> > > >
> > > >  #ifndef _RTE_PAUSE_ARM64_H_
> > > > @@ -17,6 +18,31 @@ static inline void rte_pause(void)
> > > >         asm volatile("yield" ::: "memory");
> > > >  }
> > > >
> > > > +#ifdef RTE_ARM_USE_WFE
> > > > +#define sev()  { asm volatile("sev" : : : "memory") }
> > > > +#define wfe()  { asm volatile("wfe" : : : "memory") }
> > > > +
> > > > +#define __WAIT_UNTIL_EQUAL(type, size, addr, expected, memorder)
> \
> > > > +__rte_experimental                                             \
> > >
> > > The experimental tag is unnecessary here.
> > > We only need it in the function prototype (in the generic header).
> > >
> > > > +static __rte_always_inline void                                        \
> > > > +rte_wait_until_equal_##size(volatile type * addr, type expected,\
> > > > +int memorder)                                                  \
> > > > +{                                                              \
> > > > +       if (__atomic_load_n(addr, memorder) != expected) {      \
> > > > +               sev();                                                  \
> > > > +               do {                                                    \
> > > > +                       wfe();                                          \
> > > > +               } while (__atomic_load_n(addr, memorder) != expected);  \
> > > > +        }                                                              \
> > > > +}
> > > > +__WAIT_UNTIL_EQUAL(uint16_t, 16, addr, expected, memorder)
> > > > +__WAIT_UNTIL_EQUAL(uint32_t, 32, addr, expected, memorder)
> > > > +__WAIT_UNTIL_EQUAL(uint64_t, 64, addr, expected, memorder)
> > > > +
> > > > +#undef __WAIT_UNTIL_EQUAL
> >
> > Maybe, instead of defining/undefining these macros, just
> > define these 3 functions explicitly?
> > Now they are really small, so I think it would be an easier/cleaner way.
> > Yes, a bit of code duplication, but as I said they are really small now.
> > Same thought about generic version.
> 
> I don't really like those macros defining inlines either.
> I am fine with this little duplication, so +1 from me.
> 
> 
> >
> > >
> > > Missing #undef on sev and wfe macros.
> >
> > Actually should we undefine them?
> > Or should we add rte_ prefix (which is needed anyway I suppose)
> > and have them always defined?
> > Maybe you can reuse them in other Arm-specific places too (spinlock,
> > rwlock, etc.).
> > Actually, it is probably possible to make them emit either the proper
> > instructions or a NOP;
> > then you'll need RTE_ARM_USE_WFE only around these macros.
> 
> Interesting idea, but only if it gets used in this series.
> I don't want to see stuff that will not be used later.
> 
> >
> > I.E
> >
> > #ifdef RTE_ARM_USE_WFE
> > #define rte_sev()  { asm volatile("sev" : : : "memory"); }
> > #define rte_wfe()  { asm volatile("wfe" : : : "memory"); }
> > #else
> > static inline void rte_sev(void)
> > {
> > }
> > static inline void rte_wfe(void)
> > {
> >         rte_pause();
> > }
> > #endif
> >
> > And then just one common version of the _wait_ functions:
> >
> > static __rte_always_inline void
> > rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > 		int memorder)
> > {
> >         if (__atomic_load_n(addr, memorder) != expected) {
> >                 rte_sev();
> >                 do {
> >                         rte_wfe();
> >                 } while (__atomic_load_n(addr, memorder) != expected);
> >         }
> > }
> >
> >
> 
> [snip]
> 
> > > Here, change this check to:
> > >
> > > #ifndef RTE_ARCH_HAS_WFE
> > >
> > >
> > > > +#ifndef RTE_ARM_USE_WFE
> >
> > Maybe use an arch-neutral name here?
> > RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED or so?
> 
> Yes, better than my suggestion.
> 
> 
> Gavin, I noticed that you added a #ifndef ARM in the generic
> rte_spinlock.h header later in this series.
> Please, can you think of a way to avoid this?
> Maybe applying the same pattern to the spinlock?
David, I will look into this and the other comments and fix them in v9. Thanks for all your comments!
/Gavin
> 
> Thanks.
> 
> --
> David Marchand
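
The arch-neutral guard discussed above can be sketched roughly as follows.
This is only an illustration of the pattern, not code from the series: the
demo_ prefix, DEMO_WAIT_UNTIL_EQUAL_ARCH_DEFINED and demo_pause() are
hypothetical stand-ins for the final DPDK names. An arch header with its own
implementation would define the guard before including the generic header,
so the generic header never needs an Arm-specific #ifdef.

```c
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(); the real one emits an arch hint. */
static inline void demo_pause(void)
{
	__asm__ volatile("" ::: "memory");
}

/*
 * Generic fallback. An arch header that supplies its own version would
 * define DEMO_WAIT_UNTIL_EQUAL_ARCH_DEFINED (assumed name) beforehand.
 */
#ifndef DEMO_WAIT_UNTIL_EQUAL_ARCH_DEFINED
static inline void
demo_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	/* Poll until *addr equals expected, pausing between loads. */
	while (__atomic_load_n(addr, memorder) != expected)
		demo_pause();
}
#endif
```

Passing memorder as a parameter, as Konstantin suggested, keeps one function
per width instead of one per width and ordering.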


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-10-17 16:44   ` Ananyev, Konstantin
@ 2019-10-23 16:20     ` Gavin Hu (Arm Technology China)
  2019-10-23 16:29       ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-23 16:20 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd

Hi Konstantin,

> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Friday, October 18, 2019 12:44 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; hemant.agrawal@nxp.com;
> jerinj@marvell.com; pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
> 
> 
> >
> > Hi Gavin,
> >
> > >
> > > The rte_wait_until_equal_xx APIs abstract the functionality of
> > > 'polling for a memory location to become equal to a given value'.
> > >
> > > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > > by default. When it is enabled, the above APIs will call WFE instruction
> > > to save CPU cycles and power.
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > > ---
> > >  config/arm/meson.build                             |   1 +
> > >  config/common_base                                 |   5 +
> > >  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
> > >  lib/librte_eal/common/include/generic/rte_pause.h  | 106
> +++++++++++++++++++++
> > >  4 files changed, 142 insertions(+)
> > >
> > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > index 979018e..b4b4cac 100644
> > > --- a/config/arm/meson.build
> > > +++ b/config/arm/meson.build
> > > @@ -26,6 +26,7 @@ flags_common_default = [
> > >  	['RTE_LIBRTE_AVP_PMD', false],
> > >
> > >  	['RTE_SCHED_VECTOR', false],
> > > +	['RTE_ARM_USE_WFE', false],
> > >  ]
> > >
> > >  flags_generic = [
> > > diff --git a/config/common_base b/config/common_base
> > > index 8ef75c2..8861713 100644
> > > --- a/config/common_base
> > > +++ b/config/common_base
> > > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> > >  CONFIG_RTE_MALLOC_DEBUG=n
> > >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> > >  CONFIG_RTE_USE_LIBBSD=n
> > > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > > +# calling these APIs put the cores in low power state while waiting
> > > +# for the memory address to become equal to the expected value.
> > > +# This is supported only by aarch64.
> > > +CONFIG_RTE_ARM_USE_WFE=n
> > >
> > >  #
> > >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power
> testing.
> > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > index 93895d3..dabde17 100644
> > > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > @@ -1,5 +1,6 @@
> > >  /* SPDX-License-Identifier: BSD-3-Clause
> > >   * Copyright(c) 2017 Cavium, Inc
> > > + * Copyright(c) 2019 Arm Limited
> > >   */
> > >
> > >  #ifndef _RTE_PAUSE_ARM64_H_
> > > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> > >  	asm volatile("yield" ::: "memory");
> > >  }
> > >
> > > +#ifdef RTE_ARM_USE_WFE
> > > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > > +static __rte_always_inline void \
> > > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > > +{ \
> > > +	type tmp; \
> > > +	asm volatile( \
> > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > > +		"b.eq	2f\n" \
> > > +		"sevl\n" \
> > > +		"1:	wfe\n" \
> > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > > +		"bne	1b\n" \
> > > +		"2:\n" \
> > > +		: [tmp] "=&r" (tmp) \
> > > +		: [addr] "Q"(*addr), [expected] "r"(expected) \
> > > +		: "cc", "memory"); \
> > > +}
> 
> One more thought:
> Why do you need to write asm code for the whole procedure?
> Why not to do like linux kernel:
> define wfe() and sev() macros and use them inside normal C code?
> 
> #define sev()		asm volatile("sev" : : : "memory")
> #define wfe()		asm volatile("wfe" : : : "memory")
> 
> Then:
> rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int
> memorder)
> {
>      if (__atomic_load_n(addr, memorder) != expected) {
>          sev();
>          do {
>              wfe();
>          } while (__atomic_load_n(addr, memorder) != expected);
>      }
> }
> 
> ?
A really good suggestion. I had already made the corresponding changes in v8, but after internal discussion it turns out that they miss an Armv8-specific requirement.
WFE waits (sleeps) on a 'monitored' address and is woken up when someone writes to that address, so before WFE we have to issue a load-exclusive instruction to arm the monitor.
__atomic_load_n, which disassembles to a plain "ldr", does not do so. For 16-bit values, for example, we have to use "ldxrh" for relaxed memory ordering and "ldaxrh" for acquire ordering.

Let me think again about going back to the full assembly procedure or implementing a 'load-exclusive' function. What do you think?
/Gavin
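
A 'load-exclusive' helper of the kind mentioned above could look roughly
like this. It is a sketch under assumptions, not the code that was posted:
all demo_ names are hypothetical, the acquire flag stands in for a memorder
argument, and the non-aarch64 branch exists only so the sketch compiles
elsewhere (it degrades to a plain polling loop).

```c
#include <stdint.h>

#if defined(__aarch64__)
/* Load-exclusive arms the monitor on *addr, so a later store wakes WFE. */
static inline uint16_t demo_load_excl_16(volatile uint16_t *addr, int acquire)
{
	uint16_t val;

	if (acquire)
		__asm__ volatile("ldaxrh %w[val], [%x[addr]]"
			: [val] "=&r"(val) : [addr] "r"(addr) : "memory");
	else
		__asm__ volatile("ldxrh %w[val], [%x[addr]]"
			: [val] "=&r"(val) : [addr] "r"(addr) : "memory");
	return val;
}
static inline void demo_sevl(void) { __asm__ volatile("sevl" ::: "memory"); }
static inline void demo_wfe(void) { __asm__ volatile("wfe" ::: "memory"); }
#else
/* Fallback for other architectures: plain atomic load, no event wait. */
static inline uint16_t demo_load_excl_16(volatile uint16_t *addr, int acquire)
{
	return __atomic_load_n(addr,
			acquire ? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED);
}
static inline void demo_sevl(void) { }
static inline void demo_wfe(void) { __asm__ volatile("" ::: "memory"); }
#endif

static inline void
demo_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
		int acquire)
{
	if (demo_load_excl_16(addr, acquire) != expected) {
		/* Set the local event so the first WFE returns at once. */
		demo_sevl();
		do {
			demo_wfe();
			/* Re-arm the monitor before the next WFE. */
		} while (demo_load_excl_16(addr, acquire) != expected);
	}
}
```

The wait loop stays in C; only the load that arms the monitor and the two
event instructions need inline assembly.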

> > > +/* Wait for *addr to be updated with expected value */
> > > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> > > +#endif
> > > +
> > >  #ifdef __cplusplus
> > >  }
> > >  #endif
> > > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h
> b/lib/librte_eal/common/include/generic/rte_pause.h
> > > index 52bd4db..8906473 100644
> > > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > > @@ -1,5 +1,6 @@
> > >  /* SPDX-License-Identifier: BSD-3-Clause
> > >   * Copyright(c) 2017 Cavium, Inc
> > > + * Copyright(c) 2019 Arm Limited
> > >   */
> > >
> > >  #ifndef _RTE_PAUSE_H_
> > > @@ -12,6 +13,10 @@
> > >   *
> > >   */
> > >
> > > +#include <stdint.h>
> > > +#include <rte_common.h>
> > > +#include <rte_atomic.h>
> > > +
> > >  /**
> > >   * Pause CPU execution for a short while
> > >   *
> > > @@ -20,4 +25,105 @@
> > >   */
> > >  static inline void rte_pause(void);
> > >
> > > +/**
> > > + * Wait for *addr to be updated with a 16-bit expected value, with a
> relaxed
> > > + * memory ordering model meaning the loads around this API can be
> reordered.
> > > + *
> > > + * @param addr
> > > + *  A pointer to the memory location.
> > > + * @param expected
> > > + *  A 16-bit expected value to be in the memory location.
> > > + */
> > > +__rte_always_inline
> > > +static void
> > > +rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t
> expected);
> > > +
> > > +/**
> > > + * Wait for *addr to be updated with a 32-bit expected value, with a
> relaxed
> > > + * memory ordering model meaning the loads around this API can be
> reordered.
> > > + *
> > > + * @param addr
> > > + *  A pointer to the memory location.
> > > + * @param expected
> > > + *  A 32-bit expected value to be in the memory location.
> > > + */
> > > +__rte_always_inline
> > > +static void
> > > +rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t
> expected);
> > > +
> > > +/**
> > > + * Wait for *addr to be updated with a 64-bit expected value, with a
> relaxed
> > > + * memory ordering model meaning the loads around this API can be
> reordered.
> > > + *
> > > + * @param addr
> > > + *  A pointer to the memory location.
> > > + * @param expected
> > > + *  A 64-bit expected value to be in the memory location.
> > > + */
> > > +__rte_always_inline
> > > +static void
> > > +rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t
> expected);
> > > +
> > > +/**
> > > + * Wait for *addr to be updated with a 16-bit expected value, with an
> acquire
> > > + * memory ordering model meaning the loads after this API can't be
> observed
> > > + * before this API.
> > > + *
> > > + * @param addr
> > > + *  A pointer to the memory location.
> > > + * @param expected
> > > + *  A 16-bit expected value to be in the memory location.
> > > + */
> > > +__rte_always_inline
> > > +static void
> > > +rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t
> expected);
> > > +
> > > +/**
> > > + * Wait for *addr to be updated with a 32-bit expected value, with an
> acquire
> > > + * memory ordering model meaning the loads after this API can't be
> observed
> > > + * before this API.
> > > + *
> > > + * @param addr
> > > + *  A pointer to the memory location.
> > > + * @param expected
> > > + *  A 32-bit expected value to be in the memory location.
> > > + */
> > > +__rte_always_inline
> > > +static void
> > > +rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t
> expected);
> >
> > LGTM in general.
> > One stylistic thing: wouldn't it be better to have an API like this:
> > rte_wait_until_equal_acquire_X(addr, expected, memory_order)
> > ?
> >
> > I.e. pass memorder as a parameter instead of incorporating it into the
> > function name?
> > Fewer functions, plus the user can specify the order themselves.
> > Plus it looks similar to the C11 atomic intrinsics.
> >
> >
> > > +
> > > +/**
> > > + * Wait for *addr to be updated with a 64-bit expected value, with an
> acquire
> > > + * memory ordering model meaning the loads after this API can't be
> observed
> > > + * before this API.
> > > + *
> > > + * @param addr
> > > + *  A pointer to the memory location.
> > > + * @param expected
> > > + *  A 64-bit expected value to be in the memory location.
> > > + */
> > > +__rte_always_inline
> > > +static void
> > > +rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t
> expected);
> > > +
> > > +#if !defined(RTE_ARM_USE_WFE)
> > > +#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
> > > +__rte_always_inline \
> > > +static void	\
> > > +rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
> > > +	type expected) \
> > > +{ \
> > > +	while (__atomic_load_n(addr, memorder) != expected) \
> > > +		rte_pause(); \
> > > +}
> > > +
> > > +/* Wait for *addr to be updated with expected value */
> > > +__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
> > > +__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
> > > +__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
> > > +__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
> > > +__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
> > > +__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
> > > +#endif /* RTE_ARM_USE_WFE */
> > > +
> > >  #endif /* _RTE_PAUSE_H_ */
> > > --
> > > 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-10-23 16:20     ` Gavin Hu (Arm Technology China)
@ 2019-10-23 16:29       ` Gavin Hu (Arm Technology China)
  2019-10-24 10:21         ` Ananyev, Konstantin
  0 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-23 16:29 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd, nd

Hi Konstantin,

> -----Original Message-----
> From: Gavin Hu (Arm Technology China)
> Sent: Thursday, October 24, 2019 12:20 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; hemant.agrawal@nxp.com;
> jerinj@marvell.com; pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>; nd
> <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
> 
> Hi Konstantin,
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Sent: Friday, October 18, 2019 12:44 AM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> > dev@dpdk.org
> > Cc: nd <nd@arm.com>; thomas@monjalon.net;
> > stephen@networkplumber.org; hemant.agrawal@nxp.com;
> > jerinj@marvell.com; pbhagavatula@marvell.com; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology
> China)
> > <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> > <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
> >
> >
> > >
> > > Hi Gavin,
> > >
> > > >
> > > > The rte_wait_until_equal_xx APIs abstract the functionality of
> > > > 'polling for a memory location to become equal to a given value'.
> > > >
> > > > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > > > by default. When it is enabled, the above APIs will call WFE instruction
> > > > to save CPU cycles and power.
> > > >
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > > > ---
> > > >  config/arm/meson.build                             |   1 +
> > > >  config/common_base                                 |   5 +
> > > >  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
> > > >  lib/librte_eal/common/include/generic/rte_pause.h  | 106
> > +++++++++++++++++++++
> > > >  4 files changed, 142 insertions(+)
> > > >
> > > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > > index 979018e..b4b4cac 100644
> > > > --- a/config/arm/meson.build
> > > > +++ b/config/arm/meson.build
> > > > @@ -26,6 +26,7 @@ flags_common_default = [
> > > >  	['RTE_LIBRTE_AVP_PMD', false],
> > > >
> > > >  	['RTE_SCHED_VECTOR', false],
> > > > +	['RTE_ARM_USE_WFE', false],
> > > >  ]
> > > >
> > > >  flags_generic = [
> > > > diff --git a/config/common_base b/config/common_base
> > > > index 8ef75c2..8861713 100644
> > > > --- a/config/common_base
> > > > +++ b/config/common_base
> > > > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> > > >  CONFIG_RTE_MALLOC_DEBUG=n
> > > >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> > > >  CONFIG_RTE_USE_LIBBSD=n
> > > > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > > > +# calling these APIs put the cores in low power state while waiting
> > > > +# for the memory address to become equal to the expected value.
> > > > +# This is supported only by aarch64.
> > > > +CONFIG_RTE_ARM_USE_WFE=n
> > > >
> > > >  #
> > > >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power
> > testing.
> > > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > index 93895d3..dabde17 100644
> > > > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > @@ -1,5 +1,6 @@
> > > >  /* SPDX-License-Identifier: BSD-3-Clause
> > > >   * Copyright(c) 2017 Cavium, Inc
> > > > + * Copyright(c) 2019 Arm Limited
> > > >   */
> > > >
> > > >  #ifndef _RTE_PAUSE_ARM64_H_
> > > > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> > > >  	asm volatile("yield" ::: "memory");
> > > >  }
> > > >
> > > > +#ifdef RTE_ARM_USE_WFE
> > > > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > > > +static __rte_always_inline void \
> > > > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > > > +{ \
> > > > +	type tmp; \
> > > > +	asm volatile( \
> > > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > > > +		"b.eq	2f\n" \
> > > > +		"sevl\n" \
> > > > +		"1:	wfe\n" \
> > > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > > > +		"bne	1b\n" \
> > > > +		"2:\n" \
> > > > +		: [tmp] "=&r" (tmp) \
> > > > +		: [addr] "Q"(*addr), [expected] "r"(expected) \
> > > > +		: "cc", "memory"); \
> > > > +}
> >
> > One more thought:
> > Why do you need to write asm code for the whole procedure?
> > Why not to do like linux kernel:
> > define wfe() and sev() macros and use them inside normal C code?
> >
> > #define sev()		asm volatile("sev" : : : "memory")
> > #define wfe()		asm volatile("wfe" : : : "memory")
> >
> > Then:
> > rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int
> > memorder)
> > {
> >      if (__atomic_load_n(addr, memorder) != expected) {
> >          sev();
> >          do {
> >              wfe();
> >          } while (__atomic_load_n(addr, memorder) != expected);
> >      }
> > }
> >
> > ?
> A really good suggestion. I had already made the corresponding changes in
> v8, but after internal discussion it turns out that they miss an
> Armv8-specific requirement.
> WFE waits (sleeps) on a 'monitored' address and is woken up when someone
> writes to that address, so before WFE we have to issue a load-exclusive
> instruction to arm the monitor.
> __atomic_load_n, which disassembles to a plain "ldr", does not do so. For
> 16-bit values, for example, we have to use "ldxrh" for relaxed memory
> ordering and "ldaxrh" for acquire ordering.
> 
> Let me think again about going back to the full assembly procedure or
> implementing a 'load-exclusive' function. What do you think?
> /Gavin
I forgot to mention: the kernel uses wfe() without a preceding load-exclusive instruction because it has other wake-up sources:
1) it relies on the timer to wake up, i.e. __delay()
2) it explicitly calls sev to send wake-up events, for all kinds of locks
3) IPIs.

Our patches can't count on these events, either because they are not available or because of their performance impact.
/Gavin
> 
> > > > +/* Wait for *addr to be updated with expected value */
> > > > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > > > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > > > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > > > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > > > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > > > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> > > > +#endif
> > > > +
> > > >  #ifdef __cplusplus
> > > >  }
> > > >  #endif
> > > > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h
> > b/lib/librte_eal/common/include/generic/rte_pause.h
> > > > index 52bd4db..8906473 100644
> > > > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > > > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > > > @@ -1,5 +1,6 @@
> > > >  /* SPDX-License-Identifier: BSD-3-Clause
> > > >   * Copyright(c) 2017 Cavium, Inc
> > > > + * Copyright(c) 2019 Arm Limited
> > > >   */
> > > >
> > > >  #ifndef _RTE_PAUSE_H_
> > > > @@ -12,6 +13,10 @@
> > > >   *
> > > >   */
> > > >
> > > > +#include <stdint.h>
> > > > +#include <rte_common.h>
> > > > +#include <rte_atomic.h>
> > > > +
> > > >  /**
> > > >   * Pause CPU execution for a short while
> > > >   *
> > > > @@ -20,4 +25,105 @@
> > > >   */
> > > >  static inline void rte_pause(void);
> > > >
> > > > +/**
> > > > + * Wait for *addr to be updated with a 16-bit expected value, with a
> > relaxed
> > > > + * memory ordering model meaning the loads around this API can be
> > reordered.
> > > > + *
> > > > + * @param addr
> > > > + *  A pointer to the memory location.
> > > > + * @param expected
> > > > + *  A 16-bit expected value to be in the memory location.
> > > > + */
> > > > +__rte_always_inline
> > > > +static void
> > > > +rte_wait_until_equal_relaxed_16(volatile uint16_t *addr, uint16_t
> > expected);
> > > > +
> > > > +/**
> > > > + * Wait for *addr to be updated with a 32-bit expected value, with a
> > relaxed
> > > > + * memory ordering model meaning the loads around this API can be
> > reordered.
> > > > + *
> > > > + * @param addr
> > > > + *  A pointer to the memory location.
> > > > + * @param expected
> > > > + *  A 32-bit expected value to be in the memory location.
> > > > + */
> > > > +__rte_always_inline
> > > > +static void
> > > > +rte_wait_until_equal_relaxed_32(volatile uint32_t *addr, uint32_t
> > expected);
> > > > +
> > > > +/**
> > > > + * Wait for *addr to be updated with a 64-bit expected value, with a
> > relaxed
> > > > + * memory ordering model meaning the loads around this API can be
> > reordered.
> > > > + *
> > > > + * @param addr
> > > > + *  A pointer to the memory location.
> > > > + * @param expected
> > > > + *  A 64-bit expected value to be in the memory location.
> > > > + */
> > > > +__rte_always_inline
> > > > +static void
> > > > +rte_wait_until_equal_relaxed_64(volatile uint64_t *addr, uint64_t
> > expected);
> > > > +
> > > > +/**
> > > > + * Wait for *addr to be updated with a 16-bit expected value, with an
> > acquire
> > > > + * memory ordering model meaning the loads after this API can't be
> > observed
> > > > + * before this API.
> > > > + *
> > > > + * @param addr
> > > > + *  A pointer to the memory location.
> > > > + * @param expected
> > > > + *  A 16-bit expected value to be in the memory location.
> > > > + */
> > > > +__rte_always_inline
> > > > +static void
> > > > +rte_wait_until_equal_acquire_16(volatile uint16_t *addr, uint16_t
> > expected);
> > > > +
> > > > +/**
> > > > + * Wait for *addr to be updated with a 32-bit expected value, with an
> > acquire
> > > > + * memory ordering model meaning the loads after this API can't be
> > observed
> > > > + * before this API.
> > > > + *
> > > > + * @param addr
> > > > + *  A pointer to the memory location.
> > > > + * @param expected
> > > > + *  A 32-bit expected value to be in the memory location.
> > > > + */
> > > > +__rte_always_inline
> > > > +static void
> > > > +rte_wait_until_equal_acquire_32(volatile uint32_t *addr, uint32_t
> > expected);
> > >
> > > LGTM in general.
> > > One stylistic thing: wouldn't it be better to have an API like this:
> > > rte_wait_until_equal_acquire_X(addr, expected, memory_order)
> > > ?
> > >
> > > I.e. pass memorder as a parameter instead of incorporating it into the
> > > function name?
> > > Fewer functions, plus the user can specify the order themselves.
> > > Plus it looks similar to the C11 atomic intrinsics.
> > >
> > >
> > > > +
> > > > +/**
> > > > + * Wait for *addr to be updated with a 64-bit expected value, with an
> > acquire
> > > > + * memory ordering model meaning the loads after this API can't be
> > observed
> > > > + * before this API.
> > > > + *
> > > > + * @param addr
> > > > + *  A pointer to the memory location.
> > > > + * @param expected
> > > > + *  A 64-bit expected value to be in the memory location.
> > > > + */
> > > > +__rte_always_inline
> > > > +static void
> > > > +rte_wait_until_equal_acquire_64(volatile uint64_t *addr, uint64_t
> > expected);
> > > > +
> > > > +#if !defined(RTE_ARM_USE_WFE)
> > > > +#define __WAIT_UNTIL_EQUAL(op_name, size, type, memorder) \
> > > > +__rte_always_inline \
> > > > +static void	\
> > > > +rte_wait_until_equal_##op_name##_##size(volatile type *addr, \
> > > > +	type expected) \
> > > > +{ \
> > > > +	while (__atomic_load_n(addr, memorder) != expected) \
> > > > +		rte_pause(); \
> > > > +}
> > > > +
> > > > +/* Wait for *addr to be updated with expected value */
> > > > +__WAIT_UNTIL_EQUAL(relaxed, 16, uint16_t, __ATOMIC_RELAXED)
> > > > +__WAIT_UNTIL_EQUAL(acquire, 16, uint16_t, __ATOMIC_ACQUIRE)
> > > > +__WAIT_UNTIL_EQUAL(relaxed, 32, uint32_t, __ATOMIC_RELAXED)
> > > > +__WAIT_UNTIL_EQUAL(acquire, 32, uint32_t, __ATOMIC_ACQUIRE)
> > > > +__WAIT_UNTIL_EQUAL(relaxed, 64, uint64_t, __ATOMIC_RELAXED)
> > > > +__WAIT_UNTIL_EQUAL(acquire, 64, uint64_t, __ATOMIC_ACQUIRE)
> > > > +#endif /* RTE_ARM_USE_WFE */
> > > > +
> > > >  #endif /* _RTE_PAUSE_H_ */
> > > > --
> > > > 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-10-23 16:29       ` Gavin Hu (Arm Technology China)
@ 2019-10-24 10:21         ` Ananyev, Konstantin
  2019-10-24 10:52           ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Ananyev, Konstantin @ 2019-10-24 10:21 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd, nd



Hi Gavin,
> > > > > The rte_wait_until_equal_xx APIs abstract the functionality of
> > > > > 'polling for a memory location to become equal to a given value'.
> > > > >
> > > > > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > > > > by default. When it is enabled, the above APIs will call WFE instruction
> > > > > to save CPU cycles and power.
> > > > >
> > > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > Reviewed-by: Honnappa Nagarahalli
> > <honnappa.nagarahalli@arm.com>
> > > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > > > > ---
> > > > >  config/arm/meson.build                             |   1 +
> > > > >  config/common_base                                 |   5 +
> > > > >  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
> > > > >  lib/librte_eal/common/include/generic/rte_pause.h  | 106
> > > +++++++++++++++++++++
> > > > >  4 files changed, 142 insertions(+)
> > > > >
> > > > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > > > index 979018e..b4b4cac 100644
> > > > > --- a/config/arm/meson.build
> > > > > +++ b/config/arm/meson.build
> > > > > @@ -26,6 +26,7 @@ flags_common_default = [
> > > > >  	['RTE_LIBRTE_AVP_PMD', false],
> > > > >
> > > > >  	['RTE_SCHED_VECTOR', false],
> > > > > +	['RTE_ARM_USE_WFE', false],
> > > > >  ]
> > > > >
> > > > >  flags_generic = [
> > > > > diff --git a/config/common_base b/config/common_base
> > > > > index 8ef75c2..8861713 100644
> > > > > --- a/config/common_base
> > > > > +++ b/config/common_base
> > > > > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> > > > >  CONFIG_RTE_MALLOC_DEBUG=n
> > > > >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> > > > >  CONFIG_RTE_USE_LIBBSD=n
> > > > > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > > > > +# calling these APIs put the cores in low power state while waiting
> > > > > +# for the memory address to become equal to the expected value.
> > > > > +# This is supported only by aarch64.
> > > > > +CONFIG_RTE_ARM_USE_WFE=n
> > > > >
> > > > >  #
> > > > >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power
> > > testing.
> > > > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > index 93895d3..dabde17 100644
> > > > > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > @@ -1,5 +1,6 @@
> > > > >  /* SPDX-License-Identifier: BSD-3-Clause
> > > > >   * Copyright(c) 2017 Cavium, Inc
> > > > > + * Copyright(c) 2019 Arm Limited
> > > > >   */
> > > > >
> > > > >  #ifndef _RTE_PAUSE_ARM64_H_
> > > > > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> > > > >  	asm volatile("yield" ::: "memory");
> > > > >  }
> > > > >
> > > > > +#ifdef RTE_ARM_USE_WFE
> > > > > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > > > > +static __rte_always_inline void \
> > > > > +rte_wait_until_equal_##name(volatile type * addr, type expected) \
> > > > > +{ \
> > > > > +	type tmp; \
> > > > > +	asm volatile( \
> > > > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > > > > +		"b.eq	2f\n" \
> > > > > +		"sevl\n" \
> > > > > +		"1:	wfe\n" \
> > > > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n" \
> > > > > +		"bne	1b\n" \
> > > > > +		"2:\n" \
> > > > > +		: [tmp] "=&r" (tmp) \
> > > > > +		: [addr] "Q"(*addr), [expected] "r"(expected) \
> > > > > +		: "cc", "memory"); \
> > > > > +}
> > >
> > > One more thought:
> > > Why do you need to write asm code for the whole procedure?
> > > Why not to do like linux kernel:
> > > define wfe() and sev() macros and use them inside normal C code?
> > >
> > > #define sev()		asm volatile("sev" : : : "memory")
> > > #define wfe()		asm volatile("wfe" : : : "memory")
> > >
> > > Then:
> > > rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int
> > > memorder)
> > > {
> > >      if (__atomic_load_n(addr, memorder) != expected) {
> > >          sev();
> > >          do {
> > >              wfe();
> > >          } while (__atomic_load_n(addr, memorder) != expected);
> > >      }
> > > }
> > >
> > > ?
> > A really good suggestion. I made the corresponding changes in v8 already, but
> > after internal discussion we found that it missed an armv8-specific feature.
> > We call wfe to wait/sleep on the 'monitored' address; the core is woken up
> > when someone writes to the monitored address, so before wfe we have to issue
> > a load-exclusive instruction to 'monitor' it.
> > __atomic_load_n, which disassembles to "ldr", does not do so. We have to use
> > "ldxrh" for relaxed memory ordering and "ldaxrh" for acquire ordering, taking
> > the 16-bit case as an example.

Didn't realize that, sorry for confusion caused...

> >
> > Let me re-think coming back to the full assembly procedure or implementing
> > a 'load-exclusive' function. What do you think?

After some thought I am leaning towards a 'load-exclusive' function.
Hopefully it would help you avoid raw asm here and in other places.
What do you think?
Konstantin
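
For readers following the thread, the 'load-exclusive function' idea could be sketched roughly as below. This is only an illustration under stated assumptions: the names load_ex_32, send_event_local, wait_event and wait_until_equal_32 are hypothetical and not taken from the patch, and on non-aarch64 builds the helpers fall back to a plain atomic load and empty stubs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: load *addr and, on aarch64, arm the exclusive
 * monitor so that a later write to *addr wakes a core sleeping in WFE. */
static inline uint32_t
load_ex_32(volatile uint32_t *addr, int memorder)
{
	uint32_t tmp;

	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
#if defined(__aarch64__)
	if (memorder == __ATOMIC_ACQUIRE)
		__asm__ volatile("ldaxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp) : [addr] "r" (addr) : "memory");
	else
		__asm__ volatile("ldxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp) : [addr] "r" (addr) : "memory");
#else
	/* Assumption: no monitor/WFE available, use an ordinary load. */
	tmp = __atomic_load_n(addr, memorder);
#endif
	return tmp;
}

/* Hypothetical wrapper for "sevl": sets the local event register so
 * that the first WFE in the loop below returns immediately. */
static inline void
send_event_local(void)
{
#if defined(__aarch64__)
	__asm__ volatile("sevl" : : : "memory");
#endif
}

/* Hypothetical wrapper for "wfe"; a no-op elsewhere. */
static inline void
wait_event(void)
{
#if defined(__aarch64__)
	__asm__ volatile("wfe" : : : "memory");
#endif
}

/* Plain C wait loop built on the helpers above. */
static inline void
wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	if (load_ex_32(addr, memorder) != expected) {
		send_event_local();
		do {
			wait_event();
		} while (load_ex_32(addr, memorder) != expected);
	}
}
```

The point of the split is that only the load and the two one-instruction wrappers need inline asm; the loop itself stays in C.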

> > /Gavin
> Forgot to mention, the kernel uses wfe() without preceding load-exclusive instructions because:
> 1) it relies on the timer to wake up, i.e. __delay()
> 2) it explicitly calls sev to send wake-up events, for all kinds of locks
> 3) IPI instructions.
> 
> Our patches can't count on these events, either because they are absent or because of their performance impact.
> /Gavin
> >
> > > > > +/* Wait for *addr to be updated with expected value */
> > > > > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > > > > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > > > > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > > > > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > > > > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > > > > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> > > > > +#endif
> > > > > +
> > > > >  #ifdef __cplusplus
> > > > >  }
> > > > >  #endif

^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v9 0/5] use WFE for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (56 preceding siblings ...)
  2019-10-21  9:47 ` [dpdk-dev] [PATCH v8 6/6] event/opdl: " Gavin Hu
@ 2019-10-24 10:42 ` Gavin Hu
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 1/5] bus/fslmc: fix the conflicting dmb function Gavin Hu
                   ` (23 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-24 10:42 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

V9:
- fix a broken weblink (David Marchand)
- define rte_wfe and rte_sev() (Ananyev Konstantin)
- explicitly define three function APIs instead of macros (Ananyev Konstantin)
- incorporate common rte_wfe and rte_sev into the generic rte_spinlock (David Marchand)
- define arch neutral RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED (Ananyev Konstantin)
- define rte_load_ex_16/32/64 functions to use load-exclusive instruction for aarch64, which is required for wake up of WFE
- drop the rte_spinlock patch from this series, as it calls this experimental API and is widely included by many components, each of which would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the future, after the experimental tag is removed.
V8:
- simplify dmb definition to use io barriers (David Marchand)
- define wfe() and sev() macros and use them inside normal C code (Ananyev Konstantin)
- pass memorder as parameter, not to incorporate it into function name, less functions, similar to C11 atomic intrinsics (Ananyev Konstantin)
- remove mandating RTE_FORCE_INTRINSICS in arm spinlock implementation(David Marchand)
- undef __WAIT_UNTIL_EQUAL after use (David Marchand)
- add experimental tag and warning (David Marchand)
- add the limitation of using WFE instruction in the commit log (David Marchand) 
- tweak the use of RTE_FORCE_INTRINSICS (still mandatory for aarch64) and RTE_ARM_USE_WFE for spinlock (David Marchand)
- drop the rte_ring patch from this series, as rte_ring.h calls this API and is widely included by many components, each of which would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the future, after the experimental tag is removed.
V7:
- fix the checkpatch LONG_LINE_COMMENT issue
V6:
- squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
- move the new configuration to the end of EAL
- add doxygen comments to reflect the relaxed and acquire semantics
- correct the meson configuration 
V5:
- add doxygen comments for the new APIs
- spinlock early exit without wfe if the spinlock not taken by others.
- add two patches on top for opdl and thunderx
V4:
- rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for the assembly skeleton to reduce the duplication of code
- add one patch for nxp fslmc to address a compiling error
V3:
- Convert RFCs to patches
V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline 
V1:
- Add the new APIs and use it for ring and locks

Gavin Hu (5):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |   8 +-
 drivers/event/opdl/Makefile                        |   1 +
 drivers/event/opdl/meson.build                     |   1 +
 drivers/event/opdl/opdl_ring.c                     |   5 +-
 drivers/net/thunderx/Makefile                      |   1 +
 drivers/net/thunderx/meson.build                   |   1 +
 drivers/net/thunderx/nicvf_rxtx.c                  |   3 +-
 .../common/include/arch/arm/rte_pause_64.h         |  70 +++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 217 +++++++++++++++++++++
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 12 files changed, 304 insertions(+), 12 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v9 1/5] bus/fslmc: fix the conflicting dmb function
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (57 preceding siblings ...)
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 0/5] use WFE for aarch64 Gavin Hu
@ 2019-10-24 10:42 ` Gavin Hu
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 2/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (22 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-24 10:42 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper, stable

There are two definitions conflicting with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to reuse the EAL definition to avoid conflicts.

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@
VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..68ce38b 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -31,12 +31,10 @@ struct fsl_mc_io {
 #include <errno.h>
 #include <sys/uio.h>
 #include <linux/byteorder/little_endian.h>
+#include <rte_atomic.h>
 
-#ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
-#endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#define __iormb()	rte_io_rmb()
+#define __iowmb()	rte_io_wmb()
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v9 2/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (58 preceding siblings ...)
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 1/5] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-10-24 10:42 ` Gavin Hu
  2019-10-24 13:52   ` Ananyev, Konstantin
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 3/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (21 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-10-24 10:42 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

The rte_wait_until_equal_xx APIs abstract the functionality of
'polling for a memory location to become equal to a given value'.

Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
by default. When it is enabled, the above APIs will call WFE instruction
to save CPU cycles and power.
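
As a rough illustration of what the generic (non-WFE) path boils down to, here is a minimal sketch of a pause-based wait loop that takes the memory order as a parameter. The names cpu_relax and wait_until_equal_16 are hypothetical stand-ins chosen for this example; cpu_relax only approximates what rte_pause() does on each architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): a spin-wait hint. */
static inline void
cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#else
	/* Assumption: no hint instruction, just a compiler barrier. */
	__asm__ volatile("" ::: "memory");
#endif
}

/* Poll *addr until it equals expected. Only acquire and relaxed
 * orders make sense for a load, so anything else is rejected. */
static inline void
wait_until_equal_16(volatile uint16_t *addr, uint16_t expected, int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);

	while (__atomic_load_n(addr, memorder) != expected)
		cpu_relax();
}
```

Passing the memory order as an argument keeps one function per width instead of one per width and ordering, much like the C11 atomic intrinsics.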

From a VM, calling this API on aarch64 may trap in and out to release
vCPUs, which causes high exit latency. Since kernel 4.18.20, an
adaptive trapping mechanism has been introduced to balance the latency and
workload.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 .../common/include/arch/arm/rte_pause_64.h         |  70 +++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 217 +++++++++++++++++++++
 4 files changed, 293 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..b4b4cac 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -26,6 +26,7 @@ flags_common_default = [
 	['RTE_LIBRTE_AVP_PMD', false],
 
 	['RTE_SCHED_VECTOR', false],
+	['RTE_ARM_USE_WFE', false],
 ]
 
 flags_generic = [
diff --git a/config/common_base b/config/common_base
index e843a21..c812156 100644
--- a/config/common_base
+++ b/config/common_base
@@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
+# calling these APIs puts the cores in low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
 
 #
 # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..7bc8efb 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -17,6 +18,75 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+static inline void rte_sevl(void)
+{
+	asm volatile("sevl" : : : "memory");
+}
+
+static inline void rte_wfe(void)
+{
+	asm volatile("wfe" : : : "memory");
+}
+
+static __rte_always_inline uint16_t
+__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
+{
+	uint16_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxrh %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline uint32_t
+__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
+{
+	uint32_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxr %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxr %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline uint64_t
+__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
+{
+	uint64_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxr %x[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxr %x[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..4db44f9 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
@@ -12,6 +13,12 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+#include <rte_compat.h>
+#include <assert.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +27,214 @@
  */
 static inline void rte_pause(void);
 
+static inline void rte_sevl(void);
+static inline void rte_wfe(void);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic load from addr, it returns the 16-bit content of *addr.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint16_t
+__atomic_load_ex_16(volatile uint16_t *addr, int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic load from addr, it returns the 32-bit content of *addr.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint32_t
+__atomic_load_ex_32(volatile uint32_t *addr, int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic load from addr, it returns the 64-bit content of *addr.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint64_t
+__atomic_load_ex_64(volatile uint64_t *addr, int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 16-bit expected value, with the memory
+ * ordering given by memorder (relaxed or acquire).
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 32-bit expected value, with the memory
+ * ordering given by memorder (relaxed or acquire).
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 64-bit expected value, with the memory
+ * ordering given by memorder (relaxed or acquire).
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder);
+
+#ifdef RTE_ARM_USE_WFE
+#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+#endif
+
+#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+static inline void rte_sevl(void)
+{
+}
+
+static inline void rte_wfe(void)
+{
+	rte_pause();
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic load from addr, it returns the 16-bit content of *addr.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint16_t
+__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
+{
+	uint16_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	tmp = __atomic_load_n(addr, memorder);
+	return tmp;
+}
+
+static __rte_always_inline uint32_t
+__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
+{
+	uint32_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	tmp = __atomic_load_n(addr, memorder);
+	return tmp;
+}
+
+static __rte_always_inline uint64_t
+__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
+{
+	uint64_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	tmp = __atomic_load_n(addr, memorder);
+	return tmp;
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder)
+{
+	if (__atomic_load_ex_16(addr, memorder) != expected) {
+		rte_sevl();
+		do {
+			rte_wfe();
+		} while (__atomic_load_ex_16(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder)
+{
+	if (__atomic_load_ex_32(addr, memorder) != expected) {
+		rte_sevl();
+		do {
+			rte_wfe();
+		} while (__atomic_load_ex_32(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder)
+{
+	if (__atomic_load_ex_64(addr, memorder) != expected) {
+		rte_sevl();
+		do {
+			rte_wfe();
+		} while (__atomic_load_ex_64(addr, memorder) != expected);
+	}
+}
+#endif
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v9 3/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (59 preceding siblings ...)
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 2/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-10-24 10:42 ` Gavin Hu
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 4/5] net/thunderx: use new API to save cycles " Gavin Hu
                   ` (20 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-24 10:42 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

While using ticket lock, cores repeatedly poll the lock variable.
This is replaced by rte_wait_until_equal API.
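
To make the polling pattern concrete, a ticket lock can be sketched as below. This is a simplified, hypothetical version (struct ticketlock, ticket_wait and the lock/unlock helpers are names invented for this example, not the rte_ticketlock code); ticket_wait marks the spot where a wait-until-equal primitive would go.

```c
#include <stdint.h>

/* Hypothetical simplified ticket lock: 'next' is the next ticket to
 * hand out, 'current' is the ticket currently being served. */
struct ticketlock {
	uint16_t current;
	uint16_t next;
};

/* Placeholder for a wait-until-equal primitive: plain acquire polling. */
static inline void
ticket_wait(uint16_t *addr, uint16_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		;
}

static inline void
ticketlock_lock(struct ticketlock *tl)
{
	/* Take a ticket, then wait until it is being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	ticket_wait(&tl->current, me);
}

static inline void
ticketlock_unlock(struct ticketlock *tl)
{
	/* Only the lock holder writes 'current', so a relaxed read is
	 * enough; the release store publishes the critical section. */
	uint16_t cur = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(cur + 1), __ATOMIC_RELEASE);
}
```

Every waiter spins on the same 'current' field, which is why the way that one location is polled matters under contention.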

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1],
there were variances between runs, but no notable performance gain or
degradation was seen between runs with and without this patch.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/\
docs/440/neoverse-n1-sdp

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..c295ae7 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v9 4/5] net/thunderx: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (60 preceding siblings ...)
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 3/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-10-24 10:42 ` " Gavin Hu
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 5/5] event/opdl: " Gavin Hu
                   ` (19 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-24 10:42 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/net/thunderx/Makefile     | 1 +
 drivers/net/thunderx/meson.build  | 1 +
 drivers/net/thunderx/nicvf_rxtx.c | 3 +--
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/thunderx/Makefile b/drivers/net/thunderx/Makefile
index e6bf497..9e0de10 100644
--- a/drivers/net/thunderx/Makefile
+++ b/drivers/net/thunderx/Makefile
@@ -10,6 +10,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_pmd_thunderx_nicvf.a
 
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 
 LDLIBS += -lm
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
diff --git a/drivers/net/thunderx/meson.build b/drivers/net/thunderx/meson.build
index 69819a9..23d9458 100644
--- a/drivers/net/thunderx/meson.build
+++ b/drivers/net/thunderx/meson.build
@@ -4,6 +4,7 @@
 subdir('base')
 objs = [base_objs]
 
+allow_experimental_apis = true
 sources = files('nicvf_rxtx.c',
 		'nicvf_ethdev.c',
 		'nicvf_svf.c'
diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
index 1c42874..90a6098 100644
--- a/drivers/net/thunderx/nicvf_rxtx.c
+++ b/drivers/net/thunderx/nicvf_rxtx.c
@@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
 		ltail++;
 	}
 
-	while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
-		rte_pause();
+	rte_wait_until_equal_32(&rbdr->tail, next_tail, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
 	nicvf_addr_write(door, to_fill);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v9 5/5] event/opdl: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (61 preceding siblings ...)
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 4/5] net/thunderx: use new API to save cycles " Gavin Hu
@ 2019-10-24 10:42 ` " Gavin Hu
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 0/5] use WFE for aarch64 Gavin Hu
                   ` (18 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-24 10:42 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.
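
The wait being replaced follows a common multi-producer hand-off pattern, sketched below under simplifying assumptions: struct ring_tail and publish_entries are hypothetical names for this example, and the busy loop stands in for the wait-until-equal call.

```c
#include <stdint.h>

/* Hypothetical reduced ring state: only the shared tail index. */
struct ring_tail {
	uint32_t tail;
};

/* A producer that reserved [old_head, old_head + num_entries) must wait
 * for earlier producers to publish their entries (tail == old_head)
 * before advancing the tail past its own entries. */
static inline void
publish_entries(struct ring_tail *r, uint32_t old_head, uint32_t num_entries)
{
	/* Placeholder for the wait-until-equal primitive. */
	while (__atomic_load_n(&r->tail, __ATOMIC_ACQUIRE) != old_head)
		;

	/* Release store makes the written entries visible to consumers. */
	__atomic_store_n(&r->tail, old_head + num_entries, __ATOMIC_RELEASE);
}
```

Producers publish strictly in reservation order, so each one waits for exactly one value of the tail.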

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/event/opdl/Makefile    | 1 +
 drivers/event/opdl/meson.build | 1 +
 drivers/event/opdl/opdl_ring.c | 5 ++---
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/event/opdl/Makefile b/drivers/event/opdl/Makefile
index bf50a60..72ef07d 100644
--- a/drivers/event/opdl/Makefile
+++ b/drivers/event/opdl/Makefile
@@ -9,6 +9,7 @@ LIB = librte_pmd_opdl_event.a
 # build flags
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 # for older GCC versions, allow us to initialize an event using
 # designated initializers.
 ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
diff --git a/drivers/event/opdl/meson.build b/drivers/event/opdl/meson.build
index 1fe034e..e67b164 100644
--- a/drivers/event/opdl/meson.build
+++ b/drivers/event/opdl/meson.build
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2018 Luca Boccassi <bluca@debian.org>
 
+allow_experimental_apis = true
 sources = files(
 	'opdl_evdev.c',
 	'opdl_evdev_init.c',
diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
index 06fb5b3..c8d19fe 100644
--- a/drivers/event/opdl/opdl_ring.c
+++ b/drivers/event/opdl/opdl_ring.c
@@ -16,6 +16,7 @@
 #include <rte_memcpy.h>
 #include <rte_memory.h>
 #include <rte_memzone.h>
+#include <rte_atomic.h>
 
 #include "opdl_ring.h"
 #include "opdl_log.h"
@@ -474,9 +475,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
 	/* If another thread started inputting before this one, but hasn't
 	 * finished, we need to wait for it to complete to update the tail.
 	 */
-	while (unlikely(__atomic_load_n(&s->shared.tail, __ATOMIC_ACQUIRE) !=
-			old_head))
-		rte_pause();
+	rte_wait_until_equal_32(&s->shared.tail, old_head, __ATOMIC_ACQUIRE);
 
 	__atomic_store_n(&s->shared.tail, old_head + num_entries,
 			__ATOMIC_RELEASE);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
  2019-10-24 10:21         ` Ananyev, Konstantin
@ 2019-10-24 10:52           ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-24 10:52 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: nd, thomas, stephen, hemant.agrawal, jerinj, pbhagavatula,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd, nd, nd

Hi Konstantin, 

> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Thursday, October 24, 2019 6:21 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; hemant.agrawal@nxp.com;
> jerinj@marvell.com; pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>; nd
> <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v7 2/7] eal: add the APIs to wait until equal
> 
> 
> 
> Hi Gavin,
> > > > > > The rte_wait_until_equal_xx APIs abstract the functionality of
> > > > > > 'polling for a memory location to become equal to a given value'.
> > > > > >
> > > > > > Add the RTE_ARM_USE_WFE configuration entry for aarch64,
> disabled
> > > > > > by default. When it is enabled, the above APIs will call WFE
> instruction
> > > > > > to save CPU cycles and power.
> > > > > >
> > > > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > > Reviewed-by: Honnappa Nagarahalli
> > > <honnappa.nagarahalli@arm.com>
> > > > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > > > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > > > > > ---
> > > > > >  config/arm/meson.build                             |   1 +
> > > > > >  config/common_base                                 |   5 +
> > > > > >  .../common/include/arch/arm/rte_pause_64.h         |  30 ++++++
> > > > > >  lib/librte_eal/common/include/generic/rte_pause.h  | 106
> > > > +++++++++++++++++++++
> > > > > >  4 files changed, 142 insertions(+)
> > > > > >
> > > > > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > > > > index 979018e..b4b4cac 100644
> > > > > > --- a/config/arm/meson.build
> > > > > > +++ b/config/arm/meson.build
> > > > > > @@ -26,6 +26,7 @@ flags_common_default = [
> > > > > >  	['RTE_LIBRTE_AVP_PMD', false],
> > > > > >
> > > > > >  	['RTE_SCHED_VECTOR', false],
> > > > > > +	['RTE_ARM_USE_WFE', false],
> > > > > >  ]
> > > > > >
> > > > > >  flags_generic = [
> > > > > > diff --git a/config/common_base b/config/common_base
> > > > > > index 8ef75c2..8861713 100644
> > > > > > --- a/config/common_base
> > > > > > +++ b/config/common_base
> > > > > > @@ -111,6 +111,11 @@
> CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> > > > > >  CONFIG_RTE_MALLOC_DEBUG=n
> > > > > >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> > > > > >  CONFIG_RTE_USE_LIBBSD=n
> > > > > > > +# Use WFE instructions to implement the rte_wait_until_equal_xxx
> APIs,
> > > > > > > +# calling these APIs puts the cores in low power state while waiting
> > > > > > +# for the memory address to become equal to the expected
> value.
> > > > > > +# This is supported only by aarch64.
> > > > > > +CONFIG_RTE_ARM_USE_WFE=n
> > > > > >
> > > > > >  #
> > > > > >  # Recognize/ignore the AVX/AVX512 CPU flags for
> performance/power
> > > > testing.
> > > > > > diff --git
> a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > > index 93895d3..dabde17 100644
> > > > > > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > > > > > @@ -1,5 +1,6 @@
> > > > > >  /* SPDX-License-Identifier: BSD-3-Clause
> > > > > >   * Copyright(c) 2017 Cavium, Inc
> > > > > > + * Copyright(c) 2019 Arm Limited
> > > > > >   */
> > > > > >
> > > > > >  #ifndef _RTE_PAUSE_ARM64_H_
> > > > > > @@ -17,6 +18,35 @@ static inline void rte_pause(void)
> > > > > >  	asm volatile("yield" ::: "memory");
> > > > > >  }
> > > > > >
> > > > > > +#ifdef RTE_ARM_USE_WFE
> > > > > > +#define __WAIT_UNTIL_EQUAL(name, asm_op, wide, type) \
> > > > > > +static __rte_always_inline void \
> > > > > > +rte_wait_until_equal_##name(volatile type * addr, type
> expected) \
> > > > > > +{ \
> > > > > > +	type tmp; \
> > > > > > +	asm volatile( \
> > > > > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > > > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n"
> \
> > > > > > +		"b.eq	2f\n" \
> > > > > > +		"sevl\n" \
> > > > > > +		"1:	wfe\n" \
> > > > > > +		#asm_op " %" #wide "[tmp], %[addr]\n" \
> > > > > > +		"cmp	%" #wide "[tmp], %" #wide "[expected]\n"
> \
> > > > > > +		"bne	1b\n" \
> > > > > > +		"2:\n" \
> > > > > > +		: [tmp] "=&r" (tmp) \
> > > > > > +		: [addr] "Q"(*addr), [expected] "r"(expected) \
> > > > > > +		: "cc", "memory"); \
> > > > > > +}
> > > >
> > > > One more thought:
> > > > Why do you need to write asm code for the whole procedure?
> > > > Why not to do like linux kernel:
> > > > define wfe() and sev() macros and use them inside normal C code?
> > > >
> > > > #define sev()		asm volatile("sev" : : : "memory")
> > > > #define wfe()		asm volatile("wfe" : : : "memory")
> > > >
> > > > Then:
> > > > rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> int
> > > > memorder)
> > > > {
> > > >      if (__atomic_load_n(addr, memorder) != expected) {
> > > >          sev();
> > > >          do {
> > > >              wfe();
> > > > >          } while (__atomic_load_n(addr, memorder) != expected);
> > > >      }
> > > > }
> > > >
> > > > ?
> > > > A really good suggestion. I already made the corresponding changes in v8,
> > > > but after internal discussion we found that it missed an armv8-specific
> > > > feature.
> > > > We call wfe to wait/sleep on the 'monitored' address; the core is woken up
> > > > when someone writes to the monitored address, so before wfe we have to
> > > > issue a load-exclusive instruction to arm the monitor.
> > > > __atomic_load_n, which is compiled to "ldr", does not do so. We have to use
> > > > "ldxrh" for relaxed memory ordering and "ldaxrh" for acquire ordering,
> > > > taking the 16-bit case as an example.
> 
> Didn't realize that, sorry for confusion caused...
Your comments are really helpful! Although we missed this point, they still helped get the patches into better shape (I personally like the new v9 more than v7 😊). Much appreciated, thanks!
/Gavin
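
To make the point about the monitor concrete, here is a minimal sketch of the load-exclusive plus wfe sequence discussed above. It is not the code from the patch: the sketch_ names are hypothetical, and on non-aarch64 targets it falls back to a plain atomic load and a CPU-relax hint, which is an assumption made only so the example builds elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Load *addr; on aarch64 use a load-exclusive so the monitor is armed. */
static inline uint32_t
sketch_load_ex_32(volatile uint32_t *addr, int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
#if defined(__aarch64__)
	uint32_t tmp;

	if (memorder == __ATOMIC_ACQUIRE)
		__asm__ volatile("ldaxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp)
			: [addr] "r" (addr)
			: "memory");
	else
		__asm__ volatile("ldxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp)
			: [addr] "r" (addr)
			: "memory");
	return tmp;
#else
	/* Fallback assumption: no monitor, just an ordinary atomic load. */
	return __atomic_load_n(addr, memorder);
#endif
}

/* Set the local event register so the first wfe does not block. */
static inline void sketch_sevl(void)
{
#if defined(__aarch64__)
	__asm__ volatile("sevl" ::: "memory");
#endif
}

/* Wait for an event, e.g. the monitored location being written. */
static inline void sketch_wfe(void)
{
#if defined(__aarch64__)
	__asm__ volatile("wfe" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

static inline void
sketch_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	if (sketch_load_ex_32(addr, memorder) != expected) {
		sketch_sevl();
		do {
			sketch_wfe();
		} while (sketch_load_ex_32(addr, memorder) != expected);
	}
}
```

The first wfe consumes the event set by sevl and returns at once; the load-exclusive that follows arms the monitor, so a later wfe can sleep until the location is written.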
> 
> > >
> > > Let me re-think coming back to the full assembly procedure or
> implementing
> > > a 'load-exclusive' function. What do you think?
> 
> After some thought I am leaning towards 'load-exclusive' function -
> Hopefully it would help you avoid raw asm here and in other places.
> What do you think?
> Konstantin
Yes, I implemented a 'load-exclusive' function in v9, please review, thanks!
For now I did not give it the 'rte_' prefix, as it is not used anywhere other than the rte_wait_until_equal APIs.
Any further comments are welcome!
/Gavin
> 
> > > /Gavin
> > Forgot to mention, kernel uses wfe() without preceding load-exclusive
> instructions because:
> > 1) it relies on the timer to wake up, i.e. __delay()
> > 2) explicit calling sev to send wake events, for all kinds of locks
> > 3) IPI instructions.
> >
> > Our patches can't count on these events, due to the lack of these events or
> their performance impact.
> > /Gavin
> > >
> > > > > > +/* Wait for *addr to be updated with expected value */
> > > > > > +__WAIT_UNTIL_EQUAL(relaxed_16, ldxrh, w, uint16_t)
> > > > > > +__WAIT_UNTIL_EQUAL(acquire_16, ldaxrh, w, uint16_t)
> > > > > > +__WAIT_UNTIL_EQUAL(relaxed_32, ldxr, w, uint32_t)
> > > > > > +__WAIT_UNTIL_EQUAL(acquire_32, ldaxr, w, uint32_t)
> > > > > > +__WAIT_UNTIL_EQUAL(relaxed_64, ldxr, x, uint64_t)
> > > > > > +__WAIT_UNTIL_EQUAL(acquire_64, ldaxr, x, uint64_t)
> > > > > > +#endif
> > > > > > +
> > > > > >  #ifdef __cplusplus
> > > > > >  }
> > > > > >  #endif

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v9 2/5] eal: add the APIs to wait until equal
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 2/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-10-24 13:52   ` Ananyev, Konstantin
  2019-10-24 13:57     ` Ananyev, Konstantin
  2019-10-24 17:00     ` Gavin Hu (Arm Technology China)
  0 siblings, 2 replies; 163+ messages in thread
From: Ananyev, Konstantin @ 2019-10-24 13:52 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, david.marchand, thomas, stephen, hemant.agrawal, jerinj,
	pbhagavatula, Honnappa.Nagarahalli, ruifeng.wang, phil.yang,
	steve.capper

Hi Gavin,

> The rte_wait_until_equal_xx APIs abstract the functionality of
> 'polling for a memory location to become equal to a given value'.
> 
> Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> by default. When it is enabled, the above APIs will call WFE instruction
> to save CPU cycles and power.
> 
> When this API is called from a VM on aarch64, it may trap in and out to
> release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
> adaptive trapping mechanism has been introduced to balance the latency and
> workload.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
>  config/arm/meson.build                             |   1 +
>  config/common_base                                 |   5 +
>  .../common/include/arch/arm/rte_pause_64.h         |  70 +++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  | 217 +++++++++++++++++++++
>  4 files changed, 293 insertions(+)
> 
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..b4b4cac 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -26,6 +26,7 @@ flags_common_default = [
>  	['RTE_LIBRTE_AVP_PMD', false],
> 
>  	['RTE_SCHED_VECTOR', false],
> +	['RTE_ARM_USE_WFE', false],
>  ]
> 
>  flags_generic = [
> diff --git a/config/common_base b/config/common_base
> index e843a21..c812156 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
>  CONFIG_RTE_MALLOC_DEBUG=n
>  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
>  CONFIG_RTE_USE_LIBBSD=n
> +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
> +# calling these APIs puts the cores in low power state while waiting
> +# for the memory address to become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_ARM_USE_WFE=n
> 
>  #
>  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..7bc8efb 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
> 
>  #ifndef _RTE_PAUSE_ARM64_H_
> @@ -17,6 +18,75 @@ static inline void rte_pause(void)
>  	asm volatile("yield" ::: "memory");
>  }
> 
> +#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +static inline void rte_sevl(void)
> +{
> +	asm volatile("sevl" : : : "memory");
> +}
> +
> +static inline void rte_wfe(void)
> +{
> +	asm volatile("wfe" : : : "memory");
> +}
> +
> +static __rte_always_inline uint16_t
> +__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> +{
> +	uint16_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	if (memorder == __ATOMIC_ACQUIRE)
> +		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	else if (memorder == __ATOMIC_RELAXED)
> +		asm volatile("ldxrh %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	return tmp;
> +}
> +
> +static __rte_always_inline uint32_t
> +__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> +{
> +	uint32_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	if (memorder == __ATOMIC_ACQUIRE)
> +		asm volatile("ldaxr %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	else if (memorder == __ATOMIC_RELAXED)
> +		asm volatile("ldxr %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	return tmp;
> +}
> +
> +static __rte_always_inline uint64_t
> +__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> +{
> +	uint64_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	if (memorder == __ATOMIC_ACQUIRE)
> +		asm volatile("ldaxr %x[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	else if (memorder == __ATOMIC_RELAXED)
> +		asm volatile("ldxr %x[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	return tmp;
> +}
> +#endif
> +

The functions themselves seem good to me...
But I think there was some misunderstanding about code layout/placement.
I think arm-specific functions and defines need to be defined in arm-specific headers only.
But we can still have one instance of rte_wait_until_equal_* for arm.

To be more specific, I am talking about something like that here:

lib/librte_eal/common/include/generic/rte_pause.h:
...
#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
static __rte_always_inline void
rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		rte_pause();
}
....
#endif
...

lib/librte_eal/common/include/arch/arm/rte_pause_64.h:

...
#ifdef RTE_ARM_USE_WFE 
#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
#endif
#include "generic/rte_pause.h"

...
#ifdef RTE_ARM_USE_WFE
static inline void rte_sevl(void)
{
	asm volatile("sevl" : : : "memory");
}
static inline void rte_wfe(void)
{
	asm volatile("wfe" : : : "memory");
}
#else
static inline void rte_sevl(void)
{
}
static inline void rte_wfe(void)
{
	rte_pause();
}
...

static __rte_always_inline void
rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	if (__atomic_load_ex_32(addr, memorder) != expected) {
		rte_sevl();
		do {
			rte_wfe();
		} while (__atomic_load_ex_32(addr, memorder) != expected);
	}
}

#endif
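
The layout described above can be sketched as a single translation unit, with comments marking which part would live in which header. This is an illustration under stated assumptions, not DPDK code: every layout_ and LAYOUT_ name is hypothetical, and LAYOUT_USE_WFE stands in for the RTE_ARM_USE_WFE build option.

```c
#include <stdint.h>

/* ---- arch header: decide whether it provides its own wait routine ---- */
#if defined(__aarch64__) && defined(LAYOUT_USE_WFE)
#define LAYOUT_WAIT_UNTIL_EQUAL_ARCH_DEFINED
#endif

static inline void layout_pause(void)
{
#if defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

/* ---- generic header: fallback, compiled only if the arch has none ---- */
#ifndef LAYOUT_WAIT_UNTIL_EQUAL_ARCH_DEFINED
static inline void
layout_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		layout_pause();
}
#endif

/* ---- arch header again: the WFE-based variant and its helpers ---- */
#ifdef LAYOUT_WAIT_UNTIL_EQUAL_ARCH_DEFINED
static inline uint32_t
layout_load_ex_32(volatile uint32_t *addr, int memorder)
{
	uint32_t tmp;

	if (memorder == __ATOMIC_ACQUIRE)
		__asm__ volatile("ldaxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp) : [addr] "r" (addr) : "memory");
	else
		__asm__ volatile("ldxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp) : [addr] "r" (addr) : "memory");
	return tmp;
}

static inline void
layout_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	if (layout_load_ex_32(addr, memorder) != expected) {
		__asm__ volatile("sevl" ::: "memory");
		do {
			__asm__ volatile("wfe" ::: "memory");
		} while (layout_load_ex_32(addr, memorder) != expected);
	}
}
#endif
```

With this arrangement exactly one definition of the wait routine is visible to callers, and the arm-only helpers never appear in the generic header.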


>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> index 52bd4db..4db44f9 100644
> --- a/lib/librte_eal/common/include/generic/rte_pause.h
> +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
> 
>  #ifndef _RTE_PAUSE_H_
> @@ -12,6 +13,12 @@
>   *
>   */
> 
> +#include <stdint.h>
> +#include <rte_common.h>
> +#include <rte_atomic.h>
> +#include <rte_compat.h>
> +#include <assert.h>
> +
>  /**
>   * Pause CPU execution for a short while
>   *
> @@ -20,4 +27,214 @@
>   */
>  static inline void rte_pause(void);
> 
> +static inline void rte_sevl(void);
> +static inline void rte_wfe(void);
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic load from addr, it returns the 16-bit content of *addr.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  or the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint16_t
> +__atomic_load_ex_16(volatile uint16_t *addr, int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic load from addr, it returns the 32-bit content of *addr.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  or the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint32_t
> +__atomic_load_ex_32(volatile uint32_t *addr, int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic load from addr, it returns the 64-bit content of *addr.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  or the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint64_t
> +__atomic_load_ex_64(volatile uint64_t *addr, int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 16-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 32-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 64-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> +int memorder);
> +
> +#ifdef RTE_ARM_USE_WFE
> +#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +#endif
> +
> +#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +static inline void rte_sevl(void)
> +{
> +}
> +
> +static inline void rte_wfe(void)
> +{
> +	rte_pause();
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic load from addr, it returns the 16-bit content of *addr.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  or the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint16_t
> +__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> +{
> +	uint16_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	tmp = __atomic_load_n(addr, memorder);
> +	return tmp;
> +}
> +
> +static __rte_always_inline uint32_t
> +__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> +{
> +	uint32_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	tmp = __atomic_load_n(addr, memorder);
> +	return tmp;
> +}
> +
> +static __rte_always_inline uint64_t
> +__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> +{
> +	uint64_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	tmp = __atomic_load_n(addr, memorder);
> +	return tmp;
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_ex_16(addr, memorder) != expected) {
> +		rte_sevl();
> +		do {
> +			rte_wfe();
> +		} while (__atomic_load_ex_16(addr, memorder) != expected);
> +	}
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_ex_32(addr, memorder) != expected) {
> +		rte_sevl();
> +		do {
> +			rte_wfe();
> +		} while (__atomic_load_ex_32(addr, memorder) != expected);
> +	}
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_ex_64(addr, memorder) != expected) {
> +		rte_sevl();
> +		do {
> +			rte_wfe();
> +		} while (__atomic_load_ex_64(addr, memorder) != expected);
> +	}
> +}
> +#endif
> +
>  #endif /* _RTE_PAUSE_H_ */
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v9 2/5] eal: add the APIs to wait until equal
  2019-10-24 13:52   ` Ananyev, Konstantin
@ 2019-10-24 13:57     ` Ananyev, Konstantin
  2019-10-24 17:00     ` Gavin Hu (Arm Technology China)
  1 sibling, 0 replies; 163+ messages in thread
From: Ananyev, Konstantin @ 2019-10-24 13:57 UTC (permalink / raw)
  To: Ananyev, Konstantin, Gavin Hu, dev
  Cc: nd, david.marchand, thomas, stephen, hemant.agrawal, jerinj,
	pbhagavatula, Honnappa.Nagarahalli, ruifeng.wang, phil.yang,
	steve.capper




> 
> Hi Gavin,
> 
> > The rte_wait_until_equal_xx APIs abstract the functionality of
> > 'polling for a memory location to become equal to a given value'.
> >
> > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > by default. When it is enabled, the above APIs will call WFE instruction
> > to save CPU cycles and power.
> >
> > When this API is called from a VM on aarch64, it may trap in and out to
> > release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
> > adaptive trapping mechanism has been introduced to balance the latency and
> > workload.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > Acked-by: Jerin Jacob <jerinj@marvell.com>
> > ---
> >  config/arm/meson.build                             |   1 +
> >  config/common_base                                 |   5 +
> >  .../common/include/arch/arm/rte_pause_64.h         |  70 +++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  | 217 +++++++++++++++++++++
> >  4 files changed, 293 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 979018e..b4b4cac 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -26,6 +26,7 @@ flags_common_default = [
> >  	['RTE_LIBRTE_AVP_PMD', false],
> >
> >  	['RTE_SCHED_VECTOR', false],
> > +	['RTE_ARM_USE_WFE', false],
> >  ]
> >
> >  flags_generic = [
> > diff --git a/config/common_base b/config/common_base
> > index e843a21..c812156 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> >  CONFIG_RTE_MALLOC_DEBUG=n
> >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> >  CONFIG_RTE_USE_LIBBSD=n
> > +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
> > +# calling these APIs puts the cores in low power state while waiting
> > +# for the memory address to become equal to the expected value.
> > +# This is supported only by aarch64.
> > +CONFIG_RTE_ARM_USE_WFE=n
> >
> >  #
> >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..7bc8efb 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_ARM64_H_
> > @@ -17,6 +18,75 @@ static inline void rte_pause(void)
> >  	asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +static inline void rte_sevl(void)
> > +{
> > +	asm volatile("sevl" : : : "memory");
> > +}
> > +
> > +static inline void rte_wfe(void)
> > +{
> > +	asm volatile("wfe" : : : "memory");
> > +}
> > +
> > +static __rte_always_inline uint16_t
> > +__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> > +{
> > +	uint16_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	if (memorder == __ATOMIC_ACQUIRE)
> > +		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	else if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile("ldxrh %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint32_t
> > +__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> > +{
> > +	uint32_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	if (memorder == __ATOMIC_ACQUIRE)
> > +		asm volatile("ldaxr %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	else if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile("ldxr %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint64_t
> > +__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> > +{
> > +	uint64_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	if (memorder == __ATOMIC_ACQUIRE)
> > +		asm volatile("ldaxr %x[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	else if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile("ldxr %x[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	return tmp;
> > +}
> > +#endif
> > +
> 
> The functions themselves seem good to me...
> But I think there was some misunderstanding about code layout/placement.
> I think arm-specific functions and defines need to be defined in arm-specific headers only.
> But we can still have one instance of rte_wait_until_equal_* for arm.
> 
> To be more specific, I am talking about something like that here:
> 
> lib/librte_eal/common/include/generic/rte_pause.h:
> ...
> #ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> static __rte_always_inline void
> rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
> {
> 	while (__atomic_load_n(addr, memorder) != expected)
> 		rte_pause();
> }
> ....
> #endif
> ...
> 
> lib/librte_eal/common/include/arch/arm/rte_pause_64.h:
> 
> ...
> #ifdef RTE_ARM_USE_WFE
> #define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> #endif
> #include "generic/rte_pause.h"
> 
> ...
> #ifdef RTE_ARM_USE_WFE
> static inline void rte_sevl(void)
> {
> 	asm volatile("sevl" : : : "memory");
> }
> static inline void rte_wfe(void)
> {
> 	asm volatile("wfe" : : : "memory");
> }
> #else
> static inline void rte_sevl(void)
> {
> }
> static inline void rte_wfe(void)
> {
> 	rte_pause();
> }
> ...
> 
> static __rte_always_inline void
> rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
> {
> 	if (__atomic_load_ex_32(addr, memorder) != expected) {
> 		rte_sevl();
> 		do {
> 			rte_wfe();
> 		} while (__atomic_load_ex_32(addr, memorder) != expected);
> 	}
> }

One more nit (nearly forgot): I think it is better to have an rte_ (or __rte_) prefix for all
functions defined in public files, so: __rte_atomic_load_ex_32() or just rte_atomic_load_ex_32().

> 
> #endif
> 
> 
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..4db44f9 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_H_
> > @@ -12,6 +13,12 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +#include <rte_compat.h>
> > +#include <assert.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +27,214 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +static inline void rte_sevl(void);
> > +static inline void rte_wfe(void);
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic load from addr, it returns the 16-bit content of *addr.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint16_t
> > +__atomic_load_ex_16(volatile uint16_t *addr, int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic load from addr, it returns the 32-bit content of *addr.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint32_t
> > +__atomic_load_ex_32(volatile uint32_t *addr, int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic load from addr, it returns the 64-bit content of *addr.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint64_t
> > +__atomic_load_ex_64(volatile uint64_t *addr, int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > +int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> > +int memorder);
> > +
> > +#ifdef RTE_ARM_USE_WFE
> > +#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +#endif
> > +
> > +#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +static inline void rte_sevl(void)
> > +{
> > +}
> > +
> > +static inline void rte_wfe(void)
> > +{
> > +	rte_pause();
> > +}
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic load from addr, it returns the 16-bit content of *addr.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint16_t
> > +__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> > +{
> > +	uint16_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	tmp = __atomic_load_n(addr, memorder);
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint32_t
> > +__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> > +{
> > +	uint32_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	tmp = __atomic_load_n(addr, memorder);
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint64_t
> > +__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> > +{
> > +	uint64_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	tmp = __atomic_load_n(addr, memorder);
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +int memorder)
> > +{
> > +	if (__atomic_load_n(addr, memorder) != expected) {
> > +		rte_sevl();
> > +		do {
> > +			rte_wfe();
> > +		} while (__atomic_load_ex_16(addr, memorder) != expected);
> > +	}
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > +int memorder)
> > +{
> > +	if (__atomic_load_ex_32(addr, memorder) != expected) {
> > +		rte_sevl();
> > +		do {
> > +			rte_wfe();
> > +		} while (__atomic_load_ex_32(addr, memorder) != expected);
> > +	}
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> > +int memorder)
> > +{
> > +	if (__atomic_load_ex_64(addr, memorder) != expected) {
> > +		rte_sevl();
> > +		do {
> > +			rte_wfe();
> > +		} while (__atomic_load_ex_64(addr, memorder) != expected);
> > +	}
> > +}
> > +#endif
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v9 2/5] eal: add the APIs to wait until equal
  2019-10-24 13:52   ` Ananyev, Konstantin
  2019-10-24 13:57     ` Ananyev, Konstantin
@ 2019-10-24 17:00     ` Gavin Hu (Arm Technology China)
  1 sibling, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-24 17:00 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: nd, david.marchand, thomas, stephen, hemant.agrawal, jerinj,
	pbhagavatula, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	nd

Hi Konstantin,

> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Thursday, October 24, 2019 9:52 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org
> Cc: nd <nd@arm.com>; david.marchand@redhat.com;
> thomas@monjalon.net; stephen@networkplumber.org;
> hemant.agrawal@nxp.com; jerinj@marvell.com;
> pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: RE: [PATCH v9 2/5] eal: add the APIs to wait until equal
> 
> Hi Gavin,
> 
> > The rte_wait_until_equal_xx APIs abstract the functionality of
> > 'polling for a memory location to become equal to a given value'.
> >
> > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > by default. When it is enabled, the above APIs will call WFE instruction
> > to save CPU cycles and power.
> >
> > When called from a VM on aarch64, this API may trap in and out to
> > release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
> > adaptive trapping mechanism has been introduced to balance latency and
> > workload.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > Acked-by: Jerin Jacob <jerinj@marvell.com>
> > ---
> >  config/arm/meson.build                             |   1 +
> >  config/common_base                                 |   5 +
> >  .../common/include/arch/arm/rte_pause_64.h         |  70 +++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  | 217 +++++++++++++++++++++
> >  4 files changed, 293 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 979018e..b4b4cac 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -26,6 +26,7 @@ flags_common_default = [
> >  	['RTE_LIBRTE_AVP_PMD', false],
> >
> >  	['RTE_SCHED_VECTOR', false],
> > +	['RTE_ARM_USE_WFE', false],
> >  ]
> >
> >  flags_generic = [
> > diff --git a/config/common_base b/config/common_base
> > index e843a21..c812156 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> >  CONFIG_RTE_MALLOC_DEBUG=n
> >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> >  CONFIG_RTE_USE_LIBBSD=n
> > +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs;
> > +# calling these APIs puts the cores in a low power state while waiting
> > +# for the memory address to become equal to the expected value.
> > +# This is supported only by aarch64.
> > +CONFIG_RTE_ARM_USE_WFE=n
> >
> >  #
> >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..7bc8efb 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_ARM64_H_
> > @@ -17,6 +18,75 @@ static inline void rte_pause(void)
> >  	asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +static inline void rte_sevl(void)
> > +{
> > +	asm volatile("sevl" : : : "memory");
> > +}
> > +
> > +static inline void rte_wfe(void)
> > +{
> > +	asm volatile("wfe" : : : "memory");
> > +}
> > +
> > +static __rte_always_inline uint16_t
> > +__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> > +{
> > +	uint16_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	if (memorder == __ATOMIC_ACQUIRE)
> > +		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	else if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile("ldxrh %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint32_t
> > +__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> > +{
> > +	uint32_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	if (memorder == __ATOMIC_ACQUIRE)
> > +		asm volatile("ldaxr %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	else if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile("ldxr %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint64_t
> > +__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> > +{
> > +	uint64_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	if (memorder == __ATOMIC_ACQUIRE)
> > +		asm volatile("ldaxr %x[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	else if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile("ldxr %x[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	return tmp;
> > +}
> > +#endif
> > +
> 
> The functions themselves seem good to me...
> But I think there was some misunderstanding about code layout/placement.
> I think arm specific functions and defines need to be defined in arm specific
> headers only.
> But we can still have one instance of rte_wait_until_equal_* for arm.
I will move that part to arm specific headers. 
/Gavin
> 
> To be more specific, I am talking about something like that here:
> 
> lib/librte_eal/common/include/generic/rte_pause.h:
> ...
> #ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> static __rte_always_inline void
> rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
> {
> 	while (__atomic_load_n(addr, memorder) != expected)
> 		rte_pause();
> }
> ....
> #endif
> ...
> 
> lib/librte_eal/common/include/arch/arm/rte_pause_64.h:
> 
> ...
> #ifdef RTE_ARM_USE_WFE
> #define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> #endif
> #include "generic/rte_pause.h"
> 
> ...
> #ifdef RTE_ARM_USE_WFE
> static inline void rte_sevl(void)
> {
> 	asm volatile("sevl" : : : "memory");
> }
> static inline void rte_wfe(void)
> {
> 	asm volatile("wfe" : : : "memory");
> }
> #else
> static inline void rte_sevl(void)
> {
> }
> static inline void rte_wfe(void)
> {
> 	rte_pause();
> }
Should doxygen comments be added for these arm specific APIs, including the rte_load_ex_xxx APIs?
These APIs are arm specific and not intended to be exposed, but they live in public files (are arm specific headers considered public?).
/Gavin
> ...
> 
> static __rte_always_inline void
> rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
> {
> 	if (__atomic_load_ex_32(addr, memorder) != expected) {
> 		rte_sevl();
> 		do {
> 			rte_wfe();
> 		} while (__atomic_load_ex_32(addr, memorder) != expected);
> 	}
> }
> 
> #endif
> 
> 
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..4db44f9 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_H_
> > @@ -12,6 +13,12 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +#include <rte_compat.h>
> > +#include <assert.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +27,214 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +static inline void rte_sevl(void);
> > +static inline void rte_wfe(void);
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic load from addr, it returns the 16-bit content of *addr.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint16_t
> > +__atomic_load_ex_16(volatile uint16_t *addr, int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic load from addr, it returns the 32-bit content of *addr.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint32_t
> > +__atomic_load_ex_32(volatile uint32_t *addr, int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic load from addr, it returns the 64-bit content of *addr.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint64_t
> > +__atomic_load_ex_64(volatile uint64_t *addr, int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > +int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> > +int memorder);
> > +
> > +#ifdef RTE_ARM_USE_WFE
> > +#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +#endif
> > +
> > +#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +static inline void rte_sevl(void)
> > +{
> > +}
> > +
> > +static inline void rte_wfe(void)
> > +{
> > +	rte_pause();
> > +}
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic load from addr, it returns the 16-bit content of *addr.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint16_t
> > +__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> > +{
> > +	uint16_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	tmp = __atomic_load_n(addr, memorder);
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint32_t
> > +__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> > +{
> > +	uint32_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	tmp = __atomic_load_n(addr, memorder);
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint64_t
> > +__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> > +{
> > +	uint64_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	tmp = __atomic_load_n(addr, memorder);
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +int memorder)
> > +{
> > +	if (__atomic_load_n(addr, memorder) != expected) {
> > +		rte_sevl();
> > +		do {
> > +			rte_wfe();
> > +		} while (__atomic_load_ex_16(addr, memorder) != expected);
> > +	}
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > +int memorder)
> > +{
> > +	if (__atomic_load_ex_32(addr, memorder) != expected) {
> > +		rte_sevl();
> > +		do {
> > +			rte_wfe();
> > +		} while (__atomic_load_ex_32(addr, memorder) != expected);
> > +	}
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> > +int memorder)
> > +{
> > +	if (__atomic_load_ex_64(addr, memorder) != expected) {
> > +		rte_sevl();
> > +		do {
> > +			rte_wfe();
> > +		} while (__atomic_load_ex_64(addr, memorder) != expected);
> > +	}
> > +}
> > +#endif
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v4 0/6] use WFE for locks and ring on aarch64
  2019-10-16  8:08   ` David Marchand
@ 2019-10-24 20:26     ` David Christensen
  0 siblings, 0 replies; 163+ messages in thread
From: David Christensen @ 2019-10-24 20:26 UTC (permalink / raw)
  To: David Marchand, Bruce Richardson, Ananyev, Konstantin
  Cc: dev, Gavin Hu, nd, Thomas Monjalon, Stephen Hemminger,
	Jerin Jacob Kollanukkaran, Pavan Nikhilesh, Honnappa Nagarahalli

> This series got a lot of attention from ARM people and it seems ready
> for integration.
> But I did not see comment from other architectures, could you have a
> look please?

I spent some time going through the Power ISA specification and the 
Linux code and didn't find an equivalent.  Under Linux this looks like
a __cmpwait_case_XX operation but that's only defined for arm64 and used 
in barrier operations.

Dave
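For architectures with no WFE-like instruction, the generic variant discussed in this thread is expected to reduce to a plain polling loop around a pause hint. A minimal sketch of that fallback, assuming the GCC/Clang __atomic builtins; the demo_* names are hypothetical stand-ins and not DPDK APIs, and the pause hint for architectures other than x86 and aarch64 is reduced to a compiler barrier here:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): a CPU hint only, it never blocks. */
static inline void demo_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ __volatile__("yield" ::: "memory");
#else
	/* Assumption: no dedicated hint, fall back to a compiler barrier. */
	__asm__ __volatile__("" ::: "memory");
#endif
}

/* Poll *addr until it equals expected; memorder selects the load ordering. */
static inline void
demo_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);

	while (__atomic_load_n(addr, memorder) != expected)
		demo_pause();
}
```

With this shape the caller sees the same signature on every architecture; only the body of the wait differs.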


* [dpdk-dev] [PATCH v10 0/5] use WFE for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (62 preceding siblings ...)
  2019-10-24 10:42 ` [dpdk-dev] [PATCH v9 5/5] event/opdl: " Gavin Hu
@ 2019-10-25 15:39 ` Gavin Hu
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 1/5] bus/fslmc: fix the conflicting dmb function Gavin Hu
                   ` (17 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-25 15:39 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

V10:
- move arm specific stuff to arch/arm/rte_pause_64.h
V9:
- fix a broken weblink (David Marchand)
- define rte_wfe and rte_sev() (Ananyev Konstantin)
- explicitly define three function APIs instead of macros (Ananyev Konstantin)
- incorporate common rte_wfe and rte_sev into the generic rte_spinlock (David Marchand)
- define arch neutral RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED (Ananyev Konstantin)
- define rte_load_ex_16/32/64 functions to use load-exclusive instruction for aarch64, which is required for wake up of WFE
- drop the rte_spinlock patch from this series, as it calls this experimental API and is widely included by many components, each of which would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the future, after the experimental tag is removed.
V8:
- simplify dmb definition to use io barriers (David Marchand)
- define wfe() and sev() macros and use them inside normal C code (Ananyev Konstantin)
- pass memorder as parameter, not to incorporate it into function name, less functions, similar to C11 atomic intrinsics (Ananyev Konstantin)
- remove mandating RTE_FORCE_INTRINSICS in arm spinlock implementation (David Marchand)
- undef __WAIT_UNTIL_EQUAL after use (David Marchand)
- add experimental tag and warning (David Marchand)
- add the limitation of using WFE instruction in the commit log (David Marchand)
- tweak the use of RTE_FORCE_INTRINSICS (still mandatory for aarch64) and RTE_ARM_USE_WFE for spinlock (David Marchand)
- drop the rte_ring patch from this series, as rte_ring.h calls this API and is widely included by many components, each of which would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the future, after the experimental tag is removed.
V7:
- fix the checkpatch LONG_LINE_COMMENT issue
V6:
- squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
- move the new configuration to the end of EAL
- add doxygen comments to reflect the relaxed and acquire semantics
- correct the meson configuration 
V5:
- add doxygen comments for the new APIs
- spinlock early exit without wfe if the spinlock is not taken by others.
- add two patches on top for opdl and thunderx
V4:
- rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for the assembly skeleton to reduce the duplication of code
- add one patch for nxp fslmc to address a compiling error
V3:
- Convert RFCs to patches
V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline 
V1:
- Add the new APIs and use it for ring and locks
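
The structure that the changelog converges on (compare first and exit early, then SEVL followed by a WFE loop over load-exclusive reads) can be sketched as below. This is an illustration under stated assumptions rather than the code of the series: the demo_* names are hypothetical, the aarch64 assembly follows the inline asm quoted earlier in this thread, and other architectures get a plain atomic load with no-op event helpers.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__)
/* SEVL sets the local event register so the first WFE returns at once. */
static inline void demo_sevl(void)
{
	__asm__ __volatile__("sevl" : : : "memory");
}

static inline void demo_wfe(void)
{
	__asm__ __volatile__("wfe" : : : "memory");
}

/* Load-exclusive arms the monitor, so a later store to *addr wakes WFE. */
static inline uint32_t
demo_load_ex_32(volatile uint32_t *addr, int memorder)
{
	uint32_t tmp;

	if (memorder == __ATOMIC_ACQUIRE)
		__asm__ __volatile__("ldaxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp) : [addr] "r" (addr) : "memory");
	else
		__asm__ __volatile__("ldxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp) : [addr] "r" (addr) : "memory");
	return tmp;
}
#else
/* Assumed fallback for other architectures: no events, plain atomic load. */
static inline void demo_sevl(void)
{
}

static inline void demo_wfe(void)
{
	__asm__ __volatile__("" ::: "memory");
}

static inline uint32_t
demo_load_ex_32(volatile uint32_t *addr, int memorder)
{
	return __atomic_load_n(addr, memorder);
}
#endif

static inline void
demo_wfe_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);

	/* Early exit: skip the event machinery if the value already matches. */
	if (demo_load_ex_32(addr, memorder) != expected) {
		demo_sevl();
		do {
			demo_wfe();
		} while (demo_load_ex_32(addr, memorder) != expected);
	}
}
```

Passing memorder as a parameter, as in the V8 entry, keeps one function per width instead of one per width and ordering.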

Gavin Hu (5):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |   8 +-
 drivers/event/opdl/Makefile                        |   1 +
 drivers/event/opdl/meson.build                     |   1 +
 drivers/event/opdl/opdl_ring.c                     |   5 +-
 drivers/net/thunderx/Makefile                      |   1 +
 drivers/net/thunderx/meson.build                   |   1 +
 drivers/net/thunderx/nicvf_rxtx.c                  |   3 +-
 .../common/include/arch/arm/rte_pause_64.h         | 188 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 108 ++++++++++++
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 12 files changed, 313 insertions(+), 12 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH v10 1/5] bus/fslmc: fix the conflicting dmb function
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (63 preceding siblings ...)
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 0/5] use WFE for aarch64 Gavin Hu
@ 2019-10-25 15:39 ` Gavin Hu
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 2/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (16 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-25 15:39 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper, stable

There are two definitions conflicting with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to reuse the EAL definition to avoid conflicts.

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..68ce38b 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -31,12 +31,10 @@ struct fsl_mc_io {
 #include <errno.h>
 #include <sys/uio.h>
 #include <linux/byteorder/little_endian.h>
+#include <rte_atomic.h>
 
-#ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
-#endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#define __iormb()	rte_io_rmb()
+#define __iowmb()	rte_io_wmb()
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v10 2/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (64 preceding siblings ...)
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 1/5] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-10-25 15:39 ` Gavin Hu
  2019-10-25 17:27   ` Ananyev, Konstantin
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 3/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (15 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-10-25 15:39 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

The rte_wait_until_equal_xx APIs abstract the functionality of
'polling for a memory location to become equal to a given value'.

Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
by default. When it is enabled, the above APIs will call WFE instruction
to save CPU cycles and power.

When called from a VM on aarch64, this API may trap in and out to
release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
adaptive trapping mechanism has been available to balance latency and
workload.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 .../common/include/arch/arm/rte_pause_64.h         | 188 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 108 ++++++++++++
 4 files changed, 302 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..b4b4cac 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -26,6 +26,7 @@ flags_common_default = [
 	['RTE_LIBRTE_AVP_PMD', false],
 
 	['RTE_SCHED_VECTOR', false],
+	['RTE_ARM_USE_WFE', false],
 ]
 
 flags_generic = [
diff --git a/config/common_base b/config/common_base
index e843a21..c812156 100644
--- a/config/common_base
+++ b/config/common_base
@@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
+# calling these APIs puts the cores in a low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
 
 #
 # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..dd37f72 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -10,6 +11,11 @@ extern "C" {
 #endif
 
 #include <rte_common.h>
+
+#ifdef RTE_ARM_USE_WFE
+#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+#endif
+
 #include "generic/rte_pause.h"
 
 static inline void rte_pause(void)
@@ -17,6 +23,188 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+/**
+ * Send an event to exit WFE.
+ */
+static inline void rte_sevl(void);
+
+/**
+ * Put processor into low power WFE(Wait For Event) state
+ */
+static inline void rte_wfe(void);
+
+#ifdef RTE_ARM_USE_WFE
+static inline void rte_sevl(void)
+{
+	asm volatile("sevl" : : : "memory");
+}
+
+static inline void rte_wfe(void)
+{
+	asm volatile("wfe" : : : "memory");
+}
+#else
+static inline void rte_sevl(void)
+{
+}
+static inline void rte_wfe(void)
+{
+	rte_pause();
+}
+#endif
+
+#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic exclusive load from addr. It returns the 16-bit content of *addr
+ * while making it 'monitored'. When it is written by someone else, the
+ * 'monitored' state is cleared and an event is generated implicitly to exit
+ * WFE.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names; see the C++11
+ *  standard or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint16_t
+__atomic_load_ex_16(volatile uint16_t *addr, int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic exclusive load from addr. It returns the 32-bit content of *addr
+ * while making it 'monitored'. When it is written by someone else, the
+ * 'monitored' state is cleared and an event is generated implicitly to exit
+ * WFE.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names; see the C++11
+ *  standard or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint32_t
+__atomic_load_ex_32(volatile uint32_t *addr, int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic exclusive load from addr. It returns the 64-bit content of *addr
+ * while making it 'monitored'. When it is written by someone else, the
+ * 'monitored' state is cleared and an event is generated implicitly to exit
+ * WFE.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names; see the C++11
+ *  standard or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint64_t
+__atomic_load_ex_64(volatile uint64_t *addr, int memorder);
+
+static __rte_always_inline uint16_t
+__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
+{
+	uint16_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxrh %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline uint32_t
+__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
+{
+	uint32_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxr %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxr %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline uint64_t
+__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
+{
+	uint64_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxr %x[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxr %x[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder)
+{
+	if (__atomic_load_n(addr, memorder) != expected) {
+		rte_sevl();
+		do {
+			rte_wfe();
+		} while (__atomic_load_ex_16(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder)
+{
+	if (__atomic_load_n(addr, memorder) != expected) {
+		rte_sevl();
+		do {
+			rte_wfe();
+		} while (__atomic_load_n(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder)
+{
+	if (__atomic_load_n(addr, memorder) != expected) {
+		rte_sevl();
+		do {
+			rte_wfe();
+		} while (__atomic_load_n(addr, memorder) != expected);
+	}
+}
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..9854455 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
@@ -12,6 +13,12 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+#include <rte_compat.h>
+#include <assert.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +27,105 @@
  */
 static inline void rte_pause(void);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder);
+
+#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder)
+{
+	if (__atomic_load_n(addr, memorder) != expected) {
+		do {
+			rte_pause();
+		} while (__atomic_load_n(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder)
+{
+	if (__atomic_load_n(addr, memorder) != expected) {
+		do {
+			rte_pause();
+		} while (__atomic_load_n(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder)
+{
+	if (__atomic_load_n(addr, memorder) != expected) {
+		do {
+			rte_pause();
+		} while (__atomic_load_n(addr, memorder) != expected);
+	}
+}
+#endif
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v10 3/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (65 preceding siblings ...)
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 2/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-10-25 15:39 ` Gavin Hu
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 4/5] net/thunderx: use new API to save cycles " Gavin Hu
                   ` (14 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-25 15:39 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

When using a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1]
showed variance between runs, but no notable performance gain or
degradation was seen with this patch compared to without it.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/docs/440/neoverse-n1-sdp

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
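
For readers following the change, the lock path can be sketched in
portable C as shown here. This is a minimal sketch under stated
assumptions, not the DPDK implementation: every sketch_tl_* name and the
struct layout are hypothetical, the pause hint is only a compiler barrier,
and the wait helper is the generic (non-WFE) form of the wait API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): only a compiler barrier. */
static inline void sketch_tl_pause(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/* Generic (non-WFE) form of the wait: poll until *addr == expected. */
static inline void
sketch_tl_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
		int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	while (__atomic_load_n(addr, memorder) != expected)
		sketch_tl_pause();
}

/* Hypothetical ticket lock layout: two 16-bit counters. */
struct sketch_ticketlock {
	uint16_t current; /* ticket currently being served */
	uint16_t next;    /* next ticket to hand out */
};

static inline void sketch_tl_lock(struct sketch_ticketlock *tl)
{
	/* Take a ticket, then wait until that ticket is being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	sketch_tl_wait_until_equal_16(&tl->current, me, __ATOMIC_ACQUIRE);
}

static inline void sketch_tl_unlock(struct sketch_ticketlock *tl)
{
	/* Only the lock holder writes current, so load + store suffices. */
	uint16_t i = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(i + 1), __ATOMIC_RELEASE);
}
```

In the uncontended case the first load already matches the ticket, so the
wait helper returns without issuing a pause hint.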

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..c295ae7 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v10 4/5] net/thunderx: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (66 preceding siblings ...)
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 3/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-10-25 15:39 ` " Gavin Hu
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 5/5] event/opdl: " Gavin Hu
                   ` (13 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-25 15:39 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/net/thunderx/Makefile     | 1 +
 drivers/net/thunderx/meson.build  | 1 +
 drivers/net/thunderx/nicvf_rxtx.c | 3 +--
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/thunderx/Makefile b/drivers/net/thunderx/Makefile
index e6bf497..9e0de10 100644
--- a/drivers/net/thunderx/Makefile
+++ b/drivers/net/thunderx/Makefile
@@ -10,6 +10,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_pmd_thunderx_nicvf.a
 
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 
 LDLIBS += -lm
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
diff --git a/drivers/net/thunderx/meson.build b/drivers/net/thunderx/meson.build
index 69819a9..23d9458 100644
--- a/drivers/net/thunderx/meson.build
+++ b/drivers/net/thunderx/meson.build
@@ -4,6 +4,7 @@
 subdir('base')
 objs = [base_objs]
 
+allow_experimental_apis = true
 sources = files('nicvf_rxtx.c',
 		'nicvf_ethdev.c',
 		'nicvf_svf.c'
diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
index 1c42874..90a6098 100644
--- a/drivers/net/thunderx/nicvf_rxtx.c
+++ b/drivers/net/thunderx/nicvf_rxtx.c
@@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
 		ltail++;
 	}
 
-	while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
-		rte_pause();
+	rte_wait_until_equal_32(&rbdr->tail, next_tail, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
 	nicvf_addr_write(door, to_fill);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v10 5/5] event/opdl: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (67 preceding siblings ...)
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 4/5] net/thunderx: use new API to save cycles " Gavin Hu
@ 2019-10-25 15:39 ` " Gavin Hu
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 0/5] use WFE for aarch64 Gavin Hu
                   ` (12 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-25 15:39 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/event/opdl/Makefile    | 1 +
 drivers/event/opdl/meson.build | 1 +
 drivers/event/opdl/opdl_ring.c | 5 ++---
 3 files changed, 4 insertions(+), 3 deletions(-)
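
The tail hand-off that this patch touches can be sketched in portable C
as shown here. This is a simplified illustration, not the opdl code: the
sketch_ring_* names, the ring size, and the fetch-add reservation are
assumptions, capacity checks are omitted, and the wait helper is the
generic (non-WFE) form of the wait API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): only a compiler barrier. */
static inline void sketch_ring_pause(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/* Generic (non-WFE) form of the wait: poll until *addr == expected. */
static inline void
sketch_ring_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
	while (__atomic_load_n(addr, memorder) != expected)
		sketch_ring_pause();
}

#define SKETCH_RING_SIZE 8u /* hypothetical size, must be a power of two */

struct sketch_ring {
	uint32_t head; /* next slot index to reserve */
	uint32_t tail; /* entries below tail are published */
	uint32_t slots[SKETCH_RING_SIZE];
};

/*
 * Multi-producer input: reserve slots, fill them, then publish in
 * reservation order. No capacity check is done in this sketch.
 */
static inline void
sketch_ring_input(struct sketch_ring *r, const uint32_t *entries, uint32_t num)
{
	uint32_t old_head = __atomic_fetch_add(&r->head, num, __ATOMIC_RELAXED);
	uint32_t i;

	for (i = 0; i < num; i++)
		r->slots[(old_head + i) & (SKETCH_RING_SIZE - 1)] = entries[i];

	/* Wait for producers that reserved earlier slots to publish first. */
	sketch_ring_wait_until_equal_32(&r->tail, old_head, __ATOMIC_ACQUIRE);

	__atomic_store_n(&r->tail, old_head + num, __ATOMIC_RELEASE);
}
```

With a single producer the tail always equals the reserved head, so the
wait returns immediately; the wait only matters when another producer
reserved earlier slots and has not yet published them.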

diff --git a/drivers/event/opdl/Makefile b/drivers/event/opdl/Makefile
index bf50a60..72ef07d 100644
--- a/drivers/event/opdl/Makefile
+++ b/drivers/event/opdl/Makefile
@@ -9,6 +9,7 @@ LIB = librte_pmd_opdl_event.a
 # build flags
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 # for older GCC versions, allow us to initialize an event using
 # designated initializers.
 ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
diff --git a/drivers/event/opdl/meson.build b/drivers/event/opdl/meson.build
index 1fe034e..e67b164 100644
--- a/drivers/event/opdl/meson.build
+++ b/drivers/event/opdl/meson.build
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2018 Luca Boccassi <bluca@debian.org>
 
+allow_experimental_apis = true
 sources = files(
 	'opdl_evdev.c',
 	'opdl_evdev_init.c',
diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
index 06fb5b3..c8d19fe 100644
--- a/drivers/event/opdl/opdl_ring.c
+++ b/drivers/event/opdl/opdl_ring.c
@@ -16,6 +16,7 @@
 #include <rte_memcpy.h>
 #include <rte_memory.h>
 #include <rte_memzone.h>
+#include <rte_atomic.h>
 
 #include "opdl_ring.h"
 #include "opdl_log.h"
@@ -474,9 +475,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
 	/* If another thread started inputting before this one, but hasn't
 	 * finished, we need to wait for it to complete to update the tail.
 	 */
-	while (unlikely(__atomic_load_n(&s->shared.tail, __ATOMIC_ACQUIRE) !=
-			old_head))
-		rte_pause();
+	rte_wait_until_equal_32(&s->shared.tail, old_head, __ATOMIC_ACQUIRE);
 
 	__atomic_store_n(&s->shared.tail, old_head + num_entries,
 			__ATOMIC_RELEASE);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v10 2/5] eal: add the APIs to wait until equal
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 2/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-10-25 17:27   ` Ananyev, Konstantin
  2019-10-27 13:03     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 163+ messages in thread
From: Ananyev, Konstantin @ 2019-10-25 17:27 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, david.marchand, thomas, stephen, hemant.agrawal, jerinj,
	pbhagavatula, Honnappa.Nagarahalli, ruifeng.wang, phil.yang,
	steve.capper

Hi Gavin,

> The rte_wait_until_equal_xx APIs abstract the functionality of
> 'polling for a memory location to become equal to a given value'.
> 
> Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> by default. When it is enabled, the above APIs will call WFE instruction
> to save CPU cycles and power.
> 
> From a VM, when calling this API on aarch64, it may trap in and out to
> release vCPUs whereas cause high exit latency. Since kernel 4.18.20 an
> adaptive trapping mechanism is introduced to balance the latency and
> workload.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
>  config/arm/meson.build                             |   1 +
>  config/common_base                                 |   5 +
>  .../common/include/arch/arm/rte_pause_64.h         | 188 +++++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  | 108 ++++++++++++
>  4 files changed, 302 insertions(+)
> 
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..b4b4cac 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -26,6 +26,7 @@ flags_common_default = [
>  	['RTE_LIBRTE_AVP_PMD', false],
> 
>  	['RTE_SCHED_VECTOR', false],
> +	['RTE_ARM_USE_WFE', false],
>  ]
> 
>  flags_generic = [
> diff --git a/config/common_base b/config/common_base
> index e843a21..c812156 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
>  CONFIG_RTE_MALLOC_DEBUG=n
>  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
>  CONFIG_RTE_USE_LIBBSD=n
> +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> +# calling these APIs put the cores in low power state while waiting
> +# for the memory address to become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_ARM_USE_WFE=n
> 
>  #
>  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..dd37f72 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
> 
>  #ifndef _RTE_PAUSE_ARM64_H_
> @@ -10,6 +11,11 @@ extern "C" {
>  #endif
> 
>  #include <rte_common.h>
> +
> +#ifdef RTE_ARM_USE_WFE
> +#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +#endif
> +
>  #include "generic/rte_pause.h"
> 
>  static inline void rte_pause(void)
> @@ -17,6 +23,188 @@ static inline void rte_pause(void)
>  	asm volatile("yield" ::: "memory");
>  }
> 
> +/**
> + * Send an event to exit WFE.
> + */
> +static inline void rte_sevl(void);
> +
> +/**
> + * Put processor into low power WFE(Wait For Event) state
> + */
> +static inline void rte_wfe(void);
> +
> +#ifdef RTE_ARM_USE_WFE
> +static inline void rte_sevl(void)
> +{
> +	asm volatile("sevl" : : : "memory");
> +}
> +
> +static inline void rte_wfe(void)
> +{
> +	asm volatile("wfe" : : : "memory");
> +}
> +#else
> +static inline void rte_sevl(void)
> +{
> +}
> +static inline void rte_wfe(void)
> +{
> +	rte_pause();
> +}
> +#endif
> +
> +#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic exclusive load from addr, it returns the 16-bit content of *addr
> + * while making it 'monitored',when it is written by someone else, the
> + * 'monitored' state is cleared and a event is generated implicitly to exit
> + * WFE.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint16_t
> +__atomic_load_ex_16(volatile uint16_t *addr, int memorder);

I still think that, as this is a public header, it is better to have all function names prefixed with rte_,
or with __rte_ if you consider that users should not call them explicitly.
BTW, these _load_ex_ functions could be defined even if RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED is not defined,
though I don't know whether there would be any other non-WFE usages for them.

> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic exclusive load from addr, it returns the 32-bit content of *addr
> + * while making it 'monitored',when it is written by someone else, the
> + * 'monitored' state is cleared and a event is generated implicitly to exit
> + * WFE.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint32_t
> +__atomic_load_ex_32(volatile uint32_t *addr, int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic exclusive load from addr, it returns the 64-bit content of *addr
> + * while making it 'monitored',when it is written by someone else, the
> + * 'monitored' state is cleared and a event is generated implicitly to exit
> + * WFE.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint64_t
> +__atomic_load_ex_64(volatile uint64_t *addr, int memorder);
> +
> +static __rte_always_inline uint16_t
> +__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> +{
> +	uint16_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	if (memorder == __ATOMIC_ACQUIRE)
> +		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	else if (memorder == __ATOMIC_RELAXED)
> +		asm volatile("ldxrh %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	return tmp;
> +}
> +
> +static __rte_always_inline uint32_t
> +__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> +{
> +	uint32_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	if (memorder == __ATOMIC_ACQUIRE)
> +		asm volatile("ldaxr %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	else if (memorder == __ATOMIC_RELAXED)
> +		asm volatile("ldxr %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	return tmp;
> +}
> +
> +static __rte_always_inline uint64_t
> +__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> +{
> +	uint64_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	if (memorder == __ATOMIC_ACQUIRE)
> +		asm volatile("ldaxr %x[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	else if (memorder == __ATOMIC_RELAXED)
> +		asm volatile("ldxr %x[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	return tmp;
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_n(addr, memorder) != expected) {
> +		rte_sevl();
> +		do {
> +			rte_wfe();
> +		} while (__atomic_load_ex_16(addr, memorder) != expected);
> +	}
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_n(addr, memorder) != expected) {
> +		rte_sevl();
> +		do {
> +			rte_wfe();
> +		} while (__atomic_load_n(addr, memorder) != expected);

while (__atomic_load_ex_32(addr, memorder) != expected);
?
Same for 64 bit version
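
A sketch of the 32-bit loop with the exclusive load, for illustration
only. Going by the doc comment in the patch, the exclusive load is what
marks the location as monitored so that a later write generates the event
that ends WFE. The sketch_ex_* names are hypothetical, and the non-aarch64
branches are fallbacks added here only so the sketch builds elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Exclusive load on aarch64; plain atomic load elsewhere (fallback). */
static inline uint32_t
sketch_ex_load_32(volatile uint32_t *addr, int memorder)
{
	uint32_t tmp;

	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
#if defined(__aarch64__)
	if (memorder == __ATOMIC_ACQUIRE)
		__asm__ __volatile__("ldaxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp)
			: [addr] "r" (addr)
			: "memory");
	else
		__asm__ __volatile__("ldxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp)
			: [addr] "r" (addr)
			: "memory");
#else
	tmp = __atomic_load_n(addr, memorder);
#endif
	return tmp;
}

/* Set the local event register so the first WFE falls through. */
static inline void sketch_ex_sevl(void)
{
#if defined(__aarch64__)
	__asm__ __volatile__("sevl" ::: "memory");
#endif
}

/* Wait for an event on aarch64; compiler barrier elsewhere (fallback). */
static inline void sketch_ex_wfe(void)
{
#if defined(__aarch64__)
	__asm__ __volatile__("wfe" ::: "memory");
#else
	__asm__ __volatile__("" ::: "memory");
#endif
}

static inline void
sketch_ex_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	if (__atomic_load_n(addr, memorder) != expected) {
		sketch_ex_sevl();
		do {
			sketch_ex_wfe();
			/* Exclusive load re-arms the monitor each iteration. */
		} while (sketch_ex_load_32(addr, memorder) != expected);
	}
}
```

The 64-bit variant would differ only in the operand width (%x[tmp]
instead of %w[tmp]) and the integer type.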

> +	}
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_n(addr, memorder) != expected) {
> +		rte_sevl();
> +		do {
> +			rte_wfe();
> +		} while (__atomic_load_n(addr, memorder) != expected);
> +	}
> +}
> +#endif
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> index 52bd4db..9854455 100644
> --- a/lib/librte_eal/common/include/generic/rte_pause.h
> +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
> 
>  #ifndef _RTE_PAUSE_H_
> @@ -12,6 +13,12 @@
>   *
>   */
> 
> +#include <stdint.h>
> +#include <rte_common.h>
> +#include <rte_atomic.h>
> +#include <rte_compat.h>
> +#include <assert.h>
> +
>  /**
>   * Pause CPU execution for a short while
>   *
> @@ -20,4 +27,105 @@
>   */
>  static inline void rte_pause(void);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 16-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 32-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 64-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> +int memorder);
> +
> +#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_n(addr, memorder) != expected) {
> +		do {
> +			rte_pause();
> +		} while (__atomic_load_n(addr, memorder) != expected);
> +	}

I think, these generic implementations could be just:
while (__atomic_load_n(addr, memorder) != expected)
	rte_pause();
	
Other than that:
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>


> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_n(addr, memorder) != expected) {
> +		do {
> +			rte_pause();
> +		} while (__atomic_load_n(addr, memorder) != expected);
> +	}
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_n(addr, memorder) != expected) {
> +		do {
> +			rte_pause();
> +		} while (__atomic_load_n(addr, memorder) != expected);
> +	}
> +}
> +#endif
> +
>  #endif /* _RTE_PAUSE_H_ */
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v11 0/5] use WFE for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (68 preceding siblings ...)
  2019-10-25 15:39 ` [dpdk-dev] [PATCH v10 5/5] event/opdl: " Gavin Hu
@ 2019-10-27 12:52 ` Gavin Hu
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 1/5] bus/fslmc: fix the conflicting dmb function Gavin Hu
                   ` (11 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-27 12:52 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

V11:
- add rte_ prefix to the __atomic_load_ex_x functions (Ananyev Konstantin)
- define the above rte_atomic_load_ex_x functions even if RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED is not defined, for future non-WFE usages (Ananyev Konstantin)
- use the above functions for the arm specific rte_wait_until_equal_x functions (Ananyev Konstantin)
- simplify the generic implementation by merging the "if" into the "while" (Ananyev Konstantin)

V10:
- move arm specific stuff to arch/arm/rte_pause_64.h(Ananyev Konstantin)
V9:
- fix a broken weblink (David Marchand)
- define rte_wfe and rte_sev() (Ananyev Konstantin)
- explicitly define three function APIs instead of macros (Ananyev Konstantin)
- incorporate common rte_wfe and rte_sev into the generic rte_spinlock (David Marchand)
- define arch neutral RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED (Ananyev Konstantin)
- define rte_load_ex_16/32/64 functions to use load-exclusive instruction for aarch64, which is required for wake up of WFE
- drop the rte_spinlock patch from this series, as it calls this experimental API and is widely included by many components, each of which would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the future, after the experimental tag is removed.
V8:
- simplify dmb definition to use io barriers (David Marchand)
- define wfe() and sev() macros and use them inside normal C code (Ananyev Konstantin)
- pass memorder as parameter, not to incorporate it into function name, less functions, similar to C11 atomic intrinsics (Ananyev Konstantin)
- remove mandating RTE_FORCE_INTRINSICS in arm spinlock implementation(David Marchand)
- undef __WAIT_UNTIL_EQUAL after use (David Marchand)
- add experimental tag and warning (David Marchand)
- add the limitation of using WFE instruction in the commit log (David Marchand) 
- tweak the use of RTE_FORCE_INTRINSICS (still mandatory for aarch64) and RTE_ARM_USE_WFE for spinlock (David Marchand)
- drop the rte_ring patch from this series, as rte_ring.h calls this API and is widely included by many components, each of which would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the future, after the experimental tag is removed.
V7:
- fix the checkpatch LONG_LINE_COMMENT issue
V6:
- squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
- move the new configuration to the end of EAL
- add doxygen comments to reflect the relaxed and acquire semantics
- correct the meson configuration 
V5:
- add doxygen comments for the new APIs
- spinlock early exit without wfe if the spinlock is not taken by others.
- add two patches on top for opdl and thunderx
V4:
- rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for the assembly skeleton to reduce the duplication of code
- add one patch for nxp fslmc to address a compiling error
V3:
- Convert RFCs to patches
V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline 
V1:
- Add the new APIs and use it for ring and locks

Gavin Hu (5):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |   8 +-
 drivers/event/opdl/Makefile                        |   1 +
 drivers/event/opdl/meson.build                     |   1 +
 drivers/event/opdl/opdl_ring.c                     |   5 +-
 drivers/net/thunderx/Makefile                      |   1 +
 drivers/net/thunderx/meson.build                   |   1 +
 drivers/net/thunderx/nicvf_rxtx.c                  |   3 +-
 .../common/include/arch/arm/rte_pause_64.h         | 188 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  99 +++++++++++
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 12 files changed, 304 insertions(+), 12 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v11 1/5] bus/fslmc: fix the conflicting dmb function
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (69 preceding siblings ...)
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 0/5] use WFE for aarch64 Gavin Hu
@ 2019-10-27 12:52 ` Gavin Hu
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 2/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (10 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-27 12:52 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper, stable

There are two definitions conflicting with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to reuse the EAL definition to avoid conflicts.

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@
VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phi.yang@arm.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..68ce38b 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -31,12 +31,10 @@ struct fsl_mc_io {
 #include <errno.h>
 #include <sys/uio.h>
 #include <linux/byteorder/little_endian.h>
+#include <rte_atomic.h>
 
-#ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
-#endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#define __iormb()	rte_io_rmb()
+#define __iowmb()	rte_io_wmb()
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v11 2/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (70 preceding siblings ...)
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 1/5] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-10-27 12:52 ` Gavin Hu
  2019-10-27 20:49   ` David Marchand
  2019-10-27 22:19   ` Ananyev, Konstantin
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 3/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (9 subsequent siblings)
  81 siblings, 2 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-27 12:52 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

The rte_wait_until_equal_xx APIs abstract the functionality of
'polling for a memory location to become equal to a given value'.

Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
by default. When it is enabled, the above APIs will call WFE instruction
to save CPU cycles and power.

When this API is called from a VM on aarch64, WFE may trap in and out to
release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
adaptive trapping mechanism has been in place to balance the latency and
workload.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 .../common/include/arch/arm/rte_pause_64.h         | 188 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  99 +++++++++++
 4 files changed, 293 insertions(+)
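
Note on the generic fallback: the sketch below is not part of the patch. It
only illustrates why merging the "if" into the "while" (the v11 change to
generic/rte_pause.h) does not alter behaviour. All names in it (sim_word,
sim_pause, wait_nested, wait_flat) are hypothetical, and the "other core" is
simulated by a pause hook that writes the expected value after a fixed number
of calls, so the comparison stays single-threaded and deterministic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins, not DPDK code. */
static volatile uint32_t sim_word;   /* the polled memory location */
static unsigned int sim_pauses;      /* pause calls made so far */
static unsigned int sim_write_after; /* pause call on which the write lands */

/* Stand-in for rte_pause(); also plays the role of the writing core. */
static void sim_pause(void)
{
	if (++sim_pauses == sim_write_after)
		__atomic_store_n(&sim_word, 1, __ATOMIC_RELEASE);
}

/* write_after == 0 means the expected value is already in place. */
static void sim_reset(unsigned int write_after)
{
	sim_pauses = 0;
	sim_write_after = write_after;
	__atomic_store_n(&sim_word, write_after == 0 ? 1 : 0,
			__ATOMIC_RELAXED);
}

/* v10 shape: early check, then do/while. Returns the pause count. */
static unsigned int wait_nested(unsigned int write_after)
{
	sim_reset(write_after);
	if (__atomic_load_n(&sim_word, __ATOMIC_ACQUIRE) != 1) {
		do {
			sim_pause();
		} while (__atomic_load_n(&sim_word, __ATOMIC_ACQUIRE) != 1);
	}
	return sim_pauses;
}

/* v11 shape: a single while loop. Returns the pause count. */
static unsigned int wait_flat(unsigned int write_after)
{
	sim_reset(write_after);
	while (__atomic_load_n(&sim_word, __ATOMIC_ACQUIRE) != 1)
		sim_pause();
	return sim_pauses;
}
```

Both shapes load once before the first pause and once after each pause, so
they issue the same number of pauses for any arrival time of the write. The
aarch64 WFE variant keeps the nested shape because rte_sevl() has to run once
before the first rte_wfe().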

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..b4b4cac 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -26,6 +26,7 @@ flags_common_default = [
 	['RTE_LIBRTE_AVP_PMD', false],
 
 	['RTE_SCHED_VECTOR', false],
+	['RTE_ARM_USE_WFE', false],
 ]
 
 flags_generic = [
diff --git a/config/common_base b/config/common_base
index e843a21..c812156 100644
--- a/config/common_base
+++ b/config/common_base
@@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
+# calling these APIs puts the cores in low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
 
 #
 # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..1680d7a 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -10,6 +11,11 @@ extern "C" {
 #endif
 
 #include <rte_common.h>
+
+#ifdef RTE_ARM_USE_WFE
+#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+#endif
+
 #include "generic/rte_pause.h"
 
 static inline void rte_pause(void)
@@ -17,6 +23,188 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+/**
+ * Send an event to quit WFE.
+ */
+static inline void rte_sevl(void);
+
+/**
+ * Put processor into low power WFE(Wait For Event) state
+ */
+static inline void rte_wfe(void);
+
+#ifdef RTE_ARM_USE_WFE
+static inline void rte_sevl(void)
+{
+	asm volatile("sevl" : : : "memory");
+}
+
+static inline void rte_wfe(void)
+{
+	asm volatile("wfe" : : : "memory");
+}
+#else
+static inline void rte_sevl(void)
+{
+}
+static inline void rte_wfe(void)
+{
+	rte_pause();
+}
+#endif
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic exclusive load from addr. It returns the 16-bit content of *addr
+ * while making it 'monitored'. When it is written by someone else, the
+ * 'monitored' state is cleared and an event is generated implicitly to exit
+ * WFE.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint16_t
+rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic exclusive load from addr. It returns the 32-bit content of *addr
+ * while making it 'monitored'. When it is written by someone else, the
+ * 'monitored' state is cleared and an event is generated implicitly to exit
+ * WFE.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint32_t
+rte_atomic_load_ex_32(volatile uint32_t *addr, int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Atomic exclusive load from addr. It returns the 64-bit content of *addr
+ * while making it 'monitored'. When it is written by someone else, the
+ * 'monitored' state is cleared and an event is generated implicitly to exit
+ * WFE.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint64_t
+rte_atomic_load_ex_64(volatile uint64_t *addr, int memorder);
+
+static __rte_always_inline uint16_t
+rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder)
+{
+	uint16_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxrh %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline uint32_t
+rte_atomic_load_ex_32(volatile uint32_t *addr, int memorder)
+{
+	uint32_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxr %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxr %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline uint64_t
+rte_atomic_load_ex_64(volatile uint64_t *addr, int memorder)
+{
+	uint64_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxr %x[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxr %x[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder)
+{
+	if (__atomic_load_n(addr, memorder) != expected) {
+		rte_sevl();
+		do {
+			rte_wfe();
+		} while (rte_atomic_load_ex_16(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder)
+{
+	if (__atomic_load_n(addr, memorder) != expected) {
+		rte_sevl();
+		do {
+			rte_wfe();
+		} while (rte_atomic_load_ex_32(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder)
+{
+	if (__atomic_load_n(addr, memorder) != expected) {
+		rte_sevl();
+		do {
+			rte_wfe();
+		} while (rte_atomic_load_ex_64(addr, memorder) != expected);
+	}
+}
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..9d42e32 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
@@ -12,6 +13,12 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+#include <rte_compat.h>
+#include <assert.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +27,96 @@
  */
 static inline void rte_pause(void);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder);
+
+#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder)
+{
+	while (__atomic_load_n(addr, memorder) != expected)
+		rte_pause();
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder)
+{
+	while (__atomic_load_n(addr, memorder) != expected)
+		rte_pause();
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder)
+{
+	while (__atomic_load_n(addr, memorder) != expected)
+		rte_pause();
+}
+#endif
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v11 3/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (71 preceding siblings ...)
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 2/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-10-27 12:52 ` Gavin Hu
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 4/5] net/thunderx: use new API to save cycles " Gavin Hu
                   ` (8 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-27 12:52 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

While using ticket lock, cores repeatedly poll the lock variable.
This is replaced by rte_wait_until_equal API.

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1],
there were variances between runs, but no notable performance gain or
degradation was seen with or without this patch.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/\
docs/440/neoverse-n1-sdp

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
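
Note: a minimal sketch, not part of the patch, of how the lock path reads
once the polling loop is replaced by the wait helper. It assumes a non-WFE
build; wait_until_equal_16, struct ticketlock and the lock/unlock helpers are
hypothetical stand-ins written with the GCC __atomic builtins, not the DPDK
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_wait_until_equal_16() without WFE. */
static inline void
wait_until_equal_16(volatile uint16_t *addr, uint16_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		; /* a real build would call rte_pause() or WFE here */
}

/* Illustrative ticket lock; the layout is an assumption for this sketch. */
struct ticketlock {
	volatile uint16_t current; /* ticket currently being served */
	volatile uint16_t next;    /* next ticket to hand out */
};

static inline void
ticketlock_lock(struct ticketlock *tl)
{
	/* Take a ticket, then wait until it is the one being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	wait_until_equal_16(&tl->current, me, __ATOMIC_ACQUIRE);
}

static inline void
ticketlock_unlock(struct ticketlock *tl)
{
	/* Only the lock holder writes current, so a relaxed load suffices. */
	uint16_t i = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(i + 1), __ATOMIC_RELEASE);
}
```

The acquire order on the wait pairs with the release store in unlock, which is
why this call site passes __ATOMIC_ACQUIRE rather than __ATOMIC_RELAXED.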

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..c295ae7 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v11 4/5] net/thunderx: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (72 preceding siblings ...)
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 3/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-10-27 12:52 ` " Gavin Hu
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 5/5] event/opdl: " Gavin Hu
                   ` (7 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-27 12:52 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/net/thunderx/Makefile     | 1 +
 drivers/net/thunderx/meson.build  | 1 +
 drivers/net/thunderx/nicvf_rxtx.c | 3 +--
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/thunderx/Makefile b/drivers/net/thunderx/Makefile
index e6bf497..9e0de10 100644
--- a/drivers/net/thunderx/Makefile
+++ b/drivers/net/thunderx/Makefile
@@ -10,6 +10,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_pmd_thunderx_nicvf.a
 
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 
 LDLIBS += -lm
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
diff --git a/drivers/net/thunderx/meson.build b/drivers/net/thunderx/meson.build
index 69819a9..23d9458 100644
--- a/drivers/net/thunderx/meson.build
+++ b/drivers/net/thunderx/meson.build
@@ -4,6 +4,7 @@
 subdir('base')
 objs = [base_objs]
 
+allow_experimental_apis = true
 sources = files('nicvf_rxtx.c',
 		'nicvf_ethdev.c',
 		'nicvf_svf.c'
diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
index 1c42874..90a6098 100644
--- a/drivers/net/thunderx/nicvf_rxtx.c
+++ b/drivers/net/thunderx/nicvf_rxtx.c
@@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
 		ltail++;
 	}
 
-	while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
-		rte_pause();
+	rte_wait_until_equal_32(&rbdr->tail, next_tail, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
 	nicvf_addr_write(door, to_fill);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* [dpdk-dev] [PATCH v11 5/5] event/opdl: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (73 preceding siblings ...)
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 4/5] net/thunderx: use new API to save cycles " Gavin Hu
@ 2019-10-27 12:52 ` " Gavin Hu
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 0/5] use WFE for aarch64 Gavin Hu
                   ` (6 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-10-27 12:52 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, konstantin.ananyev, thomas, stephen,
	hemant.agrawal, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	ruifeng.wang, phil.yang, steve.capper

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/event/opdl/Makefile    | 1 +
 drivers/event/opdl/meson.build | 1 +
 drivers/event/opdl/opdl_ring.c | 5 ++---
 3 files changed, 4 insertions(+), 3 deletions(-)
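
Note: a minimal sketch, not part of the patch, of the multi-producer pattern
this change touches: each producer claims a slot range from a shared head,
then waits for the shared tail to reach the start of its range before
publishing. ring_claim and tail_commit are hypothetical helper names, and the
busy-wait stands in for rte_wait_until_equal_32() on a non-WFE build.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: reserve n slots and return the start of the reserved range. */
static inline uint32_t
ring_claim(volatile uint32_t *head, uint32_t n)
{
	return __atomic_fetch_add(head, n, __ATOMIC_RELAXED);
}

/* Hypothetical: publish [old_head, old_head + n) in claim order. */
static inline void
tail_commit(volatile uint32_t *tail, uint32_t old_head, uint32_t n)
{
	/* Wait for producers that claimed earlier ranges to publish first. */
	while (__atomic_load_n(tail, __ATOMIC_ACQUIRE) != old_head)
		; /* a real build would call rte_pause() or WFE here */

	/* Release makes the entries visible before the new tail value. */
	__atomic_store_n(tail, old_head + n, __ATOMIC_RELEASE);
}
```

Because each producer waits for one exact value of the tail, the wait maps
directly onto a wait-until-equal primitive.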

diff --git a/drivers/event/opdl/Makefile b/drivers/event/opdl/Makefile
index bf50a60..72ef07d 100644
--- a/drivers/event/opdl/Makefile
+++ b/drivers/event/opdl/Makefile
@@ -9,6 +9,7 @@ LIB = librte_pmd_opdl_event.a
 # build flags
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 # for older GCC versions, allow us to initialize an event using
 # designated initializers.
 ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
diff --git a/drivers/event/opdl/meson.build b/drivers/event/opdl/meson.build
index 1fe034e..e67b164 100644
--- a/drivers/event/opdl/meson.build
+++ b/drivers/event/opdl/meson.build
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2018 Luca Boccassi <bluca@debian.org>
 
+allow_experimental_apis = true
 sources = files(
 	'opdl_evdev.c',
 	'opdl_evdev_init.c',
diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
index 06fb5b3..c8d19fe 100644
--- a/drivers/event/opdl/opdl_ring.c
+++ b/drivers/event/opdl/opdl_ring.c
@@ -16,6 +16,7 @@
 #include <rte_memcpy.h>
 #include <rte_memory.h>
 #include <rte_memzone.h>
+#include <rte_atomic.h>
 
 #include "opdl_ring.h"
 #include "opdl_log.h"
@@ -474,9 +475,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
 	/* If another thread started inputting before this one, but hasn't
 	 * finished, we need to wait for it to complete to update the tail.
 	 */
-	while (unlikely(__atomic_load_n(&s->shared.tail, __ATOMIC_ACQUIRE) !=
-			old_head))
-		rte_pause();
+	rte_wait_until_equal_32(&s->shared.tail, old_head, __ATOMIC_ACQUIRE);
 
 	__atomic_store_n(&s->shared.tail, old_head + num_entries,
 			__ATOMIC_RELEASE);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v10 2/5] eal: add the APIs to wait until equal
  2019-10-25 17:27   ` Ananyev, Konstantin
@ 2019-10-27 13:03     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-27 13:03 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: nd, david.marchand, thomas, stephen, hemant.agrawal, jerinj,
	pbhagavatula, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd

Hi Konstantin,

I think all your comments are addressed in v11; please take a look. Thanks for your review, I really appreciate it.
/Gavin

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v11 2/5] eal: add the APIs to wait until equal
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 2/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-10-27 20:49   ` David Marchand
  2019-10-28  5:08     ` Gavin Hu (Arm Technology China)
  2019-10-27 22:19   ` Ananyev, Konstantin
  1 sibling, 1 reply; 163+ messages in thread
From: David Marchand @ 2019-10-27 20:49 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, nd, Ananyev, Konstantin, Thomas Monjalon, Stephen Hemminger,
	Hemant Agrawal, Jerin Jacob Kollanukkaran, Pavan Nikhilesh,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang, Steve Capper

On Sun, Oct 27, 2019 at 1:53 PM Gavin Hu <gavin.hu@arm.com> wrote:

[snip]

> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..1680d7a 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h

[snip]

> @@ -17,6 +23,188 @@ static inline void rte_pause(void)
>         asm volatile("yield" ::: "memory");
>  }
>
> +/**
> + * Send an event to quit WFE.
> + */
> +static inline void rte_sevl(void);
> +
> +/**
> + * Put processor into low power WFE(Wait For Event) state
> + */
> +static inline void rte_wfe(void);
> +
> +#ifdef RTE_ARM_USE_WFE
> +static inline void rte_sevl(void)
> +{
> +       asm volatile("sevl" : : : "memory");
> +}
> +
> +static inline void rte_wfe(void)
> +{
> +       asm volatile("wfe" : : : "memory");
> +}
> +#else
> +static inline void rte_sevl(void)
> +{
> +}
> +static inline void rte_wfe(void)
> +{
> +       rte_pause();
> +}
> +#endif
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice

Experimental?
Just complaining on principle: if this is experimental, the
__rte_experimental tag is missing.
But this API is a no-go for me anyway, see below.


> + *
> + * Atomic exclusive load from addr, it returns the 16-bit content of *addr
> + * while making it 'monitored',when it is written by someone else, the
> + * 'monitored' state is cleared and a event is generated implicitly to exit
> + * WFE.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint16_t
> +rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder);

This API does not make sense for anything but Arm, so the generic
rte_atomic_ prefix is not a good fit.

On Arm, when RTE_ARM_USE_WFE is undefined, why would you need it?
A non-exclusive load is enough, since WFE is not used in that case.
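
To make the point concrete, here is a minimal sketch (not code from the patch) where the exclusive load is only compiled when WFE is in use, and a plain atomic load is used otherwise. The names arm_load_exclusive_32() and poll_load_32() are hypothetical and only illustrate an Arm-specific prefix; the instructions are the ones in the quoted rte_atomic_load_ex_32().

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(RTE_ARM_USE_WFE)
/* Exclusive load: also arms the monitor that a later WFE waits on.
 * Hypothetical name, compiled only when WFE is actually used.
 */
static inline uint32_t
arm_load_exclusive_32(volatile uint32_t *addr, int memorder)
{
	uint32_t tmp;

	if (memorder == __ATOMIC_ACQUIRE)
		__asm__ __volatile__("ldaxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp)
			: [addr] "r" (addr)
			: "memory");
	else
		__asm__ __volatile__("ldxr %w[tmp], [%x[addr]]"
			: [tmp] "=&r" (tmp)
			: [addr] "r" (addr)
			: "memory");
	return tmp;
}
#endif

/* Load used by a polling loop; hypothetical wrapper for this sketch. */
static inline uint32_t
poll_load_32(volatile uint32_t *addr, int memorder)
{
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
#if defined(__aarch64__) && defined(RTE_ARM_USE_WFE)
	return arm_load_exclusive_32(addr, memorder);
#else
	/* No WFE, so nothing waits on the monitor: a plain load is enough. */
	return __atomic_load_n(addr, memorder);
#endif
}
```

With this shape, builds without RTE_ARM_USE_WFE never see the exclusive-load helper at all.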

[snip]

> +
> +static __rte_always_inline uint16_t
> +rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> +{
> +       uint16_t tmp;
> +       assert((memorder == __ATOMIC_ACQUIRE)
> +                       || (memorder == __ATOMIC_RELAXED));
> +       if (memorder == __ATOMIC_ACQUIRE)
> +               asm volatile("ldaxrh %w[tmp], [%x[addr]]"
> +                       : [tmp] "=&r" (tmp)
> +                       : [addr] "r"(addr)
> +                       : "memory");
> +       else if (memorder == __ATOMIC_RELAXED)
> +               asm volatile("ldxrh %w[tmp], [%x[addr]]"
> +                       : [tmp] "=&r" (tmp)
> +                       : [addr] "r"(addr)
> +                       : "memory");
> +       return tmp;
> +}
> +
> +static __rte_always_inline uint32_t
> +rte_atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> +{
> +       uint32_t tmp;
> +       assert((memorder == __ATOMIC_ACQUIRE)
> +                       || (memorder == __ATOMIC_RELAXED));
> +       if (memorder == __ATOMIC_ACQUIRE)
> +               asm volatile("ldaxr %w[tmp], [%x[addr]]"
> +                       : [tmp] "=&r" (tmp)
> +                       : [addr] "r"(addr)
> +                       : "memory");
> +       else if (memorder == __ATOMIC_RELAXED)
> +               asm volatile("ldxr %w[tmp], [%x[addr]]"
> +                       : [tmp] "=&r" (tmp)
> +                       : [addr] "r"(addr)
> +                       : "memory");
> +       return tmp;
> +}
> +
> +static __rte_always_inline uint64_t
> +rte_atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> +{
> +       uint64_t tmp;
> +       assert((memorder == __ATOMIC_ACQUIRE)
> +                       || (memorder == __ATOMIC_RELAXED));
> +       if (memorder == __ATOMIC_ACQUIRE)
> +               asm volatile("ldaxr %x[tmp], [%x[addr]]"
> +                       : [tmp] "=&r" (tmp)
> +                       : [addr] "r"(addr)
> +                       : "memory");
> +       else if (memorder == __ATOMIC_RELAXED)
> +               asm volatile("ldxr %x[tmp], [%x[addr]]"
> +                       : [tmp] "=&r" (tmp)
> +                       : [addr] "r"(addr)
> +                       : "memory");
> +       return tmp;
> +}
> +
> +#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder)
> +{
> +       if (__atomic_load_n(addr, memorder) != expected) {
> +               rte_sevl();
> +               do {
> +                       rte_wfe();


We are in the RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED case here, so rte_wfe() is
always asm volatile("wfe" : : : "memory"); the rte_pause() fallback is
never used on this path.


> +               } while (rte_atomic_load_ex_16(addr, memorder) != expected);
> +       }
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder)
> +{
> +       if (__atomic_load_n(addr, memorder) != expected) {
> +               rte_sevl();
> +               do {
> +                       rte_wfe();
> +               } while (__atomic_load_n(addr, memorder) != expected);
> +       }
> +}

The while() condition should use an exclusive load (rte_atomic_load_ex_32),
as in the 16-bit variant above.
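
A rough sketch of what the corrected 32-bit loop could look like, assuming the aarch64 path is selected by RTE_ARM_USE_WFE as in the patch. The name wait_until_equal_32() is hypothetical and the exclusive load is written inline to keep the sketch self-contained; on other builds it falls back to a plain polling loop.

```c
#include <assert.h>
#include <stdint.h>

static inline void
wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	uint32_t cur;

	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);

	/* Fast path: nothing to wait for. */
	if (__atomic_load_n(addr, memorder) == expected)
		return;

#if defined(__aarch64__) && defined(RTE_ARM_USE_WFE)
	/* SEVL sets the local event so the first WFE falls through. */
	__asm__ __volatile__("sevl" : : : "memory");
	do {
		__asm__ __volatile__("wfe" : : : "memory");
		/* The exclusive load arms the monitor; a write to *addr
		 * clears it and generates the event that wakes the next WFE.
		 */
		if (memorder == __ATOMIC_ACQUIRE)
			__asm__ __volatile__("ldaxr %w[cur], [%x[addr]]"
				: [cur] "=&r" (cur)
				: [addr] "r" (addr)
				: "memory");
		else
			__asm__ __volatile__("ldxr %w[cur], [%x[addr]]"
				: [cur] "=&r" (cur)
				: [addr] "r" (addr)
				: "memory");
	} while (cur != expected);
#else
	/* Generic path: plain polling (DPDK calls rte_pause() between polls). */
	do {
		cur = __atomic_load_n(addr, memorder);
	} while (cur != expected);
#endif
}
```

With a plain __atomic_load_n() in the loop condition the monitor is never armed, so the WFE would have no write to *addr to wake it.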


I will submit a v12 with those comments addressed so that we can move
forward for rc2.
But it won't make it into rc1, sorry.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v11 2/5] eal: add the APIs to wait until equal
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 2/5] eal: add the APIs to wait until equal Gavin Hu
  2019-10-27 20:49   ` David Marchand
@ 2019-10-27 22:19   ` Ananyev, Konstantin
  2019-10-28  5:04     ` Gavin Hu (Arm Technology China)
  1 sibling, 1 reply; 163+ messages in thread
From: Ananyev, Konstantin @ 2019-10-27 22:19 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, david.marchand, thomas, stephen, hemant.agrawal, jerinj,
	pbhagavatula, Honnappa.Nagarahalli, ruifeng.wang, phil.yang,
	steve.capper


> The rte_wait_until_equal_xx APIs abstract the functionality of
> 'polling for a memory location to become equal to a given value'.
> 
> Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> by default. When it is enabled, the above APIs will call WFE instruction
> to save CPU cycles and power.
> 
> From a VM, when calling this API on aarch64, it may trap in and out to
> release vCPUs whereas cause high exit latency. Since kernel 4.18.20 an
> adaptive trapping mechanism is introduced to balance the latency and
> workload.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
>  config/arm/meson.build                             |   1 +
>  config/common_base                                 |   5 +
>  .../common/include/arch/arm/rte_pause_64.h         | 188 +++++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  |  99 +++++++++++
>  4 files changed, 293 insertions(+)
> 
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..b4b4cac 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -26,6 +26,7 @@ flags_common_default = [
>  	['RTE_LIBRTE_AVP_PMD', false],
> 
>  	['RTE_SCHED_VECTOR', false],
> +	['RTE_ARM_USE_WFE', false],
>  ]
> 
>  flags_generic = [
> diff --git a/config/common_base b/config/common_base
> index e843a21..c812156 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
>  CONFIG_RTE_MALLOC_DEBUG=n
>  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
>  CONFIG_RTE_USE_LIBBSD=n
> +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> +# calling these APIs put the cores in low power state while waiting
> +# for the memory address to become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_ARM_USE_WFE=n
> 
>  #
>  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..1680d7a 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
> 
>  #ifndef _RTE_PAUSE_ARM64_H_
> @@ -10,6 +11,11 @@ extern "C" {
>  #endif
> 
>  #include <rte_common.h>
> +
> +#ifdef RTE_ARM_USE_WFE
> +#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +#endif
> +
>  #include "generic/rte_pause.h"
> 
>  static inline void rte_pause(void)
> @@ -17,6 +23,188 @@ static inline void rte_pause(void)
>  	asm volatile("yield" ::: "memory");
>  }
> 
> +/**
> + * Send an event to quit WFE.
> + */
> +static inline void rte_sevl(void);
> +
> +/**
> + * Put processor into low power WFE(Wait For Event) state
> + */
> +static inline void rte_wfe(void);
> +
> +#ifdef RTE_ARM_USE_WFE
> +static inline void rte_sevl(void)
> +{
> +	asm volatile("sevl" : : : "memory");
> +}
> +
> +static inline void rte_wfe(void)
> +{
> +	asm volatile("wfe" : : : "memory");
> +}
> +#else
> +static inline void rte_sevl(void)
> +{
> +}
> +static inline void rte_wfe(void)
> +{
> +	rte_pause();
> +}
> +#endif
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic exclusive load from addr, it returns the 16-bit content of *addr
> + * while making it 'monitored',when it is written by someone else, the
> + * 'monitored' state is cleared and a event is generated implicitly to exit
> + * WFE.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint16_t
> +rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic exclusive load from addr, it returns the 32-bit content of *addr
> + * while making it 'monitored',when it is written by someone else, the
> + * 'monitored' state is cleared and a event is generated implicitly to exit
> + * WFE.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint32_t
> +rte_atomic_load_ex_32(volatile uint32_t *addr, int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Atomic exclusive load from addr, it returns the 64-bit content of *addr
> + * while making it 'monitored',when it is written by someone else, the
> + * 'monitored' state is cleared and a event is generated implicitly to exit
> + * WFE.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param memorder
> + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> + *  These map to C++11 memory orders with the same names, see the C++11 standard
> + *  the GCC wiki on atomic synchronization for detailed definitions.
> + */
> +static __rte_always_inline uint64_t
> +rte_atomic_load_ex_64(volatile uint64_t *addr, int memorder);
> +
> +static __rte_always_inline uint16_t
> +rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> +{
> +	uint16_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	if (memorder == __ATOMIC_ACQUIRE)
> +		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	else if (memorder == __ATOMIC_RELAXED)
> +		asm volatile("ldxrh %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	return tmp;
> +}
> +
> +static __rte_always_inline uint32_t
> +rte_atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> +{
> +	uint32_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	if (memorder == __ATOMIC_ACQUIRE)
> +		asm volatile("ldaxr %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	else if (memorder == __ATOMIC_RELAXED)
> +		asm volatile("ldxr %w[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	return tmp;
> +}
> +
> +static __rte_always_inline uint64_t
> +rte_atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> +{
> +	uint64_t tmp;
> +	assert((memorder == __ATOMIC_ACQUIRE)
> +			|| (memorder == __ATOMIC_RELAXED));
> +	if (memorder == __ATOMIC_ACQUIRE)
> +		asm volatile("ldaxr %x[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	else if (memorder == __ATOMIC_RELAXED)
> +		asm volatile("ldxr %x[tmp], [%x[addr]]"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "r"(addr)
> +			: "memory");
> +	return tmp;
> +}
> +
> +#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_n(addr, memorder) != expected) {
> +		rte_sevl();
> +		do {
> +			rte_wfe();
> +		} while (rte_atomic_load_ex_16(addr, memorder) != expected);
> +	}
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_n(addr, memorder) != expected) {
> +		rte_sevl();
> +		do {
> +			rte_wfe();
> +		} while (__atomic_load_n(addr, memorder) != expected);

Here and in the _64 variant, shouldn't this be
rte_atomic_load_ex_32 / rte_atomic_load_ex_64, as in the _16 variant?

> +	}
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> +int memorder)
> +{
> +	if (__atomic_load_n(addr, memorder) != expected) {
> +		rte_sevl();
> +		do {
> +			rte_wfe();
> +		} while (__atomic_load_n(addr, memorder) != expected);
> +	}
> +}
> +#endif
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> index 52bd4db..9d42e32 100644
> --- a/lib/librte_eal/common/include/generic/rte_pause.h
> +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2017 Cavium, Inc
> + * Copyright(c) 2019 Arm Limited
>   */
> 
>  #ifndef _RTE_PAUSE_H_
> @@ -12,6 +13,12 @@
>   *
>   */
> 
> +#include <stdint.h>
> +#include <rte_common.h>
> +#include <rte_atomic.h>
> +#include <rte_compat.h>
> +#include <assert.h>
> +
>  /**
>   * Pause CPU execution for a short while
>   *
> @@ -20,4 +27,96 @@
>   */
>  static inline void rte_pause(void);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 16-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 32-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> + * memory ordering model meaning the loads around this API can be reordered.
> + *
> + * @param addr
> + *  A pointer to the memory location.
> + * @param expected
> + *  A 64-bit expected value to be in the memory location.
> + * @param memorder
> + *  Two different memory orders that can be specified:
> + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> + *  C++11 memory orders with the same names, see the C++11 standard or
> + *  the GCC wiki on atomic synchronization for detailed definition.
> + */
> +__rte_experimental
> +static __rte_always_inline void
> +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> +int memorder);
> +
> +#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> +static __rte_always_inline void
> +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> +int memorder)
> +{
> +	while (__atomic_load_n(addr, memorder) != expected)
> +		rte_pause();
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> +int memorder)
> +{
> +	while (__atomic_load_n(addr, memorder) != expected)
> +		rte_pause();
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> +int memorder)
> +{
> +	while (__atomic_load_n(addr, memorder) != expected)
> +		rte_pause();
> +}
> +#endif
> +
>  #endif /* _RTE_PAUSE_H_ */
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [dpdk-dev] [PATCH v11 2/5] eal: add the APIs to wait until equal
  2019-10-27 22:19   ` Ananyev, Konstantin
@ 2019-10-28  5:04     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-28  5:04 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: nd, david.marchand, thomas, stephen, hemant.agrawal, jerinj,
	pbhagavatula, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd

Hi Konstantin,

> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Monday, October 28, 2019 6:20 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; david.marchand@redhat.com; thomas@monjalon.net;
> stephen@networkplumber.org; hemant.agrawal@nxp.com;
> jerinj@marvell.com; pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: RE: [PATCH v11 2/5] eal: add the APIs to wait until equal
> 
> 
> > The rte_wait_until_equal_xx APIs abstract the functionality of
> > 'polling for a memory location to become equal to a given value'.
> >
> > Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> > by default. When it is enabled, the above APIs will call WFE instruction
> > to save CPU cycles and power.
> >
> > From a VM, when calling this API on aarch64, it may trap in and out to
> > release vCPUs whereas cause high exit latency. Since kernel 4.18.20 an
> > adaptive trapping mechanism is introduced to balance the latency and
> > workload.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > Acked-by: Jerin Jacob <jerinj@marvell.com>
> > ---
> >  config/arm/meson.build                             |   1 +
> >  config/common_base                                 |   5 +
> >  .../common/include/arch/arm/rte_pause_64.h         | 188 +++++++++++++++++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  |  99 +++++++++++
> >  4 files changed, 293 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 979018e..b4b4cac 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -26,6 +26,7 @@ flags_common_default = [
> >  	['RTE_LIBRTE_AVP_PMD', false],
> >
> >  	['RTE_SCHED_VECTOR', false],
> > +	['RTE_ARM_USE_WFE', false],
> >  ]
> >
> >  flags_generic = [
> > diff --git a/config/common_base b/config/common_base
> > index e843a21..c812156 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -111,6 +111,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
> >  CONFIG_RTE_MALLOC_DEBUG=n
> >  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
> >  CONFIG_RTE_USE_LIBBSD=n
> > +# Use WFE instructions to implement the rte_wait_for_equal_xxx APIs,
> > +# calling these APIs put the cores in low power state while waiting
> > +# for the memory address to become equal to the expected value.
> > +# This is supported only by aarch64.
> > +CONFIG_RTE_ARM_USE_WFE=n
> >
> >  #
> >  # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..1680d7a 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_ARM64_H_
> > @@ -10,6 +11,11 @@ extern "C" {
> >  #endif
> >
> >  #include <rte_common.h>
> > +
> > +#ifdef RTE_ARM_USE_WFE
> > +#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +#endif
> > +
> >  #include "generic/rte_pause.h"
> >
> >  static inline void rte_pause(void)
> > @@ -17,6 +23,188 @@ static inline void rte_pause(void)
> >  	asm volatile("yield" ::: "memory");
> >  }
> >
> > +/**
> > + * Send an event to quit WFE.
> > + */
> > +static inline void rte_sevl(void);
> > +
> > +/**
> > + * Put processor into low power WFE(Wait For Event) state
> > + */
> > +static inline void rte_wfe(void);
> > +
> > +#ifdef RTE_ARM_USE_WFE
> > +static inline void rte_sevl(void)
> > +{
> > +	asm volatile("sevl" : : : "memory");
> > +}
> > +
> > +static inline void rte_wfe(void)
> > +{
> > +	asm volatile("wfe" : : : "memory");
> > +}
> > +#else
> > +static inline void rte_sevl(void)
> > +{
> > +}
> > +static inline void rte_wfe(void)
> > +{
> > +	rte_pause();
> > +}
> > +#endif
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic exclusive load from addr, it returns the 16-bit content of *addr
> > + * while making it 'monitored',when it is written by someone else, the
> > + * 'monitored' state is cleared and a event is generated implicitly to exit
> > + * WFE.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint16_t
> > +rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic exclusive load from addr, it returns the 32-bit content of *addr
> > + * while making it 'monitored',when it is written by someone else, the
> > + * 'monitored' state is cleared and a event is generated implicitly to exit
> > + * WFE.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint32_t
> > +rte_atomic_load_ex_32(volatile uint32_t *addr, int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Atomic exclusive load from addr, it returns the 64-bit content of *addr
> > + * while making it 'monitored',when it is written by someone else, the
> > + * 'monitored' state is cleared and a event is generated implicitly to exit
> > + * WFE.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint64_t
> > +rte_atomic_load_ex_64(volatile uint64_t *addr, int memorder);
> > +
> > +static __rte_always_inline uint16_t
> > +rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> > +{
> > +	uint16_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	if (memorder == __ATOMIC_ACQUIRE)
> > +		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	else if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile("ldxrh %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint32_t
> > +rte_atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> > +{
> > +	uint32_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	if (memorder == __ATOMIC_ACQUIRE)
> > +		asm volatile("ldaxr %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	else if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile("ldxr %w[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	return tmp;
> > +}
> > +
> > +static __rte_always_inline uint64_t
> > +rte_atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> > +{
> > +	uint64_t tmp;
> > +	assert((memorder == __ATOMIC_ACQUIRE)
> > +			|| (memorder == __ATOMIC_RELAXED));
> > +	if (memorder == __ATOMIC_ACQUIRE)
> > +		asm volatile("ldaxr %x[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	else if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile("ldxr %x[tmp], [%x[addr]]"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "r"(addr)
> > +			: "memory");
> > +	return tmp;
> > +}
> > +
> > +#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +int memorder)
> > +{
> > +	if (__atomic_load_n(addr, memorder) != expected) {
> > +		rte_sevl();
> > +		do {
> > +			rte_wfe();
> > +		} while (rte_atomic_load_ex_16(addr, memorder) != expected);
> > +	}
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > +int memorder)
> > +{
> > +	if (__atomic_load_n(addr, memorder) != expected) {
> > +		rte_sevl();
> > +		do {
> > +			rte_wfe();
> > +		} while (__atomic_load_n(addr, memorder) != expected);
> 
> Here and in _64, shouldn't it be:
> rte_atomic_load_ex_..
Thanks for spotting this error; David also spotted it. Sorry about that.
> 
> > +	}
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> > +int memorder)
> > +{
> > +	if (__atomic_load_n(addr, memorder) != expected) {
> > +		rte_sevl();
> > +		do {
> > +			rte_wfe();
> > +		} while (__atomic_load_n(addr, memorder) != expected);
> > +	}
> > +}
> > +#endif
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..9d42e32 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2017 Cavium, Inc
> > + * Copyright(c) 2019 Arm Limited
> >   */
> >
> >  #ifndef _RTE_PAUSE_H_
> > @@ -12,6 +13,12 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +#include <rte_compat.h>
> > +#include <assert.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +27,96 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > +int memorder);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> > + *
> > + * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
> > + * memory ordering model meaning the loads around this API can be
> reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param expected
> > + *  A 64-bit expected value to be in the memory location.
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void
> > +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> > +int memorder);
> > +
> > +#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +int memorder)
> > +{
> > +	while (__atomic_load_n(addr, memorder) != expected)
> > +		rte_pause();
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > +int memorder)
> > +{
> > +	while (__atomic_load_n(addr, memorder) != expected)
> > +		rte_pause();
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
> > +int memorder)
> > +{
> > +	while (__atomic_load_n(addr, memorder) != expected)
> > +		rte_pause();
> > +}
> > +#endif
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4


* Re: [dpdk-dev] [PATCH v11 2/5] eal: add the APIs to wait until equal
  2019-10-27 20:49   ` David Marchand
@ 2019-10-28  5:08     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-10-28  5:08 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, nd, Ananyev, Konstantin, thomas, Stephen Hemminger,
	hemant.agrawal, jerinj, Pavan Nikhilesh, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China),
	Steve Capper, nd

Hi David,
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Monday, October 28, 2019 4:50 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev <dev@dpdk.org>; nd <nd@arm.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; thomas@monjalon.net; Stephen
> Hemminger <stephen@networkplumber.org>; hemant.agrawal@nxp.com;
> jerinj@marvell.com; Pavan Nikhilesh <pbhagavatula@marvell.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> (Arm Technology China) <Ruifeng.Wang@arm.com>; Phil Yang (Arm
> Technology China) <Phil.Yang@arm.com>; Steve Capper
> <Steve.Capper@arm.com>
> Subject: Re: [PATCH v11 2/5] eal: add the APIs to wait until equal
> 
> On Sun, Oct 27, 2019 at 1:53 PM Gavin Hu <gavin.hu@arm.com> wrote:
> 
> [snip]
> 
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..1680d7a 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> 
> [snip]
> 
> > @@ -17,6 +23,188 @@ static inline void rte_pause(void)
> >         asm volatile("yield" ::: "memory");
> >  }
> >
> > +/**
> > + * Send an event to quit WFE.
> > + */
> > +static inline void rte_sevl(void);
> > +
> > +/**
> > + * Put processor into low power WFE(Wait For Event) state
> > + */
> > +static inline void rte_wfe(void);
> > +
> > +#ifdef RTE_ARM_USE_WFE
> > +static inline void rte_sevl(void)
> > +{
> > +       asm volatile("sevl" : : : "memory");
> > +}
> > +
> > +static inline void rte_wfe(void)
> > +{
> > +       asm volatile("wfe" : : : "memory");
> > +}
> > +#else
> > +static inline void rte_sevl(void)
> > +{
> > +}
> > +static inline void rte_wfe(void)
> > +{
> > +       rte_pause();
> > +}
> > +#endif
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> notice
> 
> experimental?
> Just complaining on the principle, you missed the __rte_experimental
> in such a case.
> But this API is a no go for me, see below.
Got it, thanks!
> 
> > + *
> > + * Atomic exclusive load from addr, it returns the 16-bit content of *addr
> > + * while making it 'monitored',when it is written by someone else, the
> > + * 'monitored' state is cleared and a event is generated implicitly to exit
> > + * WFE.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param memorder
> > + *  The valid memory order variants are __ATOMIC_ACQUIRE and
> __ATOMIC_RELAXED.
> > + *  These map to C++11 memory orders with the same names, see the
> C++11 standard
> > + *  the GCC wiki on atomic synchronization for detailed definitions.
> > + */
> > +static __rte_always_inline uint16_t
> > +rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder);
> 
> This API does not make sense for anything but arm, so this prefix is not good.
Yes, we can change it back to __atomic_load_ex_16?
> 
> On arm, when RTE_ARM_USE_WFE is undefined, why would you need it?
> A non exclusive load is enough since you don't want to use wfe.
We can move it inside #ifdef RTE_ARM_USE_WFE .. #endif.
> [snip]
> 
> > +
> > +static __rte_always_inline uint16_t
> > +rte_atomic_load_ex_16(volatile uint16_t *addr, int memorder)
> > +{
> > +       uint16_t tmp;
> > +       assert((memorder == __ATOMIC_ACQUIRE)
> > +                       || (memorder == __ATOMIC_RELAXED));
> > +       if (memorder == __ATOMIC_ACQUIRE)
> > +               asm volatile("ldaxrh %w[tmp], [%x[addr]]"
> > +                       : [tmp] "=&r" (tmp)
> > +                       : [addr] "r"(addr)
> > +                       : "memory");
> > +       else if (memorder == __ATOMIC_RELAXED)
> > +               asm volatile("ldxrh %w[tmp], [%x[addr]]"
> > +                       : [tmp] "=&r" (tmp)
> > +                       : [addr] "r"(addr)
> > +                       : "memory");
> > +       return tmp;
> > +}
> > +
> > +static __rte_always_inline uint32_t
> > +rte_atomic_load_ex_32(volatile uint32_t *addr, int memorder)
> > +{
> > +       uint32_t tmp;
> > +       assert((memorder == __ATOMIC_ACQUIRE)
> > +                       || (memorder == __ATOMIC_RELAXED));
> > +       if (memorder == __ATOMIC_ACQUIRE)
> > +               asm volatile("ldaxr %w[tmp], [%x[addr]]"
> > +                       : [tmp] "=&r" (tmp)
> > +                       : [addr] "r"(addr)
> > +                       : "memory");
> > +       else if (memorder == __ATOMIC_RELAXED)
> > +               asm volatile("ldxr %w[tmp], [%x[addr]]"
> > +                       : [tmp] "=&r" (tmp)
> > +                       : [addr] "r"(addr)
> > +                       : "memory");
> > +       return tmp;
> > +}
> > +
> > +static __rte_always_inline uint64_t
> > +rte_atomic_load_ex_64(volatile uint64_t *addr, int memorder)
> > +{
> > +       uint64_t tmp;
> > +       assert((memorder == __ATOMIC_ACQUIRE)
> > +                       || (memorder == __ATOMIC_RELAXED));
> > +       if (memorder == __ATOMIC_ACQUIRE)
> > +               asm volatile("ldaxr %x[tmp], [%x[addr]]"
> > +                       : [tmp] "=&r" (tmp)
> > +                       : [addr] "r"(addr)
> > +                       : "memory");
> > +       else if (memorder == __ATOMIC_RELAXED)
> > +               asm volatile("ldxr %x[tmp], [%x[addr]]"
> > +                       : [tmp] "=&r" (tmp)
> > +                       : [addr] "r"(addr)
> > +                       : "memory");
> > +       return tmp;
> > +}
> > +
> > +#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +int memorder)
> > +{
> > +       if (__atomic_load_n(addr, memorder) != expected) {
> > +               rte_sevl();
> > +               do {
> > +                       rte_wfe();
> 
> 
> We are in the RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED case.
> rte_wfe() is always asm volatile("wfe" : : : "memory");
> 
> 
> > +               } while (rte_atomic_load_ex_16(addr, memorder) != expected);
> > +       }
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
> > +int memorder)
> > +{
> > +       if (__atomic_load_n(addr, memorder) != expected) {
> > +               rte_sevl();
> > +               do {
> > +                       rte_wfe();
> > +               } while (__atomic_load_n(addr, memorder) != expected);
> > +       }
> > +}
> 
> The while() should be with an exclusive load.
Sorry for this obvious error.
> 
> 
> I will submit a v12 with those comments addressed so that we move
> forward for rc2.
> But it won't make it in rc1, sorry.
I will do it if you prefer, otherwise thanks!
> 
> 
> --
> David Marchand


* [dpdk-dev] [PATCH v12 0/5] use WFE for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (74 preceding siblings ...)
  2019-10-27 12:52 ` [dpdk-dev] [PATCH v11 5/5] event/opdl: " Gavin Hu
@ 2019-11-04 15:32 ` Gavin Hu
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 1/5] bus/fslmc: fix the conflicting dmb function Gavin Hu
                   ` (5 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-11-04 15:32 UTC (permalink / raw)
  To: dev; +Cc: nd, david.marchand, konstantin.ananyev

V12:
- remove the 'rte_' prefix from the arm specific functions (David Marchand)
- use the __atomic_load_ex_xx functions in arm specific implementations of the APIs (David Marchand)
- remove the experimental warnings (David Marchand)
- tweak the macros working scope (David Marchand)
V11:
- add rte_ prefix to the __atomic_load_ex_x functions (Ananyev Konstantin)
- define the above rte_atomic_load_ex_x functions even if not RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED for future non-wfe usages (Ananyev Konstantin)
- use the above functions for arm specific rte_wait_until_equal_x functions (Ananyev Konstantin)
- simplify the generic implementation by folding the "if" into the "while" (Ananyev Konstantin)
V10:
- move arm specific stuff to arch/arm/rte_pause_64.h(Ananyev Konstantin)
V9:
- fix a weblink broken (David Marchand)
- define rte_wfe and rte_sev() (Ananyev Konstantin)
- explicitly define three function APIs instead of macros (Ananyev Konstantin)
- incorporate common rte_wfe and rte_sev into the generic rte_spinlock (David Marchand)
- define arch neutral RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED (Ananyev Konstantin)
- define rte_load_ex_16/32/64 functions to use load-exclusive instruction for aarch64, which is required for wake up of WFE
- drop the rte_spinlock patch from this series: it calls this experimental API and is included by many components, each of which would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the future, after the experimental tag is removed.
V8:
- simplify dmb definition to use io barriers (David Marchand)
- define wfe() and sev() macros and use them inside normal C code (Ananyev Konstantin)
- pass memorder as parameter, not to incorporate it into function name, less functions, similar to C11 atomic intrinsics (Ananyev Konstantin)
- remove mandating RTE_FORCE_INTRINSICS in arm spinlock implementation(David Marchand)
- undef __WAIT_UNTIL_EQUAL after use (David Marchand)
- add experimental tag and warning (David Marchand)
- add the limitation of using WFE instruction in the commit log (David Marchand)
- tweak the use of RTE_FORCE_INTRINSICS (still mandatory for aarch64) and RTE_ARM_USE_WFE for spinlock (David Marchand)
- drop the rte_ring patch from this series: rte_ring.h calls this API and is included by many components, each of which would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the future, after the experimental tag is removed.
V7:
- fix the checkpatch LONG_LINE_COMMENT issue
V6:
- squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
- move the new configuration to the end of EAL
- add doxygen comments to reflect the relaxed and acquire semantics
- correct the meson configuration 
V5:
- add doxygen comments for the new APIs
- spinlock exits early without wfe if the spinlock is not taken by others.
- add two patches on top for opdl and thunderx
V4:
- rename the config to CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for the assembly skeleton to reduce code duplication
- add one patch for nxp fslmc to address a compiling error
V3:
- Convert RFCs to patches
V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline 
V1:
- Add the new APIs and use it for ring and locks



Gavin Hu (5):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |   8 +-
 drivers/event/opdl/Makefile                        |   1 +
 drivers/event/opdl/meson.build                     |   1 +
 drivers/event/opdl/opdl_ring.c                     |   5 +-
 drivers/net/thunderx/Makefile                      |   1 +
 drivers/net/thunderx/meson.build                   |   1 +
 drivers/net/thunderx/nicvf_rxtx.c                  |   3 +-
 .../common/include/arch/arm/rte_pause_64.h         | 165 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  99 +++++++++++++
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 12 files changed, 281 insertions(+), 12 deletions(-)

-- 
2.7.4


* [dpdk-dev] [PATCH v12 1/5] bus/fslmc: fix the conflicting dmb function
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (75 preceding siblings ...)
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 0/5] use WFE for aarch64 Gavin Hu
@ 2019-11-04 15:32 ` Gavin Hu
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 2/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (4 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-11-04 15:32 UTC (permalink / raw)
  To: dev; +Cc: nd, david.marchand, konstantin.ananyev, stable

Two definitions of dmb conflict with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to reuse the EAL definition to avoid conflicts.

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..68ce38b 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -31,12 +31,10 @@ struct fsl_mc_io {
 #include <errno.h>
 #include <sys/uio.h>
 #include <linux/byteorder/little_endian.h>
+#include <rte_atomic.h>
 
-#ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
-#endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#define __iormb()	rte_io_rmb()
+#define __iowmb()	rte_io_wmb()
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
-- 
2.7.4


* [dpdk-dev] [PATCH v12 2/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (76 preceding siblings ...)
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 1/5] bus/fslmc: fix the conflicting dmb function Gavin Hu
@ 2019-11-04 15:32 ` Gavin Hu
  2019-11-07 15:03   ` Ananyev, Konstantin
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 3/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (3 subsequent siblings)
  81 siblings, 1 reply; 163+ messages in thread
From: Gavin Hu @ 2019-11-04 15:32 UTC (permalink / raw)
  To: dev; +Cc: nd, david.marchand, konstantin.ananyev

The rte_wait_until_equal_xx APIs abstract the functionality of
'polling for a memory location to become equal to a given value'.

Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
by default. When it is enabled, the above APIs will call WFE instruction
to save CPU cycles and power.

When this API is called from a VM on aarch64, WFE may trap in and out to
release vCPUs, which causes high exit latency. Since kernel 4.18.20, an
adaptive trapping mechanism has been available to balance the latency
and workload.
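
The polling abstraction described above can be sketched, in its generic
(non-WFE) form, roughly as follows. This is a minimal illustration, not the
code of this patch: all sketch_ names are hypothetical, and sketch_pause()
stands in for rte_pause() while also simulating the other core's store, so
that the example runs single-threaded.

```c
#include <stdint.h>

static uint32_t sketch_flag;      /* memory location being polled */
static unsigned int sketch_polls; /* number of failed polls so far */

/*
 * Hypothetical stand-in for rte_pause(). To keep the sketch single-threaded
 * it also plays the role of the other core: on the third failed poll it
 * publishes the expected value with release semantics.
 */
static inline void sketch_pause(void)
{
	if (++sketch_polls == 3)
		__atomic_store_n(&sketch_flag, 1, __ATOMIC_RELEASE);
}

/* Generic shape: load with the requested memory order, pause between polls. */
static inline void
sketch_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		sketch_pause();
}

/* Returns the number of failed polls before the value was observed. */
static unsigned int sketch_demo(void)
{
	sketch_flag = 0;
	sketch_polls = 0;
	sketch_wait_until_equal_32(&sketch_flag, 1, __ATOMIC_ACQUIRE);
	return sketch_polls;
}
```

With __ATOMIC_ACQUIRE, loads issued after the wait cannot be reordered before
the final successful poll; with __ATOMIC_RELAXED no such ordering is implied.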

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 .../common/include/arch/arm/rte_pause_64.h         | 165 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  99 +++++++++++++
 4 files changed, 270 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index d9f9811..1981c72 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -26,6 +26,7 @@ flags_common_default = [
 	['RTE_LIBRTE_AVP_PMD', false],
 
 	['RTE_SCHED_VECTOR', false],
+	['RTE_ARM_USE_WFE', false],
 ]
 
 flags_generic = [
diff --git a/config/common_base b/config/common_base
index b2be3d9..ee7e390 100644
--- a/config/common_base
+++ b/config/common_base
@@ -110,6 +110,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
+# calling these APIs puts the cores in low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
 
 #
 # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..accdb4c 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -10,6 +11,11 @@ extern "C" {
 #endif
 
 #include <rte_common.h>
+
+#ifdef RTE_ARM_USE_WFE
+#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+#endif
+
 #include "generic/rte_pause.h"
 
 static inline void rte_pause(void)
@@ -17,6 +23,165 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+/**
+ * Send an event to quit WFE.
+ */
+static inline void _sevl(void)
+{
+	asm volatile("sevl" : : : "memory");
+}
+
+/**
+ * Put processor into low power WFE(Wait For Event) state
+ */
+static inline void _wfe(void)
+{
+	asm volatile("wfe" : : : "memory");
+}
+
+/**
+ * Atomic exclusive load from addr. It returns the 16-bit content of *addr
+ * while making it 'monitored'. When it is written by someone else, the
+ * 'monitored' state is cleared and an event is generated implicitly to exit
+ * WFE.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint16_t
+__atomic_load_ex_16(volatile uint16_t *addr, int memorder);
+
+/**
+ * Atomic exclusive load from addr. It returns the 32-bit content of *addr
+ * while making it 'monitored'. When it is written by someone else, the
+ * 'monitored' state is cleared and an event is generated implicitly to exit
+ * WFE.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint32_t
+__atomic_load_ex_32(volatile uint32_t *addr, int memorder);
+
+/**
+ * Atomic exclusive load from addr. It returns the 64-bit content of *addr
+ * while making it 'monitored'. When it is written by someone else, the
+ * 'monitored' state is cleared and an event is generated implicitly to exit
+ * WFE.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param memorder
+ *  The valid memory order variants are __ATOMIC_ACQUIRE and __ATOMIC_RELAXED.
+ *  These map to C++11 memory orders with the same names, see the C++11 standard
+ *  or the GCC wiki on atomic synchronization for detailed definitions.
+ */
+static __rte_always_inline uint64_t
+__atomic_load_ex_64(volatile uint64_t *addr, int memorder);
+
+static __rte_always_inline uint16_t
+__atomic_load_ex_16(volatile uint16_t *addr, int memorder)
+{
+	uint16_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxrh %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxrh %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline uint32_t
+__atomic_load_ex_32(volatile uint32_t *addr, int memorder)
+{
+	uint32_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxr %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxr %w[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline uint64_t
+__atomic_load_ex_64(volatile uint64_t *addr, int memorder)
+{
+	uint64_t tmp;
+	assert((memorder == __ATOMIC_ACQUIRE)
+			|| (memorder == __ATOMIC_RELAXED));
+	if (memorder == __ATOMIC_ACQUIRE)
+		asm volatile("ldaxr %x[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	else if (memorder == __ATOMIC_RELAXED)
+		asm volatile("ldxr %x[tmp], [%x[addr]]"
+			: [tmp] "=&r" (tmp)
+			: [addr] "r"(addr)
+			: "memory");
+	return tmp;
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder)
+{
+	if (__atomic_load_ex_16(addr, memorder) != expected) {
+		_sevl();
+		do {
+			_wfe();
+		} while (__atomic_load_ex_16(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder)
+{
+	if (__atomic_load_ex_32(addr, memorder) != expected) {
+		_sevl();
+		do {
+			_wfe();
+		} while (__atomic_load_ex_32(addr, memorder) != expected);
+	}
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder)
+{
+	if (__atomic_load_ex_64(addr, memorder) != expected) {
+		_sevl();
+		do {
+			_wfe();
+		} while (__atomic_load_ex_64(addr, memorder) != expected);
+	}
+}
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..89e2084 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
@@ -12,6 +13,12 @@
  *
  */
 
+#include <stdint.h>
+#include <assert.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+#include <rte_compat.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +27,96 @@
  */
 static inline void rte_pause(void);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder);
+
+#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+int memorder)
+{
+	while (__atomic_load_n(addr, memorder) != expected)
+		rte_pause();
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+int memorder)
+{
+	while (__atomic_load_n(addr, memorder) != expected)
+		rte_pause();
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+int memorder)
+{
+	while (__atomic_load_n(addr, memorder) != expected)
+		rte_pause();
+}
+#endif
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4


* [dpdk-dev] [PATCH v12 3/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (77 preceding siblings ...)
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 2/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-11-04 15:32 ` Gavin Hu
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 4/5] net/thunderx: use new API to save cycles " Gavin Hu
                   ` (2 subsequent siblings)
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-11-04 15:32 UTC (permalink / raw)
  To: dev; +Cc: nd, david.marchand, konstantin.ananyev

While waiting on a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.
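
For illustration, the ticket lock pattern this change touches could look
roughly like the sketch below. It is an assumption-laden simplification, not
the DPDK implementation: the sketch_ names and the two-field struct are
hypothetical (rte_ticketlock_t uses a union), and the wait helper is the plain
polling fallback with a compiler barrier in place of rte_pause().

```c
#include <stdint.h>

/* Hypothetical, simplified lock layout used only for this sketch. */
struct sketch_ticketlock {
	uint16_t current; /* ticket currently being served */
	uint16_t next;    /* next ticket to hand out */
};

/* Polling fallback; a compiler barrier stands in for rte_pause(). */
static inline void
sketch_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		__asm__ volatile("" ::: "memory");
}

/* Take a ticket, then wait until that ticket is being served. */
static inline void sketch_ticketlock_lock(struct sketch_ticketlock *tl)
{
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	sketch_wait_until_equal_16(&tl->current, me, __ATOMIC_ACQUIRE);
}

/* Serve the next ticket; the release store pairs with the acquire wait. */
static inline void sketch_ticketlock_unlock(struct sketch_ticketlock *tl)
{
	uint16_t i = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(i + 1), __ATOMIC_RELEASE);
}
```

The acquire order on the wait is what keeps the critical section from being
reordered ahead of lock acquisition, which is why the call site passes
__ATOMIC_ACQUIRE rather than __ATOMIC_RELAXED.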

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1]
showed variance between runs, but no notable performance gain or
degradation was seen with or without this patch.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/docs/440/neoverse-n1-sdp

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..c295ae7 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.7.4


* [dpdk-dev] [PATCH v12 4/5] net/thunderx: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (78 preceding siblings ...)
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 3/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-11-04 15:32 ` " Gavin Hu
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 5/5] event/opdl: " Gavin Hu
  2019-11-07 21:35 ` [dpdk-dev] [PATCH v13 0/5] use WFE for aarch64 David Marchand
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-11-04 15:32 UTC (permalink / raw)
  To: dev; +Cc: nd, david.marchand, konstantin.ananyev

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.
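
As a rough illustration of the pattern being converted, an ordered tail
update can be sketched as below: a producer waits until the shared tail
reaches the start of its own range, then publishes the end of that range.
The sketch_ names and the one-field ring are hypothetical and are not taken
from the thunderx driver; the wait helper is the generic polling fallback.

```c
#include <stdint.h>

/* Hypothetical ring with only the shared tail index. */
struct sketch_ring {
	uint32_t tail;
};

/* Polling fallback; a compiler barrier stands in for rte_pause(). */
static inline void
sketch_poll_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		__asm__ volatile("" ::: "memory");
}

/*
 * Publish the range [start, end): wait for earlier producers to advance the
 * tail up to 'start', then make the new entries visible with a release store.
 */
static inline void
sketch_publish(struct sketch_ring *r, uint32_t start, uint32_t end)
{
	sketch_poll_until_equal_32(&r->tail, start, __ATOMIC_RELAXED);
	__atomic_store_n(&r->tail, end, __ATOMIC_RELEASE);
}
```

A relaxed wait is used here, as in the converted call site; the release store
is what orders the entry writes before the tail update.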

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/net/thunderx/Makefile     | 1 +
 drivers/net/thunderx/meson.build  | 1 +
 drivers/net/thunderx/nicvf_rxtx.c | 3 +--
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/thunderx/Makefile b/drivers/net/thunderx/Makefile
index e6bf497..9e0de10 100644
--- a/drivers/net/thunderx/Makefile
+++ b/drivers/net/thunderx/Makefile
@@ -10,6 +10,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_pmd_thunderx_nicvf.a
 
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 
 LDLIBS += -lm
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
diff --git a/drivers/net/thunderx/meson.build b/drivers/net/thunderx/meson.build
index 69819a9..23d9458 100644
--- a/drivers/net/thunderx/meson.build
+++ b/drivers/net/thunderx/meson.build
@@ -4,6 +4,7 @@
 subdir('base')
 objs = [base_objs]
 
+allow_experimental_apis = true
 sources = files('nicvf_rxtx.c',
 		'nicvf_ethdev.c',
 		'nicvf_svf.c'
diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
index 1c42874..90a6098 100644
--- a/drivers/net/thunderx/nicvf_rxtx.c
+++ b/drivers/net/thunderx/nicvf_rxtx.c
@@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
 		ltail++;
 	}
 
-	while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
-		rte_pause();
+	rte_wait_until_equal_32(&rbdr->tail, next_tail, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
 	nicvf_addr_write(door, to_fill);
-- 
2.7.4



* [dpdk-dev] [PATCH v12 5/5] event/opdl: use new API to save cycles on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (79 preceding siblings ...)
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 4/5] net/thunderx: use new API to save cycles " Gavin Hu
@ 2019-11-04 15:32 ` " Gavin Hu
  2019-11-07 21:35 ` [dpdk-dev] [PATCH v13 0/5] use WFE for aarch64 David Marchand
  81 siblings, 0 replies; 163+ messages in thread
From: Gavin Hu @ 2019-11-04 15:32 UTC (permalink / raw)
  To: dev; +Cc: nd, david.marchand, konstantin.ananyev

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/event/opdl/Makefile    | 1 +
 drivers/event/opdl/meson.build | 1 +
 drivers/event/opdl/opdl_ring.c | 5 ++---
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/event/opdl/Makefile b/drivers/event/opdl/Makefile
index bf50a60..72ef07d 100644
--- a/drivers/event/opdl/Makefile
+++ b/drivers/event/opdl/Makefile
@@ -9,6 +9,7 @@ LIB = librte_pmd_opdl_event.a
 # build flags
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 # for older GCC versions, allow us to initialize an event using
 # designated initializers.
 ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
diff --git a/drivers/event/opdl/meson.build b/drivers/event/opdl/meson.build
index 1fe034e..e67b164 100644
--- a/drivers/event/opdl/meson.build
+++ b/drivers/event/opdl/meson.build
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2018 Luca Boccassi <bluca@debian.org>
 
+allow_experimental_apis = true
 sources = files(
 	'opdl_evdev.c',
 	'opdl_evdev_init.c',
diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
index 06fb5b3..c8d19fe 100644
--- a/drivers/event/opdl/opdl_ring.c
+++ b/drivers/event/opdl/opdl_ring.c
@@ -16,6 +16,7 @@
 #include <rte_memcpy.h>
 #include <rte_memory.h>
 #include <rte_memzone.h>
+#include <rte_atomic.h>
 
 #include "opdl_ring.h"
 #include "opdl_log.h"
@@ -474,9 +475,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
 	/* If another thread started inputting before this one, but hasn't
 	 * finished, we need to wait for it to complete to update the tail.
 	 */
-	while (unlikely(__atomic_load_n(&s->shared.tail, __ATOMIC_ACQUIRE) !=
-			old_head))
-		rte_pause();
+	rte_wait_until_equal_32(&s->shared.tail, old_head, __ATOMIC_ACQUIRE);
 
 	__atomic_store_n(&s->shared.tail, old_head + num_entries,
 			__ATOMIC_RELEASE);
-- 
2.7.4



* Re: [dpdk-dev] [PATCH v12 2/5] eal: add the APIs to wait until equal
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 2/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-11-07 15:03   ` Ananyev, Konstantin
  0 siblings, 0 replies; 163+ messages in thread
From: Ananyev, Konstantin @ 2019-11-07 15:03 UTC (permalink / raw)
  To: Gavin Hu, dev; +Cc: nd, david.marchand



> The rte_wait_until_equal_xx APIs abstract the functionality of
> 'polling for a memory location to become equal to a given value'.
> 
> Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
> by default. When it is enabled, the above APIs will call WFE instruction
> to save CPU cycles and power.
> 
> From a VM, calling this API on aarch64 may trap in and out to release
> vCPUs, which causes high exit latency. Since kernel 4.18.20, an adaptive
> trapping mechanism has been introduced to balance the latency and
> workload.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.7.4



* [dpdk-dev] [PATCH v13 0/5] use WFE for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (80 preceding siblings ...)
  2019-11-04 15:32 ` [dpdk-dev] [PATCH v12 5/5] event/opdl: " Gavin Hu
@ 2019-11-07 21:35 ` David Marchand
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 1/5] bus/fslmc: fix the conflicting dmb function David Marchand
                     ` (4 more replies)
  81 siblings, 5 replies; 163+ messages in thread
From: David Marchand @ 2019-11-07 21:35 UTC (permalink / raw)
  To: dev; +Cc: nd, konstantin.ananyev

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
allows the CPU core to enter a low power state until woken up by an update
to the memory location being polled, thus reducing the cache and memory
transactions.
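
As a rough illustration of the WFE-based wait described above, here is a
minimal sketch modelled on the aarch64 code in patch 2/5 of this series. The
function name sketch_wfe_wait_equal_32 is hypothetical and not part of DPDK;
it is fixed to acquire ordering and falls back to plain polling on
non-aarch64 builds so that it compiles anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not part of the DPDK API. */
static inline void
sketch_wfe_wait_equal_32(volatile uint32_t *addr, uint32_t expected)
{
#if defined(__aarch64__)
	uint32_t value;

	/* Load-acquire exclusive: reads *addr and arms the exclusive monitor. */
	__asm__ volatile("ldaxr %w[tmp], [%x[addr]]"
		: [tmp] "=&r" (value) : [addr] "r" (addr) : "memory");
	if (value != expected) {
		/* Set the local event so that the first WFE falls through. */
		__asm__ volatile("sevl" ::: "memory");
		do {
			/* Wait for an event, e.g. the monitor being cleared. */
			__asm__ volatile("wfe" ::: "memory");
			__asm__ volatile("ldaxr %w[tmp], [%x[addr]]"
				: [tmp] "=&r" (value) : [addr] "r" (addr) : "memory");
		} while (value != expected);
	}
#else
	/* Portable fallback (assumption): plain acquire polling. */
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		;
#endif
}
```

The initial load before SEVL lets the uncontended case return without
executing WFE at all.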

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a while loop
with rte_pause, so there are no performance differences.
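
A minimal sketch of what such a wrapper could look like follows. It mirrors
the generic implementation in patch 2/5, but the names sketch_pause and
sketch_wait_until_equal_32 are hypothetical, and sketch_pause is only a
compiler barrier standing in for the architecture-specific rte_pause.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): a compiler barrier only. */
static inline void
sketch_pause(void)
{
	__asm__ volatile("" ::: "memory");
}

/* Poll *addr until it equals expected, using the given memory order. */
static inline void
sketch_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	/* Only acquire and relaxed loads make sense for a wait. */
	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);

	while (__atomic_load_n(addr, memorder) != expected)
		sketch_pause();
}
```

Because the body is the same loop callers already wrote by hand, switching
to the wrapper should not change behaviour on these platforms.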

For Arm platforms, use of WFE can be configured using the
CONFIG_RTE_ARM_USE_WFE option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake up
events (SEV) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_ARM_USE_WFE should be enabled depending on the performance on the
target platforms.

V13:
- added release notes update,
- reworked arm implementation to avoid exporting inlines,
- added assert in generic implementation,

V12:
- remove the 'rte_' prefix from the arm specific functions (David Marchand)
- use the __atomic_load_ex_xx functions in arm specific implementations of
  APIs (David Marchand)
- remove the experimental warnings (David Marchand)
- tweak the macros working scope (David Marchand)

V11:
- add rte_ prefix to the __atomic_load_ex_x functions (Ananyev Konstantin)
- define the above rte_atomic_load_ex_x functions even if not
  RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED for future non-wfe usages (Ananyev
  Konstantin)
- use the above functions for arm specific rte_wait_until_equal_x functions
  (Ananyev Konstantin)
- simplify the generic implementation by immersing "if" into "while"
  (Ananyev Konstantin)

V10:
- move arm specific stuff to arch/arm/rte_pause_64.h (Ananyev Konstantin)

V9:
- fix a broken weblink (David Marchand)
- define rte_wfe and rte_sev() (Ananyev Konstantin)
- explicitly define three function APIs instead of macros (Ananyev Konstantin)
- incorporate common rte_wfe and rte_sev into the generic rte_spinlock (David
  Marchand)
- define arch neutral RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED (Ananyev Konstantin)
- define rte_load_ex_16/32/64 functions to use load-exclusive instruction for
  aarch64, which is required for wake up of WFE
- drop the rte_spinlock patch from this series, as it calls this
  experimental API and is widely included by many components, each of which
  would require ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave
  it for the future, after the experimental tag is removed.

V8:
- simplify dmb definition to use io barriers (David Marchand)
- define wfe() and sev() macros and use them inside normal C code (Ananyev
  Konstantin)
- pass memorder as parameter, not to incorporate it into function name, less
  functions, similar to C11 atomic intrinsics (Ananyev Konstantin)
- remove mandating RTE_FORCE_INTRINSICS in arm spinlock implementation (David
  Marchand)
- undef __WAIT_UNTIL_EQUAL after use (David Marchand)
- add experimental tag and warning (David Marchand)
- add the limitation of using WFE instruction in the commit log (David
  Marchand) 
- tweak the use of RTE_FORCE_INTRINSICS (still mandatory for aarch64) and
  RTE_ARM_USE_WFE for spinlock (David Marchand)
- drop the rte_ring patch from this series, as rte_ring.h calls this API
  and is widely included by many components, each of which would require
  ALLOW_EXPERIMENTAL_API in its Makefile and meson.build; leave it for the
  future, after the experimental tag is removed.

V7:
- fix the checkpatch LONG_LINE_COMMENT issue

V6:
- squash the RTE_ARM_USE_WFE configuration entry patch into the new API patch
- move the new configuration to the end of EAL
- add doxygen comments to reflect the relaxed and acquire semantics
- correct the meson configuration 

V5:
- add doxygen comments for the new APIs
- spinlock early exit without wfe if the spinlock not taken by others.
- add two patches on top for opdl and thunderx

V4:
- rename the config as CONFIG_RTE_ARM_USE_WFE to indicate it applies to arm only
- introduce a macro for the assembly skeleton to reduce the duplication of code
- add one patch for nxp fslmc to address a compiling error

V3:
- Convert RFCs to patches

V2:
- Use inline functions instead of macros
- Add load and compare in the beginning of the APIs
- Fix some style errors in asm inline 

V1:
- Add the new APIs and use it for ring and locks

Gavin Hu (5):
  bus/fslmc: fix the conflicting dmb function
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  net/thunderx: use new API to save cycles on aarch64
  event/opdl: use new API to save cycles on aarch64

 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 doc/guides/rel_notes/release_19_11.rst             |   5 +
 drivers/bus/fslmc/mc/fsl_mc_sys.h                  |   9 +-
 drivers/event/opdl/Makefile                        |   1 +
 drivers/event/opdl/meson.build                     |   1 +
 drivers/event/opdl/opdl_ring.c                     |   5 +-
 drivers/net/thunderx/Makefile                      |   1 +
 drivers/net/thunderx/meson.build                   |   1 +
 drivers/net/thunderx/nicvf_rxtx.c                  |   3 +-
 .../common/include/arch/arm/rte_pause_64.h         | 133 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 105 ++++++++++++++++
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 13 files changed, 261 insertions(+), 12 deletions(-)

-- 
1.8.3.1



* [dpdk-dev] [PATCH v13 1/5] bus/fslmc: fix the conflicting dmb function
  2019-11-07 21:35 ` [dpdk-dev] [PATCH v13 0/5] use WFE for aarch64 David Marchand
@ 2019-11-07 21:35   ` David Marchand
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 2/5] eal: add the APIs to wait until equal David Marchand
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 163+ messages in thread
From: David Marchand @ 2019-11-07 21:35 UTC (permalink / raw)
  To: dev
  Cc: nd, konstantin.ananyev, Gavin Hu, stable, Hemant Agrawal, Sachin Saxena

From: Gavin Hu <gavin.hu@arm.com>

There are two definitions of dmb() that conflict with each other; for more
details, refer to [1].

include/rte_atomic_64.h:19: error: "dmb" redefined [-Werror]
drivers/bus/fslmc/mc/fsl_mc_sys.h:36: note: this is the location of the
previous definition
 #define dmb() {__asm__ __volatile__("" : : : "memory"); }

The fix is to reuse the EAL definition to avoid conflicts.

[1] http://inbox.dpdk.org/users/VI1PR08MB537631AB25F41B8880DCCA988FDF0@
VI1PR08MB5376.eurprd08.prod.outlook.com/T/#u

Fixes: 3af733ba8da8 ("bus/fslmc: introduce MC object functions")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
---
Changelog since v12:
- fixed Phil Yang mail address,

---
 drivers/bus/fslmc/mc/fsl_mc_sys.h | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/bus/fslmc/mc/fsl_mc_sys.h b/drivers/bus/fslmc/mc/fsl_mc_sys.h
index d0c7b39..a310c56 100644
--- a/drivers/bus/fslmc/mc/fsl_mc_sys.h
+++ b/drivers/bus/fslmc/mc/fsl_mc_sys.h
@@ -32,11 +32,10 @@ struct fsl_mc_io {
 #include <sys/uio.h>
 #include <linux/byteorder/little_endian.h>
 
-#ifndef dmb
-#define dmb() {__asm__ __volatile__("" : : : "memory"); }
-#endif
-#define __iormb()	dmb()
-#define __iowmb()	dmb()
+#include <rte_atomic.h>
+
+#define __iormb()	rte_io_rmb()
+#define __iowmb()	rte_io_wmb()
 #define __arch_getq(a)		(*(volatile uint64_t *)(a))
 #define __arch_putq(v, a)	(*(volatile uint64_t *)(a) = (v))
 #define __arch_putq32(v, a)	(*(volatile uint32_t *)(a) = (v))
-- 
1.8.3.1



* [dpdk-dev] [PATCH v13 2/5] eal: add the APIs to wait until equal
  2019-11-07 21:35 ` [dpdk-dev] [PATCH v13 0/5] use WFE for aarch64 David Marchand
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 1/5] bus/fslmc: fix the conflicting dmb function David Marchand
@ 2019-11-07 21:35   ` David Marchand
  2019-11-08 16:38     ` Ananyev, Konstantin
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 3/5] ticketlock: use new API to reduce contention on aarch64 David Marchand
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 163+ messages in thread
From: David Marchand @ 2019-11-07 21:35 UTC (permalink / raw)
  To: dev
  Cc: nd, konstantin.ananyev, Gavin Hu, Thomas Monjalon, John McNamara,
	Marko Kovacevic, Jerin Jacob, Jan Viktorin

From: Gavin Hu <gavin.hu@arm.com>

The rte_wait_until_equal_xx APIs abstract the functionality of
'polling for a memory location to become equal to a given value'.

Add the RTE_ARM_USE_WFE configuration entry for aarch64, disabled
by default. When it is enabled, the above APIs will call WFE instruction
to save CPU cycles and power.

From a VM, calling this API on aarch64 may trap in and out to release
vCPUs, which causes high exit latency. Since kernel 4.18.20, an adaptive
trapping mechanism has been introduced to balance the latency and
workload.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
---
Changelog since v12:
- added release notes update,
- fixed function prototypes indent,
- reimplemented the arm implementation without exposing internal inline
  functions,
- added asserts in generic implementation,

---
 config/arm/meson.build                             |   1 +
 config/common_base                                 |   5 +
 doc/guides/rel_notes/release_19_11.rst             |   5 +
 .../common/include/arch/arm/rte_pause_64.h         | 133 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  | 105 ++++++++++++++++
 5 files changed, 249 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 46dff3a..ea47425 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -26,6 +26,7 @@ flags_common_default = [
 	['RTE_LIBRTE_AVP_PMD', false],
 
 	['RTE_SCHED_VECTOR', false],
+	['RTE_ARM_USE_WFE', false],
 ]
 
 flags_generic = [
diff --git a/config/common_base b/config/common_base
index 1858598..bb1b1ed 100644
--- a/config/common_base
+++ b/config/common_base
@@ -110,6 +110,11 @@ CONFIG_RTE_MAX_VFIO_CONTAINERS=64
 CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
 CONFIG_RTE_USE_LIBBSD=n
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs,
+# calling these APIs puts the cores in low power state while waiting
+# for the memory address to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_ARM_USE_WFE=n
 
 #
 # Recognize/ignore the AVX/AVX512 CPU flags for performance/power testing.
diff --git a/doc/guides/rel_notes/release_19_11.rst b/doc/guides/rel_notes/release_19_11.rst
index fe11b4b..af5f2c5 100644
--- a/doc/guides/rel_notes/release_19_11.rst
+++ b/doc/guides/rel_notes/release_19_11.rst
@@ -65,6 +65,11 @@ New Features
 
   The lock-free stack implementation is enabled for aarch64 platforms.
 
+* **Added Wait Until Equal API.**
+
+  A new API has been added to wait for a memory location to be updated with a
+  16-bit, 32-bit or 64-bit value.
+
 * **Changed mempool allocation behaviour.**
 
   Objects are no longer across pages by default.
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..e87d10b 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_ARM64_H_
@@ -10,6 +11,11 @@ extern "C" {
 #endif
 
 #include <rte_common.h>
+
+#ifdef RTE_ARM_USE_WFE
+#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+#endif
+
 #include "generic/rte_pause.h"
 
 static inline void rte_pause(void)
@@ -17,6 +23,133 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+
+/* Send an event to quit WFE. */
+#define __SEVL() { asm volatile("sevl" : : : "memory"); }
+
+/* Put processor into low power WFE(Wait For Event) state. */
+#define __WFE() { asm volatile("wfe" : : : "memory"); }
+
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+		int memorder)
+{
+	uint16_t value;
+
+	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
+
+	/*
+	 * Atomic exclusive load from addr, it returns the 16-bit content of
+	 * *addr while making it 'monitored',when it is written by someone
+	 * else, the 'monitored' state is cleared and a event is generated
+	 * implicitly to exit WFE.
+	 */
+#define __LOAD_EXC_16(src, dst, memorder) {               \
+	if (memorder == __ATOMIC_RELAXED) {               \
+		asm volatile("ldxrh %w[tmp], [%x[addr]]"  \
+			: [tmp] "=&r" (dst)               \
+			: [addr] "r"(src)                 \
+			: "memory");                      \
+	} else {                                          \
+		asm volatile("ldaxrh %w[tmp], [%x[addr]]" \
+			: [tmp] "=&r" (dst)               \
+			: [addr] "r"(src)                 \
+			: "memory");                      \
+	} }
+
+	__LOAD_EXC_16(addr, value, memorder)
+	if (value != expected) {
+		__SEVL()
+		do {
+			__WFE()
+			__LOAD_EXC_16(addr, value, memorder)
+		} while (value != expected);
+	}
+#undef __LOAD_EXC_16
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+		int memorder)
+{
+	uint32_t value;
+
+	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
+
+	/*
+	 * Atomic exclusive load from addr, it returns the 32-bit content of
+	 * *addr while making it 'monitored',when it is written by someone
+	 * else, the 'monitored' state is cleared and a event is generated
+	 * implicitly to exit WFE.
+	 */
+#define __LOAD_EXC_32(src, dst, memorder) {              \
+	if (memorder == __ATOMIC_RELAXED) {              \
+		asm volatile("ldxr %w[tmp], [%x[addr]]"  \
+			: [tmp] "=&r" (dst)              \
+			: [addr] "r"(src)                \
+			: "memory");                     \
+	} else {                                         \
+		asm volatile("ldaxr %w[tmp], [%x[addr]]" \
+			: [tmp] "=&r" (dst)              \
+			: [addr] "r"(src)                \
+			: "memory");                     \
+	} }
+
+	__LOAD_EXC_32(addr, value, memorder)
+	if (value != expected) {
+		__SEVL()
+		do {
+			__WFE()
+			__LOAD_EXC_32(addr, value, memorder)
+		} while (value != expected);
+	}
+#undef __LOAD_EXC_32
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+		int memorder)
+{
+	uint64_t value;
+
+	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
+
+	/*
+	 * Atomic exclusive load from addr, it returns the 64-bit content of
+	 * *addr while making it 'monitored',when it is written by someone
+	 * else, the 'monitored' state is cleared and a event is generated
+	 * implicitly to exit WFE.
+	 */
+#define __LOAD_EXC_64(src, dst, memorder) {              \
+	if (memorder == __ATOMIC_RELAXED) {              \
+		asm volatile("ldxr %x[tmp], [%x[addr]]"  \
+			: [tmp] "=&r" (dst)              \
+			: [addr] "r"(src)                \
+			: "memory");                     \
+	} else {                                         \
+		asm volatile("ldaxr %x[tmp], [%x[addr]]" \
+			: [tmp] "=&r" (dst)              \
+			: [addr] "r"(src)                \
+			: "memory");                     \
+	} }
+
+	__LOAD_EXC_64(addr, value, memorder)
+	if (value != expected) {
+		__SEVL()
+		do {
+			__WFE()
+			__LOAD_EXC_64(addr, value, memorder)
+		} while (value != expected);
+	}
+}
+#undef __LOAD_EXC_64
+
+#undef __SEVL
+#undef __WFE
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..7422785 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_PAUSE_H_
@@ -12,6 +13,12 @@
  *
  */
 
+#include <stdint.h>
+#include <assert.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+#include <rte_compat.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +27,102 @@
  */
 static inline void rte_pause(void);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 16-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 16-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+		int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 32-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 32-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+		int memorder);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Wait for *addr to be updated with a 64-bit expected value, with a relaxed
+ * memory ordering model meaning the loads around this API can be reordered.
+ *
+ * @param addr
+ *  A pointer to the memory location.
+ * @param expected
+ *  A 64-bit expected value to be in the memory location.
+ * @param memorder
+ *  Two different memory orders that can be specified:
+ *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
+ *  C++11 memory orders with the same names, see the C++11 standard or
+ *  the GCC wiki on atomic synchronization for detailed definition.
+ */
+__rte_experimental
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+		int memorder);
+
+#ifndef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+		int memorder)
+{
+	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
+
+	while (__atomic_load_n(addr, memorder) != expected)
+		rte_pause();
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
+		int memorder)
+{
+	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
+
+	while (__atomic_load_n(addr, memorder) != expected)
+		rte_pause();
+}
+
+static __rte_always_inline void
+rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
+		int memorder)
+{
+	assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
+
+	while (__atomic_load_n(addr, memorder) != expected)
+		rte_pause();
+}
+#endif
+
 #endif /* _RTE_PAUSE_H_ */
-- 
1.8.3.1



* [dpdk-dev] [PATCH v13 3/5] ticketlock: use new API to reduce contention on aarch64
  2019-11-07 21:35 ` [dpdk-dev] [PATCH v13 0/5] use WFE for aarch64 David Marchand
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 1/5] bus/fslmc: fix the conflicting dmb function David Marchand
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 2/5] eal: add the APIs to wait until equal David Marchand
@ 2019-11-07 21:35   ` David Marchand
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 4/5] net/thunderx: use new API to save cycles " David Marchand
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 5/5] event/opdl: " David Marchand
  4 siblings, 0 replies; 163+ messages in thread
From: David Marchand @ 2019-11-07 21:35 UTC (permalink / raw)
  To: dev; +Cc: nd, konstantin.ananyev, Gavin Hu, Joyce Kong

From: Gavin Hu <gavin.hu@arm.com>

While using ticket lock, cores repeatedly poll the lock variable.
This is replaced by rte_wait_until_equal API.
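
To illustrate the polling being replaced, here is a minimal ticket lock
sketch. The field names next and current follow the diff below, but the
struct, the sketch_ prefixed functions and the empty spin body are
assumptions made for illustration; the real code calls
rte_wait_until_equal_16, which may use rte_pause or WFE.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ticket lock layout; field names follow the diff. */
struct sketch_ticketlock {
	uint16_t current; /* ticket currently being served */
	uint16_t next;    /* next ticket to hand out */
};

/* Stand-in for rte_wait_until_equal_16(): plain polling. */
static inline void
sketch_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		; /* a real build would pause or wait for an event here */
}

static inline void
sketch_ticketlock_lock(struct sketch_ticketlock *tl)
{
	/* Take a ticket, then wait until it is the one being served. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	sketch_wait_until_equal_16(&tl->current, me, __ATOMIC_ACQUIRE);
}

static inline void
sketch_ticketlock_unlock(struct sketch_ticketlock *tl)
{
	/* Serve the next ticket; release pairs with the acquire in lock. */
	uint16_t i = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(i + 1), __ATOMIC_RELEASE);
}
```

Each waiter spins on the same current field, which is why a wait primitive
that avoids repeated loads is of interest for this lock.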

Running ticketlock_autotest on ThunderX2, Ampere eMAG80, and Arm N1SDP[1],
there were variances between runs, but no notable performance gain or
degradation was seen with or without this patch.

[1] https://community.arm.com/developer/tools-software/oss-platforms/w/\
docs/440/neoverse-n1-sdp

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Phil Yang <phil.yang@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..c295ae7 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal_16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
1.8.3.1



* [dpdk-dev] [PATCH v13 4/5] net/thunderx: use new API to save cycles on aarch64
  2019-11-07 21:35 ` [dpdk-dev] [PATCH v13 0/5] use WFE for aarch64 David Marchand
                     ` (2 preceding siblings ...)
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 3/5] ticketlock: use new API to reduce contention on aarch64 David Marchand
@ 2019-11-07 21:35   ` " David Marchand
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 5/5] event/opdl: " David Marchand
  4 siblings, 0 replies; 163+ messages in thread
From: David Marchand @ 2019-11-07 21:35 UTC (permalink / raw)
  To: dev; +Cc: nd, konstantin.ananyev, Gavin Hu, Jerin Jacob, Maciej Czekaj

From: Gavin Hu <gavin.hu@arm.com>

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/net/thunderx/Makefile     | 1 +
 drivers/net/thunderx/meson.build  | 1 +
 drivers/net/thunderx/nicvf_rxtx.c | 3 +--
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/thunderx/Makefile b/drivers/net/thunderx/Makefile
index e6bf497..9e0de10 100644
--- a/drivers/net/thunderx/Makefile
+++ b/drivers/net/thunderx/Makefile
@@ -10,6 +10,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_pmd_thunderx_nicvf.a
 
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 
 LDLIBS += -lm
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
diff --git a/drivers/net/thunderx/meson.build b/drivers/net/thunderx/meson.build
index 69819a9..23d9458 100644
--- a/drivers/net/thunderx/meson.build
+++ b/drivers/net/thunderx/meson.build
@@ -4,6 +4,7 @@
 subdir('base')
 objs = [base_objs]
 
+allow_experimental_apis = true
 sources = files('nicvf_rxtx.c',
 		'nicvf_ethdev.c',
 		'nicvf_svf.c'
diff --git a/drivers/net/thunderx/nicvf_rxtx.c b/drivers/net/thunderx/nicvf_rxtx.c
index 1c42874..90a6098 100644
--- a/drivers/net/thunderx/nicvf_rxtx.c
+++ b/drivers/net/thunderx/nicvf_rxtx.c
@@ -385,8 +385,7 @@ nicvf_fill_rbdr(struct nicvf_rxq *rxq, int to_fill)
 		ltail++;
 	}
 
-	while (__atomic_load_n(&rbdr->tail, __ATOMIC_RELAXED) != next_tail)
-		rte_pause();
+	rte_wait_until_equal_32(&rbdr->tail, next_tail, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&rbdr->tail, ltail, __ATOMIC_RELEASE);
 	nicvf_addr_write(door, to_fill);
-- 
1.8.3.1



* [dpdk-dev] [PATCH v13 5/5] event/opdl: use new API to save cycles on aarch64
  2019-11-07 21:35 ` [dpdk-dev] [PATCH v13 0/5] use WFE for aarch64 David Marchand
                     ` (3 preceding siblings ...)
  2019-11-07 21:35   ` [dpdk-dev] [PATCH v13 4/5] net/thunderx: use new API to save cycles " David Marchand
@ 2019-11-07 21:35   ` " David Marchand
  4 siblings, 0 replies; 163+ messages in thread
From: David Marchand @ 2019-11-07 21:35 UTC (permalink / raw)
  To: dev; +Cc: nd, konstantin.ananyev, Gavin Hu, Liang Ma, Peter Mccarthy

From: Gavin Hu <gavin.hu@arm.com>

Use the new API to wait in low power state instead of continuous
polling to save CPU cycles and power.
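
The pattern this patch touches in opdl_ring.c can be sketched as follows: a
producer waits for the shared tail to reach the head it claimed, then
publishes its own entries. The function names are hypothetical and the wait
is plain polling here, standing in for rte_wait_until_equal_32.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for rte_wait_until_equal_32(): plain polling. */
static inline void
sketch_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
		int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		; /* a real build would pause or wait for an event here */
}

/*
 * Publish n entries claimed at old_head once all earlier producers
 * have advanced the tail up to old_head.
 */
static inline void
sketch_publish_tail(uint32_t *tail, uint32_t old_head, uint32_t n)
{
	sketch_wait_until_equal_32(tail, old_head, __ATOMIC_ACQUIRE);
	__atomic_store_n(tail, old_head + n, __ATOMIC_RELEASE);
}
```

The acquire load orders this producer after the previous one, and the
release store makes its entries visible before the new tail value.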

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Jerin Jacob <jerinj@marvell.com>
---
 drivers/event/opdl/Makefile    | 1 +
 drivers/event/opdl/meson.build | 1 +
 drivers/event/opdl/opdl_ring.c | 5 ++---
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/event/opdl/Makefile b/drivers/event/opdl/Makefile
index bf50a60..72ef07d 100644
--- a/drivers/event/opdl/Makefile
+++ b/drivers/event/opdl/Makefile
@@ -9,6 +9,7 @@ LIB = librte_pmd_opdl_event.a
 # build flags
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 # for older GCC versions, allow us to initialize an event using
 # designated initializers.
 ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
diff --git a/drivers/event/opdl/meson.build b/drivers/event/opdl/meson.build
index 1fe034e..e67b164 100644
--- a/drivers/event/opdl/meson.build
+++ b/drivers/event/opdl/meson.build
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2018 Luca Boccassi <bluca@debian.org>
 
+allow_experimental_apis = true
 sources = files(
 	'opdl_evdev.c',
 	'opdl_evdev_init.c',
diff --git a/drivers/event/opdl/opdl_ring.c b/drivers/event/opdl/opdl_ring.c
index 06fb5b3..c8d19fe 100644
--- a/drivers/event/opdl/opdl_ring.c
+++ b/drivers/event/opdl/opdl_ring.c
@@ -16,6 +16,7 @@
 #include <rte_memcpy.h>
 #include <rte_memory.h>
 #include <rte_memzone.h>
+#include <rte_atomic.h>
 
 #include "opdl_ring.h"
 #include "opdl_log.h"
@@ -474,9 +475,7 @@ opdl_ring_input_multithread(struct opdl_ring *t, const void *entries,
 	/* If another thread started inputting before this one, but hasn'