All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 0/2] ring C11 library fix and optimization
@ 2018-11-09 11:42 Gavin Hu
  2018-11-09 11:42 ` [PATCH v1 1/2] ring: keep the deterministic order allowing retry to work Gavin Hu
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Gavin Hu @ 2018-11-09 11:42 UTC (permalink / raw)
  To: dev
  Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu

V1:
Update the ring C11 library including the following changes:
1) On relaxed ordering platforms(like Arm64,PPPC), in ring C11 implementation,
   loading head and tail might be reodered, this makes CAS(compare and retry
   the flow if the head is outdated) not working as expected, the fix is to
   ensure the head is read before the tail, leaving no chances of the combination
   of outdated head and new tail.
2) With the above memory fence introduced, some loading can be relaxed.

Gavin Hu (2):
  ring: keep the deterministic order allowing retry to work
  ring: relaxed ordering for load and store the head

 lib/librte_ring/rte_ring_c11_mem.h | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v1 1/2] ring: keep the deterministic order allowing retry to work
  2018-11-09 11:42 [PATCH v1 0/2] ring C11 library fix and optimization Gavin Hu
@ 2018-11-09 11:42 ` Gavin Hu
  2018-11-09 11:42 ` [PATCH v1 2/2] ring: relaxed ordering for load and store the head Gavin Hu
  2018-11-13 15:57 ` [PATCH v1 0/2] ring C11 library fix and optimization Thomas Monjalon
  2 siblings, 0 replies; 4+ messages in thread
From: Gavin Hu @ 2018-11-09 11:42 UTC (permalink / raw)
  To: dev
  Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

Use case scenario:
1) Thread 1 is enqueuing. It reads prod.head and gets stalled for some
   reasons (running out of cpu time, preempted,...)
2) Thread 2 is enqueuing. It succeeds in enqueuing and moves prod.head
   forward.
3) Thread 3 is dequeuing. It succeeds in dequeuing and moves the cons.tail
   beyond the prod.head read by thread 1.
4) Thread 1 is re-scheduled. It reads cons.tail.

cpu1(producer)      cpu2(producer)          cpu3(consumer)
load r->prod.head
    ^               load r->prod.head
    |               load r->cons.tail
    |               store r->prod.head(+n)
  stalled           <-- enqueue ----->
    |               store r->prod.tail(+n)
    |                                        load r->cons.head
    |                                        load r->prod.tail
    |                                        store r->cons.head(+n)
    |                                        <...dequeue.....>
    v                                        store r->cons.tail(+n)
load r->cons.tail

For thread 1, the __atomic_compare_exchange_n detects the outdated
prod.head and retry the flow with the new one. This retry flow works ok on
strong ordering platform(eg:x86). But for weak ordering platforms(arm,
ppc), loading cons.tail and prod.head might be re-ordered, prod.head is new
but cons.tail becomes too old, the retry flow, based on the detection of
outdated head, does not trigger as expected, thus the outdate cons.tail
causes wrong free_entries.

Similarly, for dequeuing, outdated prod.tail leads to wrong avail_entries.

The fix is to keep the deterministic order of two loads allowing the retry
to work.

Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core, 4 threads/core, 2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i

Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.64
MP/MC bulk enq/dequeue (size: 8): 9.58
SP/SC bulk enq/dequeue (size: 32): 1.98
MP/MC bulk enq/dequeue (size: 32): 2.30

With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34

The results showed the thread fence degrade the performance slightly, but
it is required for correctness.

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 7bc74a4..dc49a99 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -66,6 +66,9 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		/* Reset n to the initial burst count */
 		n = max;
 
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -139,6 +142,9 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		/* Restore n as it may change every loop */
 		n = max;
 
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH v1 2/2] ring: relaxed ordering for load and store the head
  2018-11-09 11:42 [PATCH v1 0/2] ring C11 library fix and optimization Gavin Hu
  2018-11-09 11:42 ` [PATCH v1 1/2] ring: keep the deterministic order allowing retry to work Gavin Hu
@ 2018-11-09 11:42 ` Gavin Hu
  2018-11-13 15:57 ` [PATCH v1 0/2] ring C11 library fix and optimization Thomas Monjalon
  2 siblings, 0 replies; 4+ messages in thread
From: Gavin Hu @ 2018-11-09 11:42 UTC (permalink / raw)
  To: dev
  Cc: thomas, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli, gavin.hu,
	stable

When calling __atomic_compare_exchange_n, use relaxed ordering for the
success case, as multiple producers/consumers do not release updates to
each other so no need for acquire or release ordering.

Because the thread fence in place, ordering for the first iteration can be
relaxed.

Run the ring perf test on the following testbed:
HW: ThunderX2 B0 CPU CN9975 v2.0, 2 sockets, 28core,4 threads/core,2.5GHz
OS: Ubuntu 16.04.5 LTS, Kernel: 4.15.0-36-generic
DPDK: 18.08, Configuration: arm64-armv8a-linuxapp-gcc
gcc: 8.1.0
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 \
--socket-mem=1024 -- -i

Without the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.75
MP/MC bulk enq/dequeue (size: 8): 10.18
SP/SC bulk enq/dequeue (size: 32): 1.80
MP/MC bulk enq/dequeue (size: 32): 2.34

With the patch:
*** Testing using two physical cores ***
SP/SC bulk enq/dequeue (size: 8): 5.59
MP/MC bulk enq/dequeue (size: 8): 10.54
SP/SC bulk enq/dequeue (size: 32): 1.73
MP/MC bulk enq/dequeue (size: 32): 2.38

No significant improvement, nor regression was seen,as the optimisation is
not at the critical path.

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index dc49a99..0fb73a3 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -61,7 +61,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
-	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
@@ -97,7 +97,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
-					0, __ATOMIC_ACQUIRE,
+					0, __ATOMIC_RELAXED,
 					__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
 	return n;
@@ -137,7 +137,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
-	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
@@ -172,7 +172,7 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
 							old_head, *new_head,
-							0, __ATOMIC_ACQUIRE,
+							0, __ATOMIC_RELAXED,
 							__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
 	return n;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v1 0/2] ring C11 library fix and optimization
  2018-11-09 11:42 [PATCH v1 0/2] ring C11 library fix and optimization Gavin Hu
  2018-11-09 11:42 ` [PATCH v1 1/2] ring: keep the deterministic order allowing retry to work Gavin Hu
  2018-11-09 11:42 ` [PATCH v1 2/2] ring: relaxed ordering for load and store the head Gavin Hu
@ 2018-11-13 15:57 ` Thomas Monjalon
  2 siblings, 0 replies; 4+ messages in thread
From: Thomas Monjalon @ 2018-11-13 15:57 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, stephen, olivier.matz, chaozhu, bruce.richardson,
	konstantin.ananyev, jerin.jacob, Honnappa.Nagarahalli

09/11/2018 12:42, Gavin Hu:
> V1:
> Update the ring C11 library including the following changes:
> 1) On relaxed ordering platforms(like Arm64,PPPC), in ring C11 implementation,
>    loading head and tail might be reodered, this makes CAS(compare and retry
>    the flow if the head is outdated) not working as expected, the fix is to
>    ensure the head is read before the tail, leaving no chances of the combination
>    of outdated head and new tail.
> 2) With the above memory fence introduced, some loading can be relaxed.
> 
> Gavin Hu (2):
>   ring: keep the deterministic order allowing retry to work
>   ring: relaxed ordering for load and store the head

No comment, so I guess it can enter in 18.11-rc3.
Applied, thanks

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-11-13 15:57 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-09 11:42 [PATCH v1 0/2] ring C11 library fix and optimization Gavin Hu
2018-11-09 11:42 ` [PATCH v1 1/2] ring: keep the deterministic order allowing retry to work Gavin Hu
2018-11-09 11:42 ` [PATCH v1 2/2] ring: relaxed ordering for load and store the head Gavin Hu
2018-11-13 15:57 ` [PATCH v1 0/2] ring C11 library fix and optimization Thomas Monjalon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.