From: "Gavin Hu (Arm Technology China)"
To: "Gavin Hu (Arm Technology China)", "dev@dpdk.org"
Cc: Honnappa Nagarahalli, Steve Capper, Ola Liljedahl,
 "jerin.jacob@caviumnetworks.com", nd, "stable@dpdk.org", Justin He
Subject: Re: [PATCH v3 2/3] ring: synchronize the load and store of the tail
Date: Wed, 26 Sep 2018 09:29:36 +0000
References: <20180807031943.5331-1-gavin.hu@arm.com>
 <1537172244-64874-1-git-send-email-gavin.hu@arm.com>
 <1537172244-64874-2-git-send-email-gavin.hu@arm.com>
In-Reply-To: <1537172244-64874-2-git-send-email-gavin.hu@arm.com>
List-Id: DPDK patches and discussions

+Justin He for review.

> -----Original Message-----
> From: Gavin Hu
> Sent: Monday, September 17, 2018 4:17 PM
> To: dev@dpdk.org
> Cc: Gavin Hu (Arm Technology China); Honnappa Nagarahalli; Steve Capper;
> Ola Liljedahl; jerin.jacob@caviumnetworks.com; nd; stable@dpdk.org
> Subject: [PATCH v3 2/3] ring: synchronize the load and store of the tail
>
> Synchronize the load-acquire of the tail with the store-release inside
> update_tail: the store-release ensures that all ring operations, enqueue
> or dequeue, are visible to observers on the other side as soon as they
> see the updated tail. The load-acquire is needed because a data
> dependency is not a reliable way to enforce ordering; the compiler may
> break it by caching values in temporaries for performance.
> When computing free_entries and avail_entries, load the heads and tails
> with atomic semantics instead.
>
> The patch was benchmarked with test/ring_perf_autotest; it decreases the
> enqueue/dequeue latency by 5% ~ 27.6% with two lcores. The real gains
> depend on the number of lcores, the depth of the ring, and SP/SC versus
> MP/MC. For one lcore it also improves slightly, by about 3 ~ 4%.
> The improvement can be substantial: for MP/MC with two lcores and a ring
> size of 32, it cuts latency by up to (3.26-2.36)/3.26 = 27.6%.
>
> This patch is a bug fix; the performance improvement is a bonus. In our
> analysis the improvement comes from cacheline pre-filling after hoisting
> the load-acquire up above __atomic_compare_exchange_n.
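[Reviewer aid, not part of the patch: a minimal standalone sketch of the
acquire/release pairing the commit message describes, using the same GCC
__atomic builtins as rte_ring_c11_mem.h. All names here (RING_SZ, slots,
tail, publish, consume) are illustrative, not DPDK APIs.]

#include <stdint.h>

#define RING_SZ 8u                      /* illustrative power-of-two size */

static uint32_t slots[RING_SZ];         /* ring storage */
static uint32_t tail;                   /* producer publishes via tail */

/* Producer: fill the slot first, then store-release the new tail, so a
 * consumer that observes the new tail also observes the slot contents. */
static void publish(uint32_t head, uint32_t val)
{
	slots[head & (RING_SZ - 1)] = val;
	__atomic_store_n(&tail, head + 1, __ATOMIC_RELEASE);
}

/* Consumer: the load-acquire of tail pairs with the store-release above.
 * With a plain load, the compiler or CPU may hoist the slot read, which
 * is the kind of ordering bug the patch closes in the head-move paths. */
static int consume(uint32_t head, uint32_t *out)
{
	uint32_t t = __atomic_load_n(&tail, __ATOMIC_ACQUIRE);

	if (head == t)
		return 0;               /* nothing published yet */
	*out = slots[head & (RING_SZ - 1)];
	return 1;
}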
>
> The test command:
> $ sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
> 1024 -- -i
>
> Test result with this patch (two cores):
> SP/SC bulk enq/dequeue (size: 8): 5.86
> MP/MC bulk enq/dequeue (size: 8): 10.15
> SP/SC bulk enq/dequeue (size: 32): 1.94
> MP/MC bulk enq/dequeue (size: 32): 2.36
>
> For comparison, the test result without this patch:
> SP/SC bulk enq/dequeue (size: 8): 6.67
> MP/MC bulk enq/dequeue (size: 8): 13.12
> SP/SC bulk enq/dequeue (size: 32): 2.04
> MP/MC bulk enq/dequeue (size: 32): 3.26
>
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu
> Reviewed-by: Honnappa Nagarahalli
> Reviewed-by: Steve Capper
> Reviewed-by: Ola Liljedahl
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
> index 234fea0..0eae3b3 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		*old_head = __atomic_load_n(&r->prod.head,
>  					__ATOMIC_ACQUIRE);
>
> -		/*
> -		 * The subtraction is done between two unsigned 32bits value
> +		/* load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> +					__ATOMIC_ACQUIRE);
> +
> +		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * *old_head > cons_tail). So 'free_entries' is always between 0
>  		 * and capacity (which is < size).
>  		 */
> -		*free_entries = (capacity + r->cons.tail - *old_head);
> +		*free_entries = (capacity + cons_tail - *old_head);
>
>  		/* check that we have enough room in ring */
>  		if (unlikely(n > *free_entries))
> @@ -132,15 +137,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	do {
>  		/* Restore n as it may change every loop */
>  		n = max;
> +
>  		*old_head = __atomic_load_n(&r->cons.head,
>  					__ATOMIC_ACQUIRE);
>
> +		/* this load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> +					__ATOMIC_ACQUIRE);
> +
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
>  		 * and size(ring)-1.
>  		 */
> -		*entries = (r->prod.tail - *old_head);
> +		*entries = (prod_tail - *old_head);
>
>  		/* Set the actual entries for dequeue */
>  		if (n > *entries)
> --
> 2.7.4
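[Also a reviewer aid, not part of the patch: a tiny standalone check of
the modulo-2^32 arithmetic that the comments in the diff rely on. The
index values are made up to sit just past the 32-bit wraparound point.]

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint32_t capacity = 32;

	/* The producer head has wrapped past UINT32_MAX while the
	 * consumer tail has not: head is 15 entries ahead of the tail. */
	uint32_t cons_tail = UINT32_MAX - 9;    /* 2^32 - 10 */
	uint32_t old_head = cons_tail + 15;     /* wraps around to 5 */

	/* Same expression as in __rte_ring_move_prod_head: unsigned
	 * subtraction is modulo 2^32, so the result stays correct even
	 * though old_head < cons_tail numerically. */
	uint32_t free_entries = capacity + cons_tail - old_head;

	assert(free_entries == capacity - 15);  /* 15 entries in flight */
	printf("free_entries = %u\n", free_entries);
	return 0;
}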