From mboxrd@z Thu Jan  1 00:00:00 1970
From: Justin He <Justin.He@arm.com>
Subject: Re: [PATCH v3 2/3] ring: synchronize the load and store
	of the tail
Date: Wed, 26 Sep 2018 09:59:02 +0000
Message-ID: <DB7PR08MB3082E373D1ABD115E9292BEEF7150@DB7PR08MB3082.eurprd08.prod.outlook.com>
References: <20180807031943.5331-1-gavin.hu@arm.com>
 <1537172244-64874-1-git-send-email-gavin.hu@arm.com>
 <1537172244-64874-2-git-send-email-gavin.hu@arm.com>
 <VI1PR08MB3167A4F8ECD79D8F683967858F150@VI1PR08MB3167.eurprd08.prod.outlook.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-2022-jp"
Content-Transfer-Encoding: quoted-printable
Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>, Steve Capper
 <Steve.Capper@arm.com>, Ola Liljedahl <Ola.Liljedahl@arm.com>,
 "jerin.jacob@caviumnetworks.com" <jerin.jacob@caviumnetworks.com>, nd
 <nd@arm.com>, "stable@dpdk.org" <stable@dpdk.org>
To: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>, "dev@dpdk.org"
 <dev@dpdk.org>
Return-path: <dev-bounces@dpdk.org>
In-Reply-To: <VI1PR08MB3167A4F8ECD79D8F683967858F150@VI1PR08MB3167.eurprd08.prod.outlook.com>
Content-Language: en-US
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

Hi Gavin

> -----Original Message-----
> From: Gavin Hu (Arm Technology China)
> Sent: 2018=1B$BG/=1B(B9=1B$B7n=1B(B26=1B$BF|=1B(B 17:30
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org; Justin
> He <Justin.He@arm.com>
> Subject: RE: [PATCH v3 2/3] ring: synchronize the load and store of the t=
ail
>
> +Justin He for review.
>
> > -----Original Message-----
> > From: Gavin Hu <gavin.hu@arm.com>
> > Sent: Monday, September 17, 2018 4:17 PM
> > To: dev@dpdk.org
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; Steve Capper
> > <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> > jerin.jacob@caviumnetworks.com; nd <nd@arm.com>; stable@dpdk.org
> > Subject: [PATCH v3 2/3] ring: synchronize the load and store of the
> > tail
> >
> > Synchronize the load-acquire of the tail and the store-release within
> > update_tail, the store release ensures all the ring operations,
> > enqueue or dequeue, are seen by the observers on the other side as
> > soon as they see the updated tail. The load-acquire is needed here as
> > the data dependency is not a reliable way for ordering as the compiler
> > might break it by saving to temporary values to boost performance.
> > When computing the free_entries and avail_entries, use atomic
> > semantics to load the heads and tails instead.
> >
> > The patch was benchmarked with test/ring_perf_autotest and it
> > decreases the enqueue/dequeue latency by 5% ~ 27.6% with two lcores,
> > the real gains are dependent on the number of lcores, depth of the ring=
, SPSC
> or MPMC.
> > For 1 lcore, it also improves a little, about 3 ~ 4%.
> > It is a big improvement, in case of MPMC, with two lcores and ring
> > size of 32, it saves latency up to (3.26-2.36)/3.26 =3D 27.6%.
> >
> > This patch is a bug fix, while the improvement is a bonus. In our
> > analysis the improvement comes from the cacheline pre-filling after
> > hoisting load- acquire from _atomic_compare_exchange_n up above.
> >
> > The test command:
> > $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4
> > --socket-mem=3D\
> > 1024 -- -i
> >
> > Test result with this patch(two cores):
> >  SP/SC bulk enq/dequeue (size: 8): 5.86  MP/MC bulk enq/dequeue (size:
> > 8): 10.15  SP/SC bulk enq/dequeue (size:
> > 32): 1.94  MP/MC bulk enq/dequeue (size: 32): 2.36
> >
> > In comparison of the test result without this patch:
> >  SP/SC bulk enq/dequeue (size: 8): 6.67  MP/MC bulk enq/dequeue (size:
> > 8): 13.12  SP/SC bulk enq/dequeue (size:
> > 32): 2.04  MP/MC bulk enq/dequeue (size: 32): 3.26
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> > b/lib/librte_ring/rte_ring_c11_mem.h
> > index 234fea0..0eae3b3 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> > unsigned int is_sp,
> >  *old_head =3D __atomic_load_n(&r->prod.head,
> >  __ATOMIC_ACQUIRE);
> >
> > -/*
> > - *  The subtraction is done between two unsigned 32bits
> > value
> > +/* load-acquire synchronize with store-release of ht->tail
> > + * in update_tail.
> > + */
> > +const uint32_t cons_tail =3D __atomic_load_n(&r->cons.tail,
> > +
> > __ATOMIC_ACQUIRE);
I ever noticed that freebsd also used the double __atomic_load_n. And as we
discussed it several months ago [1], seems the second load_acquire is not
necessary. But as you verified, it looks good to me
[1] http://mails.dpdk.org/archives/dev/2017-November/080983.html
So,
Reviewed-by: Jia He <justin.he@arm.com>

Cheers,
Justin (Jia He)
> > +
> > +/* The subtraction is done between two unsigned 32bits
> > value
> >   * (the result is always modulo 32 bits even if we have
> >   * *old_head > cons_tail). So 'free_entries' is always between 0
> >   * and capacity (which is < size).
> >   */
> > -*free_entries =3D (capacity + r->cons.tail - *old_head);
> > +*free_entries =3D (capacity + cons_tail - *old_head);
> >
> >  /* check that we have enough room in ring */
> >  if (unlikely(n > *free_entries))
> > @@ -132,15 +137,22 @@ __rte_ring_move_cons_head(struct rte_ring *r,
> > int is_sc,
> >  do {
> >  /* Restore n as it may change every loop */
> >  n =3D max;
> > +
> >  *old_head =3D __atomic_load_n(&r->cons.head,
> >  __ATOMIC_ACQUIRE);
> >
> > +/* this load-acquire synchronize with store-release of ht->tail
> > + * in update_tail.
> > + */
> > +const uint32_t prod_tail =3D __atomic_load_n(&r->prod.tail,
> > +__ATOMIC_ACQUIRE);
> > +
> >  /* The subtraction is done between two unsigned 32bits value
> >   * (the result is always modulo 32 bits even if we have
> >   * cons_head > prod_tail). So 'entries' is always between 0
> >   * and size(ring)-1.
> >   */
> > -*entries =3D (r->prod.tail - *old_head);
> > +*entries =3D (prod_tail - *old_head);
> >
> >  /* Set the actual entries for dequeue */
> >  if (n > *entries)
> > --
> > 2.7.4

IMPORTANT NOTICE: The contents of this email and any attachments are confid=
ential and may also be privileged. If you are not the intended recipient, p=
lease notify the sender immediately and do not disclose the contents to any=
 other person, use it for any purpose, or store or copy the information in =
any medium. Thank you.