From: "Eads, Gage"
Subject: Re: [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
Date: Fri, 25 Jan 2019 17:19:18 +0000
To: Honnappa Nagarahalli; dev@dpdk.org
Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce;
 Ananyev, Konstantin; nd; chaozhu@linux.vnet.ibm.com; jerinj@marvell.com;
 hemant.agrawal@nxp.com

> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, January 23, 2019 11:22 PM
> To: Eads, Gage; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce;
> Ananyev, Konstantin; nd; chaozhu@linux.vnet.ibm.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; nd
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
>
> > > > > > Added other platform owners.
> > > > > > >
> > > > > > > > @@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
> > > > > > > >  }
> > > > > > > >  #endif
> > > > > > > >
> > > > > > > > +static inline int
> > > > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
> > > > > > > > +{
> > > > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and
> > > > > > > 'src' should be pointers to 128b (__int128)? Or we could define
> > > > > > > our own data type.
> > > > > >
> > > > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > > > __int128 originally because it fails to compile with -pedantic,
> > > > > > but on second thought (and with your suggestion of a separate
> > > > > > data type), we can resolve that with this typedef:
> > > > > >
> > > > > > typedef struct {
> > > > > >         RTE_STD_C11 __int128 val;
> > > > > > } rte_int128_t;
> > > > > ok
> > > > >
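(For concreteness, a rough sketch of how the prototype above might look
once it takes the new 128-bit type. It is illustrative only -- it leans on
the compiler's __atomic builtin which, as discussed further down in this
thread, may not be lock-free on every compiler/platform, so a real
implementation would use arch-specific code. Names and qualifiers here are
not a committed interface.)

typedef struct {
        __int128 val;   /* DPDK proper would spell this RTE_STD_C11 __int128 */
} rte_int128_t;

/*
 * Strong, sequentially consistent 128-bit compare-and-set, sketched on
 * top of the compiler builtin. Returns non-zero if *dst was updated.
 * Depending on the compiler, building this may need -mcx16 (x86-64)
 * and/or linking against libatomic.
 */
static inline int
rte_atomic128_cmpset(rte_int128_t *dst, rte_int128_t *exp,
                     const rte_int128_t *src)
{
        return __atomic_compare_exchange_n(&dst->val, &exp->val, src->val,
                                           0, /* strong */
                                           __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST);
}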
> > > > > > > Since it is a new API, can we define it with memory orderings
> > > > > > > which will be more conducive to relaxed memory ordering based
> > > > > > > architectures? You can refer to [1] and [2] for guidance.
> > > > > >
> > > > > > I certainly see the value in controlling the operation's memory
> > > > > > ordering, like in the __atomic intrinsics, but I'm not sure this
> > > > > > patchset is the right place to address that. I see that work
> > > > > > going a couple of ways:
> > > > > >
> > > > > > 1. Expand the existing rte_atomicN_* interfaces with additional
> > > > > > arguments. In that case, I'd prefer this be done in a separate
> > > > > > patchset that addresses all the atomic operations, not just
> > > > > > cmpset, so the interface changes are chosen according to the
> > > > > > needs of the full set of atomic operations. If this approach is
> > > > > > taken then there's no need to solve this while
> > > > > > rte_atomic128_cmpset is experimental, since all the other
> > > > > > functions are non-experimental anyway.
> > > > > >
> > > > > > - Or -
> > > > > >
> > > > > > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > > > > > strongly ordered behavior), and instead create new versions of
> > > > > > them that take additional arguments. In this case, we can
> > > > > > implement rte_atomic128_cmpset() as is and create a more
> > > > > > flexible version in a later patchset.
> > > > > >
> > > > > > Either way, I think the current interface (w.r.t. memory
> > > > > > ordering options) can work and still leaves us in a good
> > > > > > position for future changes/improvements.
> > > > >
> > > > > I do not see the need to modify/extend the existing rte_atomicN_*
> > > > > APIs, as the corresponding __atomic intrinsics serve as
> > > > > replacements. I expect that at some point the DPDK code base will
> > > > > not be using rte_atomicN_* APIs.
> > > > > However, __atomic intrinsics do not support 128b wide parameters.
> > > > > Hence
> > > >
> > > > I don't think that's correct. From the GCC docs:
> > > >
> > > > "16-byte integral types are also allowed if `__int128' (see
> > > > __int128) is supported by the architecture."
> > > >
> > > > This works with x86-64 -- I assume aarch64 also, but haven't confirmed.
> > > >
> > > > Source:
> > > > https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-Builtins.html
> > >
> > > (The following is based on my reading, not on experiments.) My
> > > understanding is that __atomic_load_n/store_n were implemented using a
> > > compare-and-swap HW instruction (due to lack of HW 128b atomic load
> > > and store instructions). This introduced the side effect of a
> > > store/load respectively, whereas the user does not expect such side
> > > effects. As suggested in [1], it looks like GCC delegated the
> > > implementation to libatomic, which 'it seems' uses locks to implement
> > > 128b __atomic intrinsics (needs to be verified).
> > >
> > > If __atomic intrinsics, for any of the supported platforms, do not
> > > have an optimal implementation, I prefer to add a DPDK API as an
> > > abstraction. A given platform can choose to implement such an API
> > > using __atomic intrinsics if it wants. The DPDK API can be similar to
> > > __atomic_compare_exchange_n.
> >
> > Certainly. From the linked discussions, I see how this would affect the
> > design of (hypothetical functions) rte_atomic128_read() and
> > rte_atomic128_set(), but I don't see anything that suggests (for the
> > architectures being discussed) that __atomic_compare_exchange_n is
> > suboptimal.
>
> I wrote some code and generated assembly to verify what is happening. On
> aarch64, this call is delegated to libatomic and libatomic needs to be
> linked. In the generated assembly, it is clear that it uses locks
> (pthread mutex lock) to provide atomicity. For 32b and 64b the compiler
> generates the expected inline assembly. Both '__atomic_always_lock_free'
> and '__atomic_is_lock_free' return 0 to indicate that 128b __atomic
> intrinsics are not lock free. (gcc 8.2)
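(A minimal sketch of that kind of lock-freedom check, for anyone who wants
to reproduce it -- assuming gcc; per the observations above, the 128b case
may need -latomic to link, and -mcx16 on x86-64. This test program is
illustrative, not the code used in the experiments described here.)

#include <stdio.h>

int main(void)
{
        __extension__ __int128 v = 0;

        /* Compile-time answer: are 16-byte atomics always inlined? */
        printf("always_lock_free(16B): %d\n",
               (int)__atomic_always_lock_free(sizeof(v), &v));

        /* Runtime answer for this object; may be resolved by libatomic,
         * hence the -latomic link requirement on some targets. */
        printf("is_lock_free(16B):     %d\n",
               (int)__atomic_is_lock_free(sizeof(v), &v));

        return 0;
}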
>
> Out of curiosity, I also did similar experiments on x86 (CPU E5-2660 v4).
> Even after using the -mcx16 flag the call is delegated to libatomic. I
> see the 'lock cmpxchg16b' in the generated assembly. But
> '__atomic_always_lock_free' and '__atomic_is_lock_free' return 0 to
> indicate that 128b __atomic intrinsics are NOT lock free with gcc version
> 7.3.0. However, with gcc version 5.4.0, '__atomic_is_lock_free' returns
> 1. I found more discussion at [3]. However, I am not an expert on x86.
>
> [3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
>
> These problems do not exist with 32b and 64b.
>

Thanks for investigating this. If GCC doesn't compile optimal code for
aarch64 (i.e. LDXP+STXP or CASP), I don't think we have a choice but to use
our own implementation for 128-bit atomics, and it makes sense to model the
interface after the __atomic intrinsics as you suggested (something along
the lines of the sketch at the end of this mail).

> > > [1] https://patchwork.ozlabs.org/patch/721686/
> > > [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> > >
> > > > > DPDK needs to write its own. Since this is the first API in that
> > > > > regard, I prefer that we start with a signature that resembles the
> > > > > __atomic intrinsics, which have been proven to provide the best
> > > > > flexibility for all the platforms supported by DPDK.
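(To make "a signature that resembles the __atomic intrinsics" concrete,
here is one possible shape, modeled on __atomic_compare_exchange_n with an
explicit weak/strong flag and success/failure orderings. This is only a
sketch under that assumption: the function name, parameter order, and the
builtin-based fallback are illustrative, not the eventual DPDK API;
per-arch implementations (cmpxchg16b on x86-64, LDXP/STXP or CASP on
aarch64) would replace the generic fallback.)

#include <stdbool.h>

typedef struct {
        __int128 val;   /* DPDK proper would spell this RTE_STD_C11 __int128 */
} rte_int128_t;

/*
 * 128-bit compare-exchange shaped like __atomic_compare_exchange_n: the
 * caller picks weak/strong and the success/failure memory orderings, so
 * relaxed-ordering architectures are not forced into SEQ_CST. Generic
 * fallback only; arch-specific versions would provide the real code.
 */
static inline bool
rte_atomic128_compare_exchange(rte_int128_t *dst, rte_int128_t *expected,
                               const rte_int128_t *desired, bool weak,
                               int success_memorder, int failure_memorder)
{
        return __atomic_compare_exchange_n(&dst->val, &expected->val,
                                           desired->val, weak,
                                           success_memorder,
                                           failure_memorder);
}

/* Hypothetical usage: publish a 16-byte value with release semantics on
 * success and relaxed ordering on failure. */
static inline bool
publish_pair(rte_int128_t *slot, rte_int128_t *snapshot, rte_int128_t next)
{
        return rte_atomic128_compare_exchange(slot, snapshot, &next,
                                              false, /* strong */
                                              __ATOMIC_RELEASE,
                                              __ATOMIC_RELAXED);
}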