From: "Eads, Gage"
Subject: Re: [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
Date: Fri, 25 Jan 2019 17:19:18 +0000
To: Honnappa Nagarahalli; dev@dpdk.org
Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce;
 Ananyev, Konstantin; nd; chaozhu@linux.vnet.ibm.com; jerinj@marvell.com;
 hemant.agrawal@nxp.com

> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, January 23, 2019 11:22 PM
> To: Eads, Gage; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce;
> Ananyev, Konstantin; nd; chaozhu@linux.vnet.ibm.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; nd
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
>
> > > > > > Added other platform owners.
> > > > > > >
> > > > > > > > @@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
> > > > > > > >  }
> > > > > > > >  #endif
> > > > > > > >
> > > > > > > > +static inline int
> > > > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
> > > > > > > > +{
> > > > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and
> > > > > > > 'src' should be pointers to 128b (__int128)? Or we could define
> > > > > > > our own data type.
> > > > > >
> > > > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > > > __int128 originally because it fails to compile with -pedantic,
> > > > > > but on second thought (and with your suggestion of a separate
> > > > > > data type), we can resolve that with this typedef:
> > > > > >
> > > > > > typedef struct {
> > > > > >         RTE_STD_C11 __int128 val;
> > > > > > } rte_int128_t;
> > > > > ok
> > > > >
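(For concreteness, a rough sketch of how the prototype above might look
once it takes the new 128-bit type. It is illustrative only -- it leans on
the compiler's __atomic builtin which, as discussed further down in this
thread, may not be lock-free on every compiler/platform, so a real
implementation would use arch-specific code. Names and qualifiers here are
not a committed interface.)

typedef struct {
        __int128 val;   /* DPDK proper would spell this RTE_STD_C11 __int128 */
} rte_int128_t;

/*
 * Strong, sequentially consistent 128-bit compare-and-set, sketched on
 * top of the compiler builtin. Returns non-zero if *dst was updated.
 * Depending on the compiler, building this may need -mcx16 (x86-64)
 * and/or linking against libatomic.
 */
static inline int
rte_atomic128_cmpset(rte_int128_t *dst, rte_int128_t *exp,
                     const rte_int128_t *src)
{
        return __atomic_compare_exchange_n(&dst->val, &exp->val, src->val,
                                           0, /* strong */
                                           __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST);
}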
> > > > > > > Since it is a new API, can we define it with memory orderings
> > > > > > > which will be more conducive to relaxed memory ordering based
> > > > > > > architectures? You can refer to [1] and [2] for guidance.
> > > > > >
> > > > > > I certainly see the value in controlling the operation's memory
> > > > > > ordering, like in the __atomic intrinsics, but I'm not sure this
> > > > > > patchset is the right place to address that. I see that work
> > > > > > going a couple of ways:
> > > > > >
> > > > > > 1. Expand the existing rte_atomicN_* interfaces with additional
> > > > > > arguments. In that case, I'd prefer this be done in a separate
> > > > > > patchset that addresses all the atomic operations, not just
> > > > > > cmpset, so the interface changes are chosen according to the
> > > > > > needs of the full set of atomic operations. If this approach is
> > > > > > taken then there's no need to solve this while
> > > > > > rte_atomic128_cmpset is experimental, since all the other
> > > > > > functions are non-experimental anyway.
> > > > > >
> > > > > > - Or -
> > > > > >
> > > > > > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > > > > > strongly ordered behavior), and instead create new versions of
> > > > > > them that take additional arguments. In this case, we can
> > > > > > implement rte_atomic128_cmpset() as is and create a more
> > > > > > flexible version in a later patchset.
> > > > > >
> > > > > > Either way, I think the current interface (w.r.t. memory
> > > > > > ordering options) can work and still leaves us in a good
> > > > > > position for future changes/improvements.
> > > > >
> > > > > I do not see the need to modify/extend the existing rte_atomicN_*
> > > > > APIs, as the corresponding __atomic intrinsics serve as
> > > > > replacements. I expect that at some point the DPDK code base will
> > > > > not be using rte_atomicN_* APIs.
> > > > > However, __atomic intrinsics do not support 128b wide parameters.
> > > > > Hence
> > > >
> > > > I don't think that's correct. From the GCC docs:
> > > >
> > > > "16-byte integral types are also allowed if `__int128' (see
> > > > __int128) is supported by the architecture."
> > > >
> > > > This works with x86-64 -- I assume aarch64 also, but haven't confirmed.
> > > >
> > > > Source:
> > > > https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-Builtins.html
> > >
> > > (The following is based on my reading, not on experiments.) My
> > > understanding is that __atomic_load_n/store_n were implemented using a
> > > compare-and-swap HW instruction (due to lack of HW 128b atomic load
> > > and store instructions). This introduced the side effect of a
> > > store/load respectively, whereas the user does not expect such side
> > > effects. As suggested in [1], it looks like GCC delegated the
> > > implementation to libatomic, which 'it seems' uses locks to implement
> > > 128b __atomic intrinsics (needs to be verified).
> > >
> > > If __atomic intrinsics, for any of the supported platforms, do not
> > > have an optimal implementation, I prefer to add a DPDK API as an
> > > abstraction. A given platform can choose to implement such an API
> > > using __atomic intrinsics if it wants. The DPDK API can be similar to
> > > __atomic_compare_exchange_n.
> >
> > Certainly. From the linked discussions, I see how this would affect the
> > design of (hypothetical functions) rte_atomic128_read() and
> > rte_atomic128_set(), but I don't see anything that suggests (for the
> > architectures being discussed) that __atomic_compare_exchange_n is
> > suboptimal.
>
> I wrote some code and generated assembly to verify what is happening. On
> aarch64, this call is delegated to libatomic and libatomic needs to be
> linked. In the generated assembly, it is clear that it uses locks
> (pthread mutex lock) to provide atomicity. For 32b and 64b the compiler
> generates the expected inline assembly. Both '__atomic_always_lock_free'
> and '__atomic_is_lock_free' return 0 to indicate that 128b __atomic
> intrinsics are not lock free. (gcc 8.2)
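(A minimal sketch of that kind of lock-freedom check, for anyone who wants
to reproduce it -- assuming gcc; per the observations above, the 128b case
may need -latomic to link, and -mcx16 on x86-64. This test program is
illustrative, not the code used in the experiments described here.)

#include <stdio.h>

int main(void)
{
        __extension__ __int128 v = 0;

        /* Compile-time answer: are 16-byte atomics always inlined? */
        printf("always_lock_free(16B): %d\n",
               (int)__atomic_always_lock_free(sizeof(v), &v));

        /* Runtime answer for this object; may be resolved by libatomic,
         * hence the -latomic link requirement on some targets. */
        printf("is_lock_free(16B):     %d\n",
               (int)__atomic_is_lock_free(sizeof(v), &v));

        return 0;
}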
>
> Out of curiosity, I also did similar experiments on x86 (CPU E5-2660 v4).
> Even after using the -mcx16 flag the call is delegated to libatomic. I
> see the 'lock cmpxchg16b' in the generated assembly. But
> '__atomic_always_lock_free' and '__atomic_is_lock_free' return 0 to
> indicate that 128b __atomic intrinsics are NOT lock free with gcc version
> 7.3.0. However, with gcc version 5.4.0, '__atomic_is_lock_free' returns
> 1. I found more discussion at [3]. However, I am not an expert on x86.
>
> [3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
>
> These problems do not exist with 32b and 64b.
>

Thanks for investigating this. If GCC doesn't compile optimal code for
aarch64 (i.e. LDXP+STXP or CASP), I don't think we have a choice but to use
our own implementation for 128-bit atomics, and it makes sense to model the
interface after the __atomic intrinsics as you suggested (something along
the lines of the sketch at the end of this mail).

> > > [1] https://patchwork.ozlabs.org/patch/721686/
> > > [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> > >
> > > > > DPDK needs to write its own. Since this is the first API in that
> > > > > regard, I prefer that we start with a signature that resembles the
> > > > > __atomic intrinsics, which have been proven to provide the best
> > > > > flexibility for all the platforms supported by DPDK.
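(To make "a signature that resembles the __atomic intrinsics" concrete,
here is one possible shape, modeled on __atomic_compare_exchange_n with an
explicit weak/strong flag and success/failure orderings. This is only a
sketch under that assumption: the function name, parameter order, and the
builtin-based fallback are illustrative, not the eventual DPDK API;
per-arch implementations (cmpxchg16b on x86-64, LDXP/STXP or CASP on
aarch64) would replace the generic fallback.)

#include <stdbool.h>

typedef struct {
        __int128 val;   /* DPDK proper would spell this RTE_STD_C11 __int128 */
} rte_int128_t;

/*
 * 128-bit compare-exchange shaped like __atomic_compare_exchange_n: the
 * caller picks weak/strong and the success/failure memory orderings, so
 * relaxed-ordering architectures are not forced into SEQ_CST. Generic
 * fallback only; arch-specific versions would provide the real code.
 */
static inline bool
rte_atomic128_compare_exchange(rte_int128_t *dst, rte_int128_t *expected,
                               const rte_int128_t *desired, bool weak,
                               int success_memorder, int failure_memorder)
{
        return __atomic_compare_exchange_n(&dst->val, &expected->val,
                                           desired->val, weak,
                                           success_memorder,
                                           failure_memorder);
}

/* Hypothetical usage: publish a 16-byte value with release semantics on
 * success and relaxed ordering on failure. */
static inline bool
publish_pair(rte_int128_t *slot, rte_int128_t *snapshot, rte_int128_t next)
{
        return rte_atomic128_compare_exchange(slot, snapshot, &next,
                                              false, /* strong */
                                              __ATOMIC_RELEASE,
                                              __ATOMIC_RELAXED);
}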