From: "Eads, Gage"
Subject: Re: [PATCH v3 6/8] stack: add C11 atomic implementation
Date: Fri, 29 Mar 2019 19:24:45 +0000
Message-ID: <9184057F7FC11744A2107296B6B8EB1E5420D940@FMSMSX108.amr.corp.intel.com>
References: <20190305164256.2367-1-gage.eads@intel.com> <20190306144559.391-1-gage.eads@intel.com> <20190306144559.391-7-gage.eads@intel.com>
To: Honnappa Nagarahalli, "dev@dpdk.org"
Cc: "olivier.matz@6wind.com", "arybchenko@solarflare.com", "Richardson, Bruce", "Ananyev, Konstantin", "Gavin Hu (Arm Technology China)", nd, "thomas@monjalon.net", nd, Thomas Monjalon
List-Id: DPDK patches and discussions

[snip]

> > +static __rte_always_inline void
> > +__rte_stack_lf_push(struct rte_stack_lf_list *list,
> > +		    struct rte_stack_lf_elem *first,
> > +		    struct rte_stack_lf_elem *last,
> > +		    unsigned int num)
> > +{
> > +#ifndef RTE_ARCH_X86_64
> > +	RTE_SET_USED(first);
> > +	RTE_SET_USED(last);
> > +	RTE_SET_USED(list);
> > +	RTE_SET_USED(num);
> > +#else
> > +	struct rte_stack_lf_head old_head;
> > +	int success;
> > +
> > +	old_head = list->head;
> This can be a torn read (same as you have mentioned in
> __rte_stack_lf_pop). I suggest we use an acquire thread fence here as well
> (please see the comments in __rte_stack_lf_pop).

Agreed. I'll add the acquire fence.

> > +
> > +	do {
> > +		struct rte_stack_lf_head new_head;
> > +
> We can add __atomic_thread_fence(__ATOMIC_ACQUIRE) here (please see
> the comments in __rte_stack_lf_pop).

Will add the fence here.

> > +		/* Swing the top pointer to the first element in the list and
> > +		 * make the last element point to the old top.
> > +		 */
> > +		new_head.top = first;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		last->next = old_head.top;
> > +
> > +		/* Use the release memmodel to ensure the writes to the LF LIFO
> > +		 * elements are visible before the head pointer write.
> > +		 */
> > +		success = rte_atomic128_cmp_exchange(
> > +				(rte_int128_t *)&list->head,
> > +				(rte_int128_t *)&old_head,
> > +				(rte_int128_t *)&new_head,
> > +				1, __ATOMIC_RELEASE,
> > +				__ATOMIC_RELAXED);
> Success memory order can be RELAXED as the store to list->len.cnt is
> RELEASE.

The RELEASE success order here ensures that the store to 'last->next' is visible before the head update. The RELEASE in the list->len.cnt store only guarantees that the preceding stores are visible before list->len.cnt's store, but doesn't guarantee any ordering between the 'last->next' store and the head update, so we can't rely on that.

> > +	} while (success == 0);
> > +
> > +	/* Ensure the stack modifications are not reordered with respect
> > +	 * to the LIFO len update.
> > +	 */
> > +	__atomic_add_fetch(&list->len.cnt, num, __ATOMIC_RELEASE);
> > +#endif
> > +}
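Putting those changes together, the push loop I have in mind for v4 looks roughly like the below. This is just a sketch against the v3 code quoted above (same rte_atomic128_cmp_exchange() call and field names); the comment wording is mine and may change in the actual patch.

	old_head = list->head;

	do {
		struct rte_stack_lf_head new_head;

		/* An acquire fence at the top of each iteration ensures the
		 * accesses below aren't reordered before the (possibly torn)
		 * head read above, or before the head value returned by a
		 * failed CAS on the previous iteration.
		 */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		/* Swing the top pointer to the first element in the list and
		 * make the last element point to the old top.
		 */
		new_head.top = first;
		new_head.cnt = old_head.cnt + 1;

		last->next = old_head.top;

		/* Keep RELEASE on success so the 'last->next' store is
		 * visible before the head update; RELAXED on failure is fine
		 * since the fence above handles the ordering on retry.
		 */
		success = rte_atomic128_cmp_exchange(
				(rte_int128_t *)&list->head,
				(rte_int128_t *)&old_head,
				(rte_int128_t *)&new_head,
				1, __ATOMIC_RELEASE,
				__ATOMIC_RELAXED);
	} while (success == 0);

That is, the initial list->head read stays a plain, possibly torn read; a stale or torn value just causes the CAS to fail and refresh old_head.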
> > +
> > +static __rte_always_inline struct rte_stack_lf_elem *
> > +__rte_stack_lf_pop(struct rte_stack_lf_list *list,
> > +		   unsigned int num,
> > +		   void **obj_table,
> > +		   struct rte_stack_lf_elem **last)
> > +{
> > +#ifndef RTE_ARCH_X86_64
> > +	RTE_SET_USED(obj_table);
> > +	RTE_SET_USED(last);
> > +	RTE_SET_USED(list);
> > +	RTE_SET_USED(num);
> > +
> > +	return NULL;
> > +#else
> > +	struct rte_stack_lf_head old_head;
> > +	int success;
> > +
> > +	/* Reserve num elements, if available */
> > +	while (1) {
> > +		uint64_t len = __atomic_load_n(&list->len.cnt,
> > +					       __ATOMIC_ACQUIRE);
> This can be outside the loop.

Good idea. I'll move this out of the loop and change the atomic_compare_exchange_n's failure memory order to ACQUIRE.

> > +
> > +		/* Does the list contain enough elements? */
> > +		if (unlikely(len < num))
> > +			return NULL;
> > +
> > +		if (__atomic_compare_exchange_n(&list->len.cnt,
> > +						&len, len - num,
> > +						0, __ATOMIC_RELAXED,
> > +						__ATOMIC_RELAXED))
> > +			break;
> > +	}
> > +
> > +#ifndef RTE_ARCH_X86_64
> > +	/* Use the acquire memmodel to ensure the reads to the LF LIFO elements
> > +	 * are properly ordered with respect to the head pointer read.
> > +	 *
> > +	 * Note that for aarch64, GCC's implementation of __atomic_load_16 in
> > +	 * libatomic uses locks, and so this function should be replaced by
> > +	 * a new function (e.g. "rte_atomic128_load()").
> aarch64 does not have 'pure' atomic 128b load instructions. They have to be
> implemented using load/store exclusives.
>
> > +	 */
> > +	__atomic_load((volatile __int128 *)&list->head,
> > +		      &old_head,
> > +		      __ATOMIC_ACQUIRE);
> Since we know of x86/aarch64 (power?) that cannot implement pure atomic
> 128b loads, should we just use relaxed reads and assume the possibility of
> torn reads for all architectures? Then we can use an acquire fence to prevent
> the reordering (see below).

That's a cleaner solution. I'll implement that, dropping the architecture distinction.
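Concretely, the reservation and head read would then look something like the below. Again, this is just a sketch against the v3 naming; one wrinkle is that, because the failure memory order can't be stronger than the success order, moving the failure order to ACQUIRE means bumping the success order to ACQUIRE as well.

	/* Reserve num elements, if available. Load hoisted out of the loop;
	 * a failed CAS below refreshes 'len' with the latest value.
	 */
	uint64_t len = __atomic_load_n(&list->len.cnt, __ATOMIC_ACQUIRE);

	while (1) {
		/* Does the list contain enough elements? */
		if (unlikely(len < num))
			return NULL;

		/* ACQUIRE on failure so the refreshed 'len' carries the same
		 * ordering as the initial load; ACQUIRE on success too, since
		 * the failure order can't be stronger than the success order.
		 */
		if (__atomic_compare_exchange_n(&list->len.cnt,
						&len, len - num,
						0, __ATOMIC_ACQUIRE,
						__ATOMIC_ACQUIRE))
			break;
	}

	/* One plain (possibly torn) head read for all architectures; the
	 * acquire fence inside the pop loop below provides the ordering, and
	 * a torn value simply causes the 128-bit CAS to fail and retry with
	 * the latest head.
	 */
	old_head = list->head;

That also lets us drop the RTE_ARCH_X86_64 #ifdef around the head read.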
> > +#else
> > +	/* x86-64 does not require an atomic load here; if a torn read occurs,
> IMO, we should not make architecture-specific distinctions as this algorithm is
> based on the C11 memory model.
>
> > +	 * the CAS will fail and set old_head to the correct/latest value.
> > +	 */
> > +	old_head = list->head;
> > +#endif
> > +
> > +	/* Pop num elements */
> > +	do {
> > +		struct rte_stack_lf_head new_head;
> > +		struct rte_stack_lf_elem *tmp;
> > +		unsigned int i;
> > +
> We can add __atomic_thread_fence(__ATOMIC_ACQUIRE) here.

Will do.

> > +		rte_prefetch0(old_head.top);
> > +
> > +		tmp = old_head.top;
> > +
> > +		/* Traverse the list to find the new head. A next pointer will
> > +		 * either point to another element or NULL; if a thread
> > +		 * encounters a pointer that has already been popped, the CAS
> > +		 * will fail.
> > +		 */
> > +		for (i = 0; i < num && tmp != NULL; i++) {
> > +			rte_prefetch0(tmp->next);
> > +			if (obj_table)
> > +				obj_table[i] = tmp->data;
> > +			if (last)
> > +				*last = tmp;
> > +			tmp = tmp->next;
> > +		}
> > +
> > +		/* If NULL was encountered, the list was modified while
> > +		 * traversing it.
> > +		 */
> > +		if (i != num)
> > +			continue;
> > +
> > +		new_head.top = tmp;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		success = rte_atomic128_cmp_exchange(
> > +				(rte_int128_t *)&list->head,
> > +				(rte_int128_t *)&old_head,
> > +				(rte_int128_t *)&new_head,
> > +				1, __ATOMIC_ACQUIRE,
> > +				__ATOMIC_ACQUIRE);
> The success order should be __ATOMIC_RELEASE as the write to list->len.cnt
> is relaxed.
> The failure order can be __ATOMIC_RELAXED if the thread fence is added.

Agreed on both counts. Will address; see the sketch at the end of this mail.

> > +	} while (success == 0);
> > +
> > +	return old_head.top;
> > +#endif
> > +}
> > +
> > +#endif /* _RTE_STACK_C11_MEM_H_ */
> > --
> > 2.13.6
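For completeness, with the fence and the RELEASE/RELAXED orders discussed above, I expect the revised pop loop to look roughly like this. Same caveat as before: a sketch against the v3 names, not the final v4 code.

	/* Pop num elements */
	do {
		struct rte_stack_lf_head new_head;
		struct rte_stack_lf_elem *tmp;
		unsigned int i;

		/* An acquire fence at the top of each iteration orders the
		 * list traversal below against the (possibly torn) head read
		 * and against the head value returned by a failed CAS.
		 */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		rte_prefetch0(old_head.top);

		tmp = old_head.top;

		/* Traverse the list to find the new head. A next pointer will
		 * either point to another element or NULL; if a thread
		 * encounters a pointer that has already been popped, the CAS
		 * will fail.
		 */
		for (i = 0; i < num && tmp != NULL; i++) {
			rte_prefetch0(tmp->next);
			if (obj_table)
				obj_table[i] = tmp->data;
			if (last)
				*last = tmp;
			tmp = tmp->next;
		}

		/* If NULL was encountered, the list was modified while
		 * traversing it. Retry.
		 */
		if (i != num)
			continue;

		new_head.top = tmp;
		new_head.cnt = old_head.cnt + 1;

		/* RELEASE on success since the list->len.cnt update does not
		 * provide release ordering; RELAXED on failure since the
		 * fence above handles the ordering on retry.
		 */
		success = rte_atomic128_cmp_exchange(
				(rte_int128_t *)&list->head,
				(rte_int128_t *)&old_head,
				(rte_int128_t *)&new_head,
				1, __ATOMIC_RELEASE,
				__ATOMIC_RELAXED);
	} while (success == 0);

	return old_head.top;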