From: Honnappa Nagarahalli
Subject: Re: [PATCH v3 6/8] stack: add C11 atomic implementation
Date: Thu, 28 Mar 2019 23:27:05 +0000
To: Gage Eads <gage.eads@intel.com>, dev@dpdk.org
Cc: olivier.matz@6wind.com, arybchenko@solarflare.com,
 bruce.richardson@intel.com, konstantin.ananyev@intel.com,
 Gavin Hu (Arm Technology China), thomas@monjalon.net, nd
References: <20190305164256.2367-1-gage.eads@intel.com>
 <20190306144559.391-1-gage.eads@intel.com>
 <20190306144559.391-7-gage.eads@intel.com>
In-Reply-To: <20190306144559.391-7-gage.eads@intel.com>
List-Id: DPDK patches and discussions

>
> This commit adds an implementation of the lock-free stack push, pop, and
> length functions that use __atomic builtins, for systems that benefit from
> the finer-grained memory ordering control.
>
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  lib/librte_stack/Makefile            |   3 +-
>  lib/librte_stack/meson.build         |   3 +-
>  lib/librte_stack/rte_stack.h         |   4 +
>  lib/librte_stack/rte_stack_c11_mem.h | 175 +++++++++++++++++++++++++++++++++++
>  4 files changed, 183 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_stack/rte_stack_c11_mem.h
>
> diff --git a/lib/librte_stack/Makefile b/lib/librte_stack/Makefile
> index 3ecddf033..94a7c1476 100644
> --- a/lib/librte_stack/Makefile
> +++ b/lib/librte_stack/Makefile
> @@ -19,6 +19,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_STACK) := rte_stack.c
>
>  # install includes
>  SYMLINK-$(CONFIG_RTE_LIBRTE_STACK)-include := rte_stack.h \
> -					      rte_stack_generic.h
> +					      rte_stack_generic.h \
> +					      rte_stack_c11_mem.h
>
>  include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_stack/meson.build b/lib/librte_stack/meson.build
> index 99d7f9ec5..7e2d1dbb8 100644
> --- a/lib/librte_stack/meson.build
> +++ b/lib/librte_stack/meson.build
> @@ -6,4 +6,5 @@ allow_experimental_apis = true
>  version = 1
>  sources = files('rte_stack.c')
>  headers = files('rte_stack.h',
> -		'rte_stack_generic.h')
> +		'rte_stack_generic.h',
> +		'rte_stack_c11_mem.h')
> diff --git a/lib/librte_stack/rte_stack.h b/lib/librte_stack/rte_stack.h
> index b484313bb..de16f8fff 100644
> --- a/lib/librte_stack/rte_stack.h
> +++ b/lib/librte_stack/rte_stack.h
> @@ -91,7 +91,11 @@ struct rte_stack {
>   */
>  #define RTE_STACK_F_LF 0x0001
>
> +#ifdef RTE_USE_C11_MEM_MODEL
> +#include "rte_stack_c11_mem.h"
> +#else
>  #include "rte_stack_generic.h"
> +#endif
>
>  /**
>   * @internal Push several objects on the lock-free stack (MT-safe).
> diff --git a/lib/librte_stack/rte_stack_c11_mem.h b/lib/librte_stack/rte_stack_c11_mem.h
> new file mode 100644
> index 000000000..44f9ece6e
> --- /dev/null
> +++ b/lib/librte_stack/rte_stack_c11_mem.h
> @@ -0,0 +1,175 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#ifndef _RTE_STACK_C11_MEM_H_
> +#define _RTE_STACK_C11_MEM_H_
> +
> +#include <rte_branch_prediction.h>
> +#include <rte_prefetch.h>
> +
> +static __rte_always_inline unsigned int
> +rte_stack_lf_len(struct rte_stack *s)
> +{
> +	/* stack_lf_push() and stack_lf_pop() do not update the list's contents
> +	 * and stack_lf->len atomically, which can cause the list to appear
> +	 * shorter than it actually is if this function is called while other
> +	 * threads are modifying the list.
> +	 *
> +	 * However, given the inherently approximate nature of the get_count
> +	 * callback -- even if the list and its size were updated atomically,
> +	 * the size could change between when get_count executes and when the
> +	 * value is returned to the caller -- this is acceptable.
> +	 *
> +	 * The stack_lf->len updates are placed such that the list may appear to
> +	 * have fewer elements than it does, but will never appear to have more
> +	 * elements. If the mempool is near-empty to the point that this is a
> +	 * concern, the user should consider increasing the mempool size.
> +	 */
> +	return (unsigned int)__atomic_load_n(&s->stack_lf.used.len.cnt,
> +					     __ATOMIC_RELAXED);
> +}
> +
> +static __rte_always_inline void
> +__rte_stack_lf_push(struct rte_stack_lf_list *list,
> +		    struct rte_stack_lf_elem *first,
> +		    struct rte_stack_lf_elem *last,
> +		    unsigned int num)
> +{
> +#ifndef RTE_ARCH_X86_64
> +	RTE_SET_USED(first);
> +	RTE_SET_USED(last);
> +	RTE_SET_USED(list);
> +	RTE_SET_USED(num);
> +#else
> +	struct rte_stack_lf_head old_head;
> +	int success;
> +
> +	old_head = list->head;

This can be a torn read (the same issue you mention in __rte_stack_lf_pop). I
suggest we use an acquire thread fence here as well (please see the comments
in __rte_stack_lf_pop).

> +
> +	do {
> +		struct rte_stack_lf_head new_head;
> +

We can add __atomic_thread_fence(__ATOMIC_ACQUIRE) here (please see the
comments in __rte_stack_lf_pop).

> +		/* Swing the top pointer to the first element in the list and
> +		 * make the last element point to the old top.
> +		 */
> +		new_head.top = first;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		last->next = old_head.top;
> +
> +		/* Use the release memmodel to ensure the writes to the LF LIFO
> +		 * elements are visible before the head pointer write.
> +		 */
> +		success = rte_atomic128_cmp_exchange(
> +				(rte_int128_t *)&list->head,
> +				(rte_int128_t *)&old_head,
> +				(rte_int128_t *)&new_head,
> +				1, __ATOMIC_RELEASE,
> +				__ATOMIC_RELAXED);

The success memory order can be RELAXED, since the store to list->len.cnt
below is RELEASE.

> +	} while (success == 0);
> +
> +	/* Ensure the stack modifications are not reordered with respect
> +	 * to the LIFO len update.
> +	 */
> +	__atomic_add_fetch(&list->len.cnt, num, __ATOMIC_RELEASE);
> +#endif
> +}
> +
> +static __rte_always_inline struct rte_stack_lf_elem *
> +__rte_stack_lf_pop(struct rte_stack_lf_list *list,
> +		   unsigned int num,
> +		   void **obj_table,
> +		   struct rte_stack_lf_elem **last)
> +{
> +#ifndef RTE_ARCH_X86_64
> +	RTE_SET_USED(obj_table);
> +	RTE_SET_USED(last);
> +	RTE_SET_USED(list);
> +	RTE_SET_USED(num);
> +
> +	return NULL;
> +#else
> +	struct rte_stack_lf_head old_head;
> +	int success;
> +
> +	/* Reserve num elements, if available */
> +	while (1) {
> +		uint64_t len = __atomic_load_n(&list->len.cnt,
> +					       __ATOMIC_ACQUIRE);

This load can be moved outside the loop; on failure, the compare-exchange
below already refreshes len with the current value.

> +
> +		/* Does the list contain enough elements? */
> +		if (unlikely(len < num))
> +			return NULL;
> +
> +		if (__atomic_compare_exchange_n(&list->len.cnt,
> +						&len, len - num,
> +						0, __ATOMIC_RELAXED,
> +						__ATOMIC_RELAXED))
> +			break;
> +	}
> +
> +#ifndef RTE_ARCH_X86_64
> +	/* Use the acquire memmodel to ensure the reads to the LF LIFO elements
> +	 * are properly ordered with respect to the head pointer read.
> +	 *
> +	 * Note that for aarch64, GCC's implementation of __atomic_load_16 in
> +	 * libatomic uses locks, and so this function should be replaced by
> +	 * a new function (e.g. "rte_atomic128_load()").

aarch64 does not have 'pure' atomic 128b load instructions; they have to be
implemented using load/store exclusives.

> +	 */
> +	__atomic_load((volatile __int128 *)&list->head,
> +		      &old_head,
> +		      __ATOMIC_ACQUIRE);

Since we know of architectures (x86, aarch64, possibly Power) that cannot
implement a pure atomic 128b load, should we just use relaxed reads and
accept the possibility of torn reads on all architectures? Then we can use
an acquire fence to prevent the reordering (see below).

> +#else
> +	/* x86-64 does not require an atomic load here; if a torn read occurs,

IMO, we should not make architecture-specific distinctions, as this algorithm
is based on the C11 memory model.

> +	 * the CAS will fail and set old_head to the correct/latest value.
> +	 */
> +	old_head = list->head;
> +#endif
> +
> +	/* Pop num elements */
> +	do {
> +		struct rte_stack_lf_head new_head;
> +		struct rte_stack_lf_elem *tmp;
> +		unsigned int i;
> +

We can add __atomic_thread_fence(__ATOMIC_ACQUIRE) here.

> +		rte_prefetch0(old_head.top);
> +
> +		tmp = old_head.top;
> +
> +		/* Traverse the list to find the new head. A next pointer will
> +		 * either point to another element or NULL; if a thread
> +		 * encounters a pointer that has already been popped, the CAS
> +		 * will fail.
> +		 */
> +		for (i = 0; i < num && tmp != NULL; i++) {
> +			rte_prefetch0(tmp->next);
> +			if (obj_table)
> +				obj_table[i] = tmp->data;
> +			if (last)
> +				*last = tmp;
> +			tmp = tmp->next;
> +		}
> +
> +		/* If NULL was encountered, the list was modified while
> +		 * traversing it. Retry.
> +		 */
> +		if (i != num)
> +			continue;
> +
> +		new_head.top = tmp;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		success = rte_atomic128_cmp_exchange(
> +				(rte_int128_t *)&list->head,
> +				(rte_int128_t *)&old_head,
> +				(rte_int128_t *)&new_head,
> +				1, __ATOMIC_ACQUIRE,
> +				__ATOMIC_ACQUIRE);

The success order should be __ATOMIC_RELEASE, as the write to list->len.cnt
is relaxed. The failure order can be __ATOMIC_RELAXED if the thread fence is
added.

> +	} while (success == 0);
> +
> +	return old_head.top;
> +#endif
> +}
> +
> +#endif /* _RTE_STACK_C11_MEM_H_ */
> --
> 2.13.6
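
To make the push-side suggestions concrete, below is a sketch of
__rte_stack_lf_push() rewritten with the orderings proposed in this review:
a plain (possibly torn) read of list->head, an acquire thread fence at the
top of the retry loop, and a relaxed success order on the 128-bit CAS,
relying on the RELEASE update of list->len.cnt to publish the new elements.
It is only a sketch, not the actual patch: the function name
__rte_stack_lf_push_c11_sketch is made up, the RTE_ARCH_X86_64 guard is
omitted, and it assumes the rte_stack_lf_* types and
rte_atomic128_cmp_exchange() from the quoted patch are in scope.

/* Sketch only, per the review comments above; not the submitted code. */
static __rte_always_inline void
__rte_stack_lf_push_c11_sketch(struct rte_stack_lf_list *list,
			       struct rte_stack_lf_elem *first,
			       struct rte_stack_lf_elem *last,
			       unsigned int num)
{
	struct rte_stack_lf_head old_head;
	int success;

	/* A torn read is tolerable: if it happens, the CAS below fails
	 * and refreshes old_head with the current value.
	 */
	old_head = list->head;

	do {
		struct rte_stack_lf_head new_head;

		/* Order the (re)read of the head against the element
		 * writes below, instead of relying on an atomic load.
		 */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		/* Swing the top pointer to the first element and make the
		 * last element point to the old top.
		 */
		new_head.top = first;
		new_head.cnt = old_head.cnt + 1;

		last->next = old_head.top;

		/* RELAXED success order, as suggested: the RELEASE
		 * add_fetch on len.cnt below publishes the elements.
		 */
		success = rte_atomic128_cmp_exchange(
				(rte_int128_t *)&list->head,
				(rte_int128_t *)&old_head,
				(rte_int128_t *)&new_head,
				1, __ATOMIC_RELAXED,
				__ATOMIC_RELAXED);
	} while (success == 0);

	/* Make the pushed elements visible to consumers that observe
	 * the larger length.
	 */
	__atomic_add_fetch(&list->len.cnt, num, __ATOMIC_RELEASE);
}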
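
Similarly, a sketch of the pop side with the review's suggestions applied:
one acquire load of len.cnt before the reservation loop (a failed CAS
refreshes it), a plain read of list->head on every architecture, an acquire
thread fence at the top of the retry loop, and RELEASE/RELAXED success and
failure orders on the 128-bit CAS. Again this is illustrative only: the name
__rte_stack_lf_pop_c11_sketch is made up, and the prefetches and non-x86
fallback from the patch are omitted for brevity.

/* Sketch only, per the review comments above; not the submitted code. */
static __rte_always_inline struct rte_stack_lf_elem *
__rte_stack_lf_pop_c11_sketch(struct rte_stack_lf_list *list,
			      unsigned int num,
			      void **obj_table,
			      struct rte_stack_lf_elem **last)
{
	struct rte_stack_lf_head old_head;
	uint64_t len = __atomic_load_n(&list->len.cnt, __ATOMIC_ACQUIRE);
	int success;

	/* Reserve num elements; a failed CAS writes the current count
	 * back into len, so the load above need not be repeated.
	 */
	do {
		if (unlikely(len < num))
			return NULL;
	} while (!__atomic_compare_exchange_n(&list->len.cnt, &len,
					      len - num, 0,
					      __ATOMIC_RELAXED,
					      __ATOMIC_RELAXED));

	/* A torn read is tolerable on any architecture: the CAS below
	 * fails and refreshes old_head with the current value.
	 */
	old_head = list->head;

	do {
		struct rte_stack_lf_head new_head;
		struct rte_stack_lf_elem *tmp;
		unsigned int i;

		/* Order the head (re)read against the element reads. */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		tmp = old_head.top;

		/* Walk num elements to find the new head. */
		for (i = 0; i < num && tmp != NULL; i++) {
			if (obj_table)
				obj_table[i] = tmp->data;
			if (last)
				*last = tmp;
			tmp = tmp->next;
		}

		/* The list changed while traversing it; retry. */
		if (i != num)
			continue;

		new_head.top = tmp;
		new_head.cnt = old_head.cnt + 1;

		/* RELEASE on success (the len.cnt update was relaxed);
		 * RELAXED on failure, relying on the fence above.
		 */
		success = rte_atomic128_cmp_exchange(
				(rte_int128_t *)&list->head,
				(rte_int128_t *)&old_head,
				(rte_int128_t *)&new_head,
				1, __ATOMIC_RELEASE,
				__ATOMIC_RELAXED);
	} while (success == 0);

	return old_head.top;
}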