All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Eads, Gage" <gage.eads@intel.com>
To: Ola Liljedahl <Ola.Liljedahl@arm.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "olivier.matz@6wind.com" <olivier.matz@6wind.com>,
	"stephen@networkplumber.org" <stephen@networkplumber.org>,
	nd <nd@arm.com>, "Richardson, Bruce" <bruce.richardson@intel.com>,
	"arybchenko@solarflare.com" <arybchenko@solarflare.com>,
	"Ananyev, Konstantin" <konstantin.ananyev@intel.com>
Subject: Re: [PATCH v3 2/5] ring: add a non-blocking implementation
Date: Tue, 22 Jan 2019 21:31:27 +0000	[thread overview]
Message-ID: <9184057F7FC11744A2107296B6B8EB1E541CA46E@FMSMSX108.amr.corp.intel.com> (raw)
In-Reply-To: <1548168583.31150.32.camel@arm.com>

Hi Ola,

<snip>

> > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > rte_ring *r);
> >  #endif
> >  #include "rte_ring_generic_64.h"
> >
> > +/* @internal 128-bit structure used by the non-blocking ring */
> > +struct nb_ring_entry {
> > +	void *ptr; /**< Data pointer */
> > +	uint64_t cnt; /**< Modification counter */
> Why not make 'cnt' uintptr_t? This way 32-bit architectures will also be
> supported. I think there are some claims that DPDK still supports e.g. ARMv7a
> and possibly also 32-bit x86?

I chose a 64-bit modification counter because (practically speaking) the ABA problem will not occur with such a large counter -- definitely not within my lifetime. See the "Discussion" section of the commit message for more information.

With a 32-bit counter, there is a very (very) low likelihood of it, but it is possible. Personally, I don't feel comfortable providing such code, because a) I doubt all users would understand the implementation well enough to do the risk/reward analysis, and b) such a bug would be near impossible to reproduce and root-cause if it did occur.

> 
> > +};
> > +
> > +/* The non-blocking ring algorithm is based on the original rte ring
> > +(derived
> > + * from FreeBSD's bufring.h) and inspired by Michael and Scott's
> > +non-blocking
> > + * concurrent queue.
> > + */
> > +
> > +/**
> > + * @internal
> > + *   Enqueue several objects on the non-blocking ring
> > +(single-producer only)
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param n
> > + *   The number of objects to add in the ring from the obj_table.
> > + * @param behavior
> > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> > +ring
> > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> > +the ring
> > + * @param free_space
> > + *   returns the amount of space after the enqueue operation has
> > +finished
> > + * @return
> > + *   Actual number of objects enqueued.
> > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > + */
> > +static __rte_always_inline unsigned int
> > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> > +			    unsigned int n,
> > +			    enum rte_ring_queue_behavior behavior,
> > +			    unsigned int *free_space)
> > +{
> > +	uint32_t free_entries;
> > +	size_t head, next;
> > +
> > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > +					 &head, &next, &free_entries);
> > +	if (n == 0)
> > +		goto end;
> > +
> > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > +
> > +	r->prod_64.tail += n;
> Don't we need release order when (or smp_wmb between) writing of the ring
> pointers and the update of tail? By updating the tail pointer, we are
> synchronising with a consumer.
> 
> I prefer using __atomic operations even for load and store. You can see which
> parts of the code that synchronise with each other, e.g. store-release to some
> location synchronises with load-acquire from the same location. If you don't
> know how different threads synchronise with each other, you are very likely to
> make mistakes.
> 

You can tell this code was written when I thought x86-64 was the only viable target :). Yes, you are correct.

With regards to using __atomic intrinsics, I'm planning on taking a similar approach to the functions duplicated in rte_ring_generic.h and rte_ring_c11_mem.h: one version that uses rte_atomic functions (and thus stricter memory ordering) and one that uses __atomic intrinsics (and thus can benefit from more relaxed memory ordering).

> > +
> > +end:
> > +	if (free_space != NULL)
> > +		*free_space = free_entries - n;
> > +	return n;
> > +}
> > +
> > +/**
> > + * @internal
> > + *   Enqueue several objects on the non-blocking ring (multi-producer
> > +safe)
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param n
> > + *   The number of objects to add in the ring from the obj_table.
> > + * @param behavior
> > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> > +ring
> > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> > +the ring
> > + * @param free_space
> > + *   returns the amount of space after the enqueue operation has
> > +finished
> > + * @return
> > + *   Actual number of objects enqueued.
> > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > + */
> > +static __rte_always_inline unsigned int
> > +__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
> > +			    unsigned int n,
> > +			    enum rte_ring_queue_behavior behavior,
> > +			    unsigned int *free_space)
> > +{
> > +#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
> > +	RTE_SET_USED(r);
> > +	RTE_SET_USED(obj_table);
> > +	RTE_SET_USED(n);
> > +	RTE_SET_USED(behavior);
> > +	RTE_SET_USED(free_space);
> > +#ifndef ALLOW_EXPERIMENTAL_API
> > +	printf("[%s()] RING_F_NB requires an experimental API."
> > +	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
> > +	       , __func__);
> > +#endif
> > +	return 0;
> > +#endif
> > +#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
> > +	size_t head, next, tail;
> > +	uint32_t free_entries;
> > +	unsigned int i;
> > +
> > +	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
> > +					 &head, &next, &free_entries);
> > +	if (n == 0)
> > +		goto end;
> > +
> > +	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
> > +		struct nb_ring_entry old_value, new_value;
> > +		struct nb_ring_entry *ring_ptr;
> > +
> > +		/* Enqueue to the tail entry. If another thread wins the
> > race,
> > +		 * retry with the new tail.
> > +		 */
> > +		tail = r->prod_64.tail;
> > +
> > +		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r->mask];
> This is an ugly expression and cast. Also I think it is unnecessary. What's
> preventing this from being written without a cast? Perhaps the ring array needs
> to be a union of "void *" and struct nb_ring_entry?

The cast is necessary for the correct pointer arithmetic (let "uintptr_t base == &r[1]"):
- With cast: ring_ptr = base + sizeof(struct nb_ring_entry) * (tail & r->mask);
- W/o cast: ring_ptr = base + sizeof(struct rte_ring) * (tail & r->mask);

FWIW, this is essentially the same as is done with the second argument (&r[1]) to ENQUEUE_PTRS and DEQUEUE_PTRS, but there it's split across multiple lines of code. The equivalent here would be:
 
struct nb_ring_entry *ring_base = (struct nb_ring_entry*)&r[1];
ring_ptr = ring_base[tail & r->mask];

Which is more legible, I think.

There is no ring array structure in which to add a union; the ring array is a contiguous chunk of memory that immediately follows after the end of a struct rte_ring. We interpret the memory there according to the ring entry data type (void * for regular rings and struct nb_ring_entry for non-blocking rings).

> 
> > +
> > +		old_value = *ring_ptr;
> > +
> > +		/* If the tail entry's modification counter doesn't match the
> > +		 * producer tail index, it's already been updated.
> > +		 */
> > +		if (old_value.cnt != tail)
> > +			continue;
> Continue restarts the loop at the condition test in the for statement, 'i' and 'n'
> are unchanged. Then we re-read 'prod_64.tail' and 'ring[tail & mask]'. If some
> other thread never updates 'prod_64.tail', the test here (ring[tail].cnt != tail) will
> still be true and we will spin forever.
> 
> Waiting for other threads <=> blocking behaviour so this is not a non- blocking
> design.
> 

You're absolutely right. The if-statement was added as optimization to avoid 128-bit cmpset operations that are known to fail, but in this form it violates the non-blocking design.

I see two solutions: 1) drop the if-statement altogether, or 2) attempt to update prod_64.tail before continuing. Both require every thread to attempt to update prod_64.tail on every iteration, but #2 will result in fewer failed 128-bit cmpsets.

> > +
> > +		/* Prepare the new entry. The cnt field mitigates the ABA
> > +		 * problem on the ring write.
> > +		 */
> > +		new_value.ptr = obj_table[i];
> > +		new_value.cnt = tail + r->size;
> > +
> > +		if (rte_atomic128_cmpset((volatile rte_int128_t *)ring_ptr,
> > +					 (rte_int128_t *)&old_value,
> > +					 (rte_int128_t *)&new_value))
> > +			i++;
> > +
> > +		/* Every thread attempts the cmpset, so they don't have to
> > wait
> > +		 * for the thread that successfully enqueued to the ring.
> > +		 * Using a 64-bit tail mitigates the ABA problem here.
> > +		 *
> > +		 * Built-in used to handle variable-sized tail index.
> > +		 */
> But prod_64.tail is 64 bits so not really variable size?
> 

(See next comment)

> > +		__sync_bool_compare_and_swap(&r->prod_64.tail, tail, tail +
> > 1);
> What memory order is required here? Why not use
> __atomic_compare_exchange() with explicit memory order parameters?
> 

This is an artifact from an older patchset that used uintptr_t, and before I learned that other platforms could support this non-blocking algorithm (hence the __sync type intrinsic was sufficient).

At any rate, as described earlier in this response, I plan on writing these functions using the __atomic builtins in the next patchset.

> > +	}
> > +
> > +end:
> > +	if (free_space != NULL)
> > +		*free_space = free_entries - n;
> > +	return n;
> > +#endif
> > +}
> > +
> > +/**
> > + * @internal Enqueue several objects on the non-blocking ring
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param n
> > + *   The number of objects to add in the ring from the obj_table.
> > + * @param behavior
> > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> > +ring
> > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> > +the ring
> > + * @param is_sp
> > + *   Indicates whether to use single producer or multi-producer head
> > +update
> > + * @param free_space
> > + *   returns the amount of space after the enqueue operation has
> > +finished
> > + * @return
> > + *   Actual number of objects enqueued.
> > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > + */
> > +static __rte_always_inline unsigned int
> > +__rte_ring_do_nb_enqueue(struct rte_ring *r, void * const *obj_table,
> > +			 unsigned int n, enum rte_ring_queue_behavior
> > behavior,
> > +			 unsigned int is_sp, unsigned int *free_space) {
> > +	if (is_sp)
> > +		return __rte_ring_do_nb_enqueue_sp(r, obj_table, n,
> > +						   behavior, free_space);
> > +	else
> > +		return __rte_ring_do_nb_enqueue_mp(r, obj_table, n,
> > +						   behavior, free_space);
> > +}
> > +
> > +/**
> > + * @internal
> > + *   Dequeue several objects from the non-blocking ring
> > +(single-consumer
> > only)
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param n
> > + *   The number of objects to pull from the ring.
> > + * @param behavior
> > + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from
> > + the ring
> > + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from
> > + the ring
> > + * @param available
> > + *   returns the number of remaining ring entries after the dequeue
> > + has
> > finished
> > + * @return
> > + *   - Actual number of objects dequeued.
> > + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > + */
> > +static __rte_always_inline unsigned int
> > +__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
> > +			    unsigned int n,
> > +			    enum rte_ring_queue_behavior behavior,
> > +			    unsigned int *available)
> > +{
> > +	size_t head, next;
> > +	uint32_t entries;
> > +
> > +	n = __rte_ring_move_cons_head_64(r, 1, n, behavior,
> > +					 &head, &next, &entries);
> > +	if (n == 0)
> > +		goto end;
> > +
> > +	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > +
> > +	r->cons_64.tail += n;
> Memory ordering? Consumer synchronises with producer.
> 

Agreed, that is missing here. Will fix.

Thanks,
Gage

> --
> Ola Liljedahl, Networking System Architect, Arm Phone +46706866373, Skype
> ola.liljedahl


  reply	other threads:[~2019-01-22 21:31 UTC|newest]

Thread overview: 102+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-10 21:01 [PATCH 0/6] Add non-blocking ring Gage Eads
2019-01-10 21:01 ` [PATCH 1/6] ring: change head and tail to pointer-width size Gage Eads
2019-01-11  4:38   ` Stephen Hemminger
2019-01-11 19:07     ` Eads, Gage
2019-01-11 10:25   ` Burakov, Anatoly
2019-01-11 19:12     ` Eads, Gage
2019-01-11 19:55       ` Stephen Hemminger
2019-01-15 15:48         ` Eads, Gage
2019-01-11 10:40   ` Burakov, Anatoly
2019-01-11 10:58     ` Bruce Richardson
2019-01-11 11:30       ` Burakov, Anatoly
     [not found]         ` <20190111115851.GC3336@bricha3-MOBL.ger.corp.intel.com>
2019-01-11 19:27           ` Eads, Gage
2019-01-21 14:14             ` Burakov, Anatoly
2019-01-22 18:27               ` Eads, Gage
2019-01-10 21:01 ` [PATCH 2/6] ring: add a non-blocking implementation Gage Eads
2019-01-10 21:01 ` [PATCH 3/6] test_ring: add non-blocking ring autotest Gage Eads
2019-01-10 21:01 ` [PATCH 4/6] test_ring_perf: add non-blocking ring perf test Gage Eads
2019-01-10 21:01 ` [PATCH 5/6] mempool/ring: add non-blocking ring handlers Gage Eads
2019-01-13 13:43   ` Andrew Rybchenko
2019-01-10 21:01 ` [PATCH 6/6] doc: add NB ring comment to EAL "known issues" Gage Eads
2019-01-11  2:51   ` Varghese, Vipin
2019-01-11 19:30     ` Eads, Gage
2019-01-14  0:07       ` Varghese, Vipin
2019-01-15 23:52 ` [PATCH v2 0/5] Add non-blocking ring Gage Eads
2019-01-15 23:52   ` [PATCH v2 1/5] ring: change head and tail to pointer-width size Gage Eads
2019-01-15 23:52   ` [PATCH v2 2/5] ring: add a non-blocking implementation Gage Eads
2019-01-15 23:52   ` [PATCH v2 3/5] test_ring: add non-blocking ring autotest Gage Eads
2019-01-15 23:52   ` [PATCH v2 4/5] test_ring_perf: add non-blocking ring perf test Gage Eads
2019-01-15 23:52   ` [PATCH v2 5/5] mempool/ring: add non-blocking ring handlers Gage Eads
2019-01-16  0:26   ` [PATCH v2 0/5] Add non-blocking ring Stephen Hemminger
2019-01-18 15:23   ` [PATCH v3 " Gage Eads
2019-01-18 15:23     ` [PATCH v3 1/5] ring: add 64-bit headtail structure Gage Eads
2019-01-18 15:23     ` [PATCH v3 2/5] ring: add a non-blocking implementation Gage Eads
2019-01-22 10:12       ` Ola Liljedahl
2019-01-22 14:49       ` Ola Liljedahl
2019-01-22 21:31         ` Eads, Gage [this message]
2019-01-23 10:16           ` Ola Liljedahl
2019-01-25 17:21             ` Eads, Gage
2019-01-28 10:35               ` Ola Liljedahl
2019-01-28 18:54                 ` Eads, Gage
2019-01-28 22:31                   ` Ola Liljedahl
2019-01-28 13:34               ` Jerin Jacob Kollanukkaran
2019-01-28 13:43                 ` Ola Liljedahl
2019-01-28 14:04                   ` Jerin Jacob Kollanukkaran
2019-01-28 14:06                     ` Ola Liljedahl
2019-01-28 18:59                 ` Eads, Gage
2019-01-18 15:23     ` [PATCH v3 3/5] test_ring: add non-blocking ring autotest Gage Eads
2019-01-18 15:23     ` [PATCH v3 4/5] test_ring_perf: add non-blocking ring perf test Gage Eads
2019-01-18 15:23     ` [PATCH v3 5/5] mempool/ring: add non-blocking ring handlers Gage Eads
2019-01-22  9:27     ` [PATCH v3 0/5] Add non-blocking ring Ola Liljedahl
2019-01-22 10:15       ` Ola Liljedahl
2019-01-22 19:15       ` Eads, Gage
2019-01-23 16:02       ` Jerin Jacob Kollanukkaran
2019-01-23 16:29         ` Ola Liljedahl
2019-01-28 13:10           ` [EXT] " Jerin Jacob Kollanukkaran
2019-01-25  5:20     ` Honnappa Nagarahalli
2019-01-25 17:42       ` Eads, Gage
2019-01-25 17:56       ` Eads, Gage
2019-01-28 10:41         ` Ola Liljedahl
2019-01-28 18:14     ` [PATCH v4 " Gage Eads
2019-01-28 18:14       ` [PATCH v4 1/5] ring: add 64-bit headtail structure Gage Eads
2019-01-29 12:56         ` Ola Liljedahl
2019-01-30  4:26           ` Eads, Gage
2019-01-28 18:14       ` [PATCH v4 2/5] ring: add a non-blocking implementation Gage Eads
2019-01-28 18:14       ` [PATCH v4 3/5] test_ring: add non-blocking ring autotest Gage Eads
2019-01-28 18:14       ` [PATCH v4 4/5] test_ring_perf: add non-blocking ring perf test Gage Eads
2019-01-28 18:14       ` [PATCH v4 5/5] mempool/ring: add non-blocking ring handlers Gage Eads
2019-03-05 17:40       ` [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
2019-03-05 17:40         ` [PATCH v5 1/6] ring: add a pointer-width headtail structure Gage Eads
2019-03-05 17:40         ` [PATCH v5 2/6] ring: add a ring start marker Gage Eads
2019-03-05 17:40         ` [PATCH v5 3/6] ring: add a lock-free implementation Gage Eads
2019-03-05 17:40         ` [PATCH v5 4/6] test_ring: add lock-free ring autotest Gage Eads
2019-03-05 17:40         ` [PATCH v5 5/6] test_ring_perf: add lock-free ring perf test Gage Eads
2019-03-05 17:40         ` [PATCH v5 6/6] mempool/ring: add lock-free ring handlers Gage Eads
2019-03-06 15:03         ` [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
2019-03-06 15:03           ` [PATCH v6 1/6] ring: add a pointer-width headtail structure Gage Eads
2019-03-06 15:03           ` [PATCH v6 2/6] ring: add a ring start marker Gage Eads
2019-03-06 15:03           ` [PATCH v6 3/6] ring: add a lock-free implementation Gage Eads
2019-03-06 15:03           ` [PATCH v6 4/6] test_ring: add lock-free ring autotest Gage Eads
2019-03-06 15:03           ` [PATCH v6 5/6] test_ring_perf: add lock-free ring perf test Gage Eads
2019-03-06 15:03           ` [PATCH v6 6/6] mempool/ring: add lock-free ring handlers Gage Eads
2019-03-18 21:35           ` [PATCH v7 0/6] Add lock-free ring and mempool handler Gage Eads
2019-03-18 21:35             ` [PATCH v7 1/6] ring: add a pointer-width headtail structure Gage Eads
2019-03-18 21:35             ` [PATCH v7 2/6] ring: add a ring start marker Gage Eads
2019-03-18 21:35             ` [PATCH v7 3/6] ring: add a lock-free implementation Gage Eads
2019-03-19 15:50               ` Stephen Hemminger
2019-03-18 21:35             ` [PATCH v7 4/6] test_ring: add lock-free ring autotest Gage Eads
2019-03-18 21:35             ` [PATCH v7 5/6] test_ring_perf: add lock-free ring perf test Gage Eads
2019-03-18 21:35             ` [PATCH v7 6/6] mempool/ring: add lock-free ring handlers Gage Eads
2019-03-18 21:49             ` [PATCH v7 0/6] Add lock-free ring and mempool handler Eads, Gage
2019-03-19 15:51               ` Stephen Hemminger
2019-04-01 19:23                 ` Eads, Gage
2019-04-02 10:16                   ` Ola Liljedahl
2019-04-04 22:28                     ` [dpdk-dev] " Eads, Gage
2019-03-19  1:20             ` [PATCH v8 " Gage Eads
2019-03-19  1:20               ` [PATCH v8 1/6] ring: add a pointer-width headtail structure Gage Eads
2019-03-19  1:20               ` [PATCH v8 2/6] ring: add a ring start marker Gage Eads
2019-03-19  1:20               ` [PATCH v8 3/6] ring: add a lock-free implementation Gage Eads
2019-03-19  1:20               ` [PATCH v8 4/6] test_ring: add lock-free ring autotest Gage Eads
2019-03-19  1:20               ` [PATCH v8 5/6] test_ring_perf: add lock-free ring perf test Gage Eads
2019-03-19  1:20               ` [PATCH v8 6/6] mempool/ring: add lock-free ring handlers Gage Eads
2019-04-03 16:46               ` [PATCH v8 0/6] Add lock-free ring and mempool handler Thomas Monjalon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9184057F7FC11744A2107296B6B8EB1E541CA46E@FMSMSX108.amr.corp.intel.com \
    --to=gage.eads@intel.com \
    --cc=Ola.Liljedahl@arm.com \
    --cc=arybchenko@solarflare.com \
    --cc=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=konstantin.ananyev@intel.com \
    --cc=nd@arm.com \
    --cc=olivier.matz@6wind.com \
    --cc=stephen@networkplumber.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.