* [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Mathieu Desnoyers @ 2013-03-11 21:36 UTC
  To: Eric Wong
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

Ported to the Linux kernel from Userspace RCU library, at commit
108a92e5b97ee91b2b902dba2dd2e78aab42f420.

Ref: http://git.lttng.org/userspace-rcu.git

It is provided as a starting point only. Test cases should be ported
from Userspace RCU to kernel space and thoroughly run on a wide range of
architectures before considering this port production-ready.
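
For reference, here is a minimal usage sketch (type and function names
are illustrative only):

struct item {
	struct wfcq_node node;
	int data;
};

static struct wfcq_head qhead;	/* wfcq_init(&qhead, &qtail) at setup */
static struct wfcq_tail qtail;

static void produce(struct item *item)
{
	wfcq_node_init(&item->node);
	wfcq_enqueue(&qhead, &qtail, &item->node);	/* wait-free */
}

static struct item *consume(void)
{
	struct wfcq_node *node;

	/* Serialized against other dequeuers by the embedded lock. */
	node = wfcq_dequeue_blocking(&qhead, &qtail);
	return node ? container_of(node, struct item, node) : NULL;
}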

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Eric Wong <normalperson@yhbt.net>
---
 include/linux/wfcqueue.h |  642 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 642 insertions(+)

Index: linux/include/linux/wfcqueue.h
===================================================================
--- /dev/null
+++ linux/include/linux/wfcqueue.h
@@ -0,0 +1,642 @@
+#ifndef _LINUX_WFCQUEUE_H
+#define _LINUX_WFCQUEUE_H
+
+/*
+ * linux/wfcqueue.h
+ *
+ * Concurrent Queue with Wait-Free Enqueue/Blocking Dequeue
+ *
+ * Copyright 2010-2013 - Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ * Copyright 2011-2012 - Lai Jiangshan <laijs@cn.fujitsu.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <linux/bug.h>
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <linux/mutex.h>
+#include <linux/delay.h>
+#include <asm/cmpxchg.h>
+#include <asm/processor.h>
+#include <asm/barrier.h>
+
+/*
+ * Concurrent Queue with Wait-Free Enqueue/Blocking Dequeue
+ *
+ * This queue has been designed and implemented collaboratively by
+ * Mathieu Desnoyers and Lai Jiangshan. It was inspired by the
+ * half-wait-free/half-blocking queue implementation done by Paul E.
+ * McKenney.
+ *
+ * Mutual exclusion of wfcq_* / __wfcq_* API
+ *
+ * Synchronization table:
+ *
+ * The external synchronization techniques described in the API below
+ * are required between pairs marked with "X". No external
+ * synchronization is required between pairs marked with "-".
+ *
+ * Legend:
+ * [1] wfcq_enqueue
+ * [2] __wfcq_splice (destination queue)
+ * [3] __wfcq_dequeue
+ * [4] __wfcq_splice (source queue)
+ * [5] __wfcq_first
+ * [6] __wfcq_next
+ *
+ *     [1] [2] [3] [4] [5] [6]
+ * [1]  -   -   -   -   -   -
+ * [2]  -   -   -   -   -   -
+ * [3]  -   -   X   X   X   X
+ * [4]  -   -   X   -   X   X
+ * [5]  -   -   X   X   -   -
+ * [6]  -   -   X   X   -   -
+ *
+ * Mutual exclusion can be ensured by holding wfcq_dequeue_lock().
+ *
+ * For convenience, wfcq_dequeue_blocking() and
+ * wfcq_splice_blocking() hold the dequeue lock.
+ *
+ * Besides locking, mutual exclusion of dequeue, splice and iteration
+ * can be ensured by performing all of those operations from a single
+ * thread, without requiring any lock.
+ */
+
+/*
+ * Load data from shared memory.
+ */
+#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)
+
+/*
+ * Store a value into shared memory.
+ */
+#define CMM_STORE_SHARED(x, v)		({ ACCESS_ONCE(x) = (v); })
+
+#define WFCQ_WOULDBLOCK		((void *) -1UL)
+#define WFCQ_ADAPT_ATTEMPTS		10	/* Busy-wait attempts before sleeping */
+#define WFCQ_WAIT			10	/* Sleep duration between attempts (ms) */
+
+enum wfcq_ret {
+	WFCQ_RET_WOULDBLOCK =	-1,
+	WFCQ_RET_DEST_EMPTY =	0,
+	WFCQ_RET_DEST_NON_EMPTY =	1,
+	WFCQ_RET_SRC_EMPTY =	2,
+};
+
+struct wfcq_node {
+	struct wfcq_node *next;
+};
+
+/*
+ * Do not put head and tail on the same cache-line if concurrent
+ * enqueue/dequeue are expected from many CPUs. This eliminates
+ * false-sharing between enqueue and dequeue.
+ */
+struct wfcq_head {
+	struct wfcq_node node;
+	struct mutex lock;
+};
+
+struct wfcq_tail {
+	struct wfcq_node *p;
+};
+
+/*
+ * wfcq_node_init: initialize wait-free queue node.
+ */
+static inline void wfcq_node_init(struct wfcq_node *node)
+{
+	node->next = NULL;
+}
+
+/*
+ * wfcq_init: initialize wait-free queue.
+ */
+static inline void wfcq_init(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	/* Set queue head and tail */
+	wfcq_node_init(&head->node);
+	tail->p = &head->node;
+	mutex_init(&head->lock);
+}
+
+/*
+ * wfcq_empty: return whether wait-free queue is empty.
+ *
+ * No memory barrier is issued. No mutual exclusion is required.
+ *
+ * We perform the test on head->node.next to check if the queue is
+ * possibly empty, but we confirm this by checking if the tail pointer
+ * points to the head node because the tail pointer is the linearization
+ * point of the enqueuers. Just checking the head next pointer could
+ * make a queue appear empty if an enqueuer is preempted for a long time
+ * between xchg() and setting the previous node's next pointer.
+ */
+static inline bool wfcq_empty(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	/*
+	 * Queue is empty if no node is pointed to by head->node.next nor
+	 * tail->p. Even though the tail->p check is sufficient to find
+	 * out if the queue is empty, we first check head->node.next as a
+	 * common case to ensure that dequeuers do not frequently access
+	 * enqueuer's tail->p cache line.
+	 */
+	return CMM_LOAD_SHARED(head->node.next) == NULL
+		&& CMM_LOAD_SHARED(tail->p) == &head->node;
+}
+
+static inline void wfcq_dequeue_lock(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	mutex_lock(&head->lock);
+}
+
+static inline void wfcq_dequeue_unlock(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	mutex_unlock(&head->lock);
+}
+
+static inline bool __wfcq_append(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *new_head,
+		struct wfcq_node *new_tail)
+{
+	struct wfcq_node *old_tail;
+
+	/*
+	 * The implicit memory barrier before xchg() orders earlier
+	 * stores to the data structure containing the node, and the
+	 * setting of node->next to NULL, before publication.
+	 */
+	old_tail = xchg(&tail->p, new_tail);
+
+	/*
+	 * The implicit memory barrier after xchg() orders the store to
+	 * tail->p before the store to old_tail->next.
+	 *
+	 * At this point, dequeuers see a NULL tail->p->next, which
+	 * indicates that the queue is being appended to. The following
+	 * store will append "node" to the queue from a dequeuer
+	 * perspective.
+	 */
+	CMM_STORE_SHARED(old_tail->next, new_head);
+	/*
+	 * Return false if queue was empty prior to adding the node,
+	 * else return true.
+	 */
+	return old_tail != &head->node;
+}
+
+/*
+ * wfcq_enqueue: enqueue a node into a wait-free queue.
+ *
+ * Issues a full memory barrier before enqueue. No mutual exclusion is
+ * required.
+ *
+ * Returns false if the queue was empty prior to adding the node.
+ * Returns true otherwise.
+ */
+static inline bool wfcq_enqueue(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *new_tail)
+{
+	return __wfcq_append(head, tail, new_tail, new_tail);
+}
+
+/*
+ * ___wfcq_busy_wait: adaptive busy-wait.
+ *
+ * Returns true if nonblocking and needs to block, false otherwise.
+ */
+static inline bool
+___wfcq_busy_wait(int *attempt, int blocking)
+{
+	if (!blocking)
+		return 1;
+	if (++(*attempt) >= WFCQ_ADAPT_ATTEMPTS) {
+		msleep(WFCQ_WAIT);
+		*attempt = 0;
+	} else {
+		cpu_relax();
+	}
+	return 0;
+}
+
+/*
+ * Wait for the enqueuer to complete the enqueue, then return the next node.
+ */
+static inline struct wfcq_node *
+___wfcq_node_sync_next(struct wfcq_node *node, int blocking)
+{
+	struct wfcq_node *next;
+	int attempt = 0;
+
+	/*
+	 * Adaptive busy-looping, waiting for the enqueuer to complete enqueue.
+	 */
+	while ((next = CMM_LOAD_SHARED(node->next)) == NULL) {
+		if (___wfcq_busy_wait(&attempt, blocking))
+			return WFCQ_WOULDBLOCK;
+	}
+
+	return next;
+}
+
+static inline struct wfcq_node *
+__wfcq_first(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		int blocking)
+{
+	struct wfcq_node *node;
+
+	if (wfcq_empty(head, tail))
+		return NULL;
+	node = ___wfcq_node_sync_next(&head->node, blocking);
+	/* Load head->node.next before loading node's content */
+	smp_read_barrier_depends();
+	return node;
+}
+
+/*
+ * __wfcq_first_blocking: get first node of a queue, without dequeuing.
+ *
+ * Content written into the node before enqueue is guaranteed to be
+ * consistent, but no other memory ordering is ensured.
+ * Dequeue/splice/iteration mutual exclusion should be ensured by the
+ * caller.
+ *
+ * Used by for-like iteration macros in linux/wfcqueue.h:
+ * __wfcq_for_each_blocking()
+ * __wfcq_for_each_blocking_safe()
+ *
+ * Returns NULL if queue is empty, first node otherwise.
+ */
+static inline struct wfcq_node *
+__wfcq_first_blocking(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	return __wfcq_first(head, tail, 1);
+}
+
+
+/*
+ * __wfcq_first_nonblocking: get first node of a queue, without dequeuing.
+ *
+ * Same as __wfcq_first_blocking, but returns WFCQ_WOULDBLOCK if
+ * it needs to block.
+ */
+static inline struct wfcq_node *
+__wfcq_first_nonblocking(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	return __wfcq_first(head, tail, 0);
+}
+
+static inline struct wfcq_node *
+__wfcq_next(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *node,
+		int blocking)
+{
+	struct wfcq_node *next;
+
+	/*
+	 * Even though the following tail->p check is sufficient to find
+	 * out if we reached the end of the queue, we first check
+	 * node->next as a common case to ensure that iteration on nodes
+	 * does not frequently access the enqueuer's tail->p cache line.
+	 */
+	if ((next = CMM_LOAD_SHARED(node->next)) == NULL) {
+		/* Load node->next before tail->p */
+		smp_rmb();
+		if (CMM_LOAD_SHARED(tail->p) == node)
+			return NULL;
+		next = ___wfcq_node_sync_next(node, blocking);
+	}
+	/* Load node->next before loading next's content */
+	smp_read_barrier_depends();
+	return next;
+}
+
+/*
+ * __wfcq_next_blocking: get next node of a queue, without dequeuing.
+ *
+ * Content written into the node before enqueue is guaranteed to be
+ * consistent, but no other memory ordering is ensured.
+ * Dequeue/splice/iteration mutual exclusion should be ensured by the
+ * caller.
+ *
+ * Used by for-like iteration macros in linux/wfcqueue.h:
+ * __wfcq_for_each_blocking()
+ * __wfcq_for_each_blocking_safe()
+ *
+ * Returns NULL if reached end of queue, non-NULL next queue node
+ * otherwise.
+ */
+static inline struct wfcq_node *
+__wfcq_next_blocking(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *node)
+{
+	return __wfcq_next(head, tail, node, 1);
+}
+
+/*
+ * __wfcq_next_nonblocking: get next node of a queue, without dequeuing.
+ *
+ * Same as __wfcq_next_blocking, but returns WFCQ_WOULDBLOCK if
+ * it needs to block.
+ */
+static inline struct wfcq_node *
+__wfcq_next_nonblocking(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *node)
+{
+	return __wfcq_next(head, tail, node, 0);
+}
+
+static inline struct wfcq_node *
+__wfcq_dequeue(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		int blocking)
+{
+	struct wfcq_node *node, *next;
+
+	if (wfcq_empty(head, tail))
+		return NULL;
+
+	node = ___wfcq_node_sync_next(&head->node, blocking);
+	if (!blocking && node == WFCQ_WOULDBLOCK)
+		return WFCQ_WOULDBLOCK;
+
+	if ((next = CMM_LOAD_SHARED(node->next)) == NULL) {
+		/*
+		 * @node is probably the only node in the queue.
+		 * Try to move the tail to &head->node.
+		 * head->node.next is set to NULL here, and stays
+		 * NULL if the cmpxchg succeeds. Should the
+		 * cmpxchg fail due to a concurrent enqueue, the
+		 * head->node.next will be set to the next node.
+		 * The implicit memory barrier before cmpxchg()
+		 * orders the load of node->next before the load
+		 * of tail->p.
+		 * The implicit memory barrier before cmpxchg()
+		 * orders the load of head->node.next before the
+		 * load of node's content.
+		 */
+		wfcq_node_init(&head->node);
+		if (cmpxchg(&tail->p, node, &head->node) == node)
+			return node;
+		next = ___wfcq_node_sync_next(node, blocking);
+		/*
+		 * In nonblocking mode, if we would need to block to
+		 * get node's next, set the head next node pointer
+		 * (currently NULL) back to its original value.
+		 */
+		if (!blocking && next == WFCQ_WOULDBLOCK) {
+			head->node.next = node;
+			return WFCQ_WOULDBLOCK;
+		}
+	}
+
+	/*
+	 * Move queue head forward.
+	 */
+	head->node.next = next;
+
+	/* Load head->node.next before loading node's content */
+	smp_read_barrier_depends();
+	return node;
+}
+
+/*
+ * __wfcq_dequeue_blocking: dequeue a node from the queue.
+ *
+ * Content written into the node before enqueue is guaranteed to be
+ * consistent, but no other memory ordering is ensured.
+ * It is valid to reuse and free a dequeued node immediately.
+ * Dequeue/splice/iteration mutual exclusion should be ensured by the
+ * caller.
+ */
+static inline struct wfcq_node *
+__wfcq_dequeue_blocking(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	return __wfcq_dequeue(head, tail, 1);
+}
+
+/*
+ * __wfcq_dequeue_nonblocking: dequeue a node from a wait-free queue.
+ *
+ * Same as __wfcq_dequeue_blocking, but returns WFCQ_WOULDBLOCK
+ * if it needs to block.
+ */
+static inline struct wfcq_node *
+__wfcq_dequeue_nonblocking(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	return __wfcq_dequeue(head, tail, 0);
+}
+
+/*
+ * __wfcq_splice: enqueue all src_q nodes at the end of dest_q.
+ *
+ * Dequeue all nodes from src_q.
+ * dest_q must already be initialized.
+ * Mutual exclusion for src_q should be ensured by the caller as
+ * specified in the "Synchronization table".
+ * Returns enum wfcq_ret which indicates the state of the src or
+ * dest queue.
+ */
+static inline enum wfcq_ret
+__wfcq_splice(
+		struct wfcq_head *dest_q_head,
+		struct wfcq_tail *dest_q_tail,
+		struct wfcq_head *src_q_head,
+		struct wfcq_tail *src_q_tail,
+		int blocking)
+{
+	struct wfcq_node *head, *tail;
+	int attempt = 0;
+
+	/*
+	 * Initial emptiness check to speed up cases where the queue is
+	 * empty: only loads are required to check if the queue is empty.
+	 */
+	if (wfcq_empty(src_q_head, src_q_tail))
+		return WFCQ_RET_SRC_EMPTY;
+
+	for (;;) {
+		/*
+		 * Open-coded wfcq_empty() by testing the result of
+		 * xchg, as well as the tail pointer vs the head node
+		 * address.
+		 */
+		head = xchg(&src_q_head->node.next, NULL);
+		if (head)
+			break;	/* non-empty */
+		if (CMM_LOAD_SHARED(src_q_tail->p) == &src_q_head->node)
+			return WFCQ_RET_SRC_EMPTY;
+		if (___wfcq_busy_wait(&attempt, blocking))
+			return WFCQ_RET_WOULDBLOCK;
+	}
+
+	/*
+	 * The memory barrier implied before xchg() orders the store to
+	 * src_q->head before the store to src_q->tail. This is required by
+	 * concurrent enqueue on src_q, which exchanges the tail before
+	 * updating the previous tail's next pointer.
+	 */
+	tail = xchg(&src_q_tail->p, &src_q_head->node);
+
+	/*
+	 * Append the spliced content of src_q into dest_q. Does not
+	 * require mutual exclusion on dest_q (wait-free).
+	 */
+	if (__wfcq_append(dest_q_head, dest_q_tail, head, tail))
+		return WFCQ_RET_DEST_NON_EMPTY;
+	else
+		return WFCQ_RET_DEST_EMPTY;
+}
+
+/*
+ * __wfcq_splice_blocking: enqueue all src_q nodes at the end of dest_q.
+ *
+ * Dequeue all nodes from src_q.
+ * dest_q must already be initialized.
+ * Mutual exclusion for src_q should be ensured by the caller as
+ * specified in the "Synchronization table".
+ * Returns enum wfcq_ret which indicates the state of the src or
+ * dest queue. Never returns WFCQ_RET_WOULDBLOCK.
+ */
+static inline enum wfcq_ret
+__wfcq_splice_blocking(
+		struct wfcq_head *dest_q_head,
+		struct wfcq_tail *dest_q_tail,
+		struct wfcq_head *src_q_head,
+		struct wfcq_tail *src_q_tail)
+{
+	return __wfcq_splice(dest_q_head, dest_q_tail,
+		src_q_head, src_q_tail, 1);
+}
+
+/*
+ * __wfcq_splice_nonblocking: enqueue all src_q nodes at the end of dest_q.
+ *
+ * Same as __wfcq_splice_blocking, but returns
+ * WFCQ_RET_WOULDBLOCK if it needs to block.
+ */
+static inline enum wfcq_ret
+__wfcq_splice_nonblocking(
+		struct wfcq_head *dest_q_head,
+		struct wfcq_tail *dest_q_tail,
+		struct wfcq_head *src_q_head,
+		struct wfcq_tail *src_q_tail)
+{
+	return __wfcq_splice(dest_q_head, dest_q_tail,
+		src_q_head, src_q_tail, 0);
+}
+
+/*
+ * wfcq_dequeue_blocking: dequeue a node from a wait-free queue.
+ *
+ * Content written into the node before enqueue is guaranteed to be
+ * consistent, but no other memory ordering is ensured.
+ * Mutual exclusion with wfcq_splice_blocking is ensured by holding
+ * the dequeue lock.
+ * It is valid to reuse and free a dequeued node immediately.
+ */
+static inline struct wfcq_node *
+wfcq_dequeue_blocking(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	struct wfcq_node *retval;
+
+	wfcq_dequeue_lock(head, tail);
+	retval = __wfcq_dequeue_blocking(head, tail);
+	wfcq_dequeue_unlock(head, tail);
+	return retval;
+}
+
+/*
+ * wfcq_splice_blocking: enqueue all src_q nodes at the end of dest_q.
+ *
+ * Dequeue all nodes from src_q.
+ * dest_q must already be initialized.
+ * Content written into the node before enqueue is guaranteed to be
+ * consistent, but no other memory ordering is ensured.
+ * Mutual exclusion with wfcq_dequeue_blocking is ensured by holding
+ * the dequeue lock.
+ * Returns enum wfcq_ret which indicates the state of the src or
+ * dest queue. Never returns WFCQ_RET_WOULDBLOCK.
+ */
+static inline enum wfcq_ret
+wfcq_splice_blocking(
+		struct wfcq_head *dest_q_head,
+		struct wfcq_tail *dest_q_tail,
+		struct wfcq_head *src_q_head,
+		struct wfcq_tail *src_q_tail)
+{
+	enum wfcq_ret ret;
+
+	wfcq_dequeue_lock(src_q_head, src_q_tail);
+	ret = __wfcq_splice_blocking(dest_q_head, dest_q_tail,
+			src_q_head, src_q_tail);
+	wfcq_dequeue_unlock(src_q_head, src_q_tail);
+	return ret;
+}
+
+/*
+ * __wfcq_for_each_blocking: Iterate over all nodes in a queue,
+ * without dequeuing them.
+ * @head: head of the queue (struct wfcq_head pointer).
+ * @tail: tail of the queue (struct wfcq_tail pointer).
+ * @node: iterator on the queue (struct wfcq_node pointer).
+ *
+ * Content written into each node before enqueue is guaranteed to be
+ * consistent, but no other memory ordering is ensured.
+ * Dequeue/splice/iteration mutual exclusion should be ensured by the
+ * caller.
+ */
+#define __wfcq_for_each_blocking(head, tail, node)		\
+	for (node = __wfcq_first_blocking(head, tail);	\
+		node != NULL;					\
+		node = __wfcq_next_blocking(head, tail, node))
+
+/*
+ * __wfcq_for_each_blocking_safe: Iterate over all nodes in a queue,
+ * without dequeuing them. Safe against deletion.
+ * @head: head of the queue (struct wfcq_head pointer).
+ * @tail: tail of the queue (struct wfcq_tail pointer).
+ * @node: iterator on the queue (struct wfcq_node pointer).
+ * @n: struct wfcq_node pointer holding the next pointer (used
+ *     internally).
+ *
+ * Content written into each node before enqueue is guaranteed to be
+ * consistent, but no other memory ordering is ensured.
+ * Dequeue/splice/iteration mutual exclusion should be ensured by the
+ * caller.
+ */
+#define __wfcq_for_each_blocking_safe(head, tail, node, n)		       \
+	for (node = __wfcq_first_blocking(head, tail),		       \
+			n = (node ? __wfcq_next_blocking(head, tail, node) : NULL); \
+		node != NULL;						       \
+		node = n, n = (node ? __wfcq_next_blocking(head, tail, node) : NULL))
+
+#endif /* _LINUX_WFCQUEUE_H */

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Eric Wong @ 2013-03-14  4:22 UTC
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> Ported to the Linux kernel from Userspace RCU library, at commit
> 108a92e5b97ee91b2b902dba2dd2e78aab42f420.
> 
> Ref: http://git.lttng.org/userspace-rcu.git
> 
> It is provided as a starting point only. Test cases should be ported
> from Userspace RCU to kernel space and thoroughly run on a wide range of
> architectures before considering this port production-ready.

Thanks, this seems to work.  Will post an early epoll patch in a minute.

Minor comments below.

> +/*
> + * Load data from shared memory.
> + */
> +#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)

When iterating through the queue by dequeueing, I needed a way
to get the tail at the start of the iteration and use that as
a sentinel while iterating, so I access the tail like this:

	struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);

I hope this is supported... it seems to work :)

Unlike most queue users, I need to stop iteration to prevent the same
item from appearing in the events returned by epoll_wait; since a
dequeued item may appear back in the wfcqueue immediately.
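
For illustration, the bounded loop is roughly the following (a
simplified sketch; ep->rdlhead, eventcnt and maxevents are illustrative
names, and EPOLLONESHOT handling is elided):

	struct wfcq_node *tail = CMM_LOAD_SHARED(ep->rdltail.p);
	struct wfcq_node *node;
	int eventcnt = 0;

	do {
		node = __wfcq_dequeue_blocking(&ep->rdlhead, &ep->rdltail);
		if (!node)
			break;
		/* ... transfer the event to the user buffer ... */
	} while (node != tail && ++eventcnt < maxevents);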

> +struct wfcq_head {
> +	struct wfcq_node node;
> +	struct mutex lock;
> +};

I'm not using this lock at all since I already have ep->mtx (which also
protects the ep->rbr).  Perhaps it should not be included; normal linked
list and most data structures I see in the kernel do not provide their
own locks, either.

> +static inline void wfcq_init(struct wfcq_head *head,
> +		struct wfcq_tail *tail)
> +{
> +	/* Set queue head and tail */
> +	wfcq_node_init(&head->node);
> +	tail->p = &head->node;
> +	mutex_init(&head->lock);
> +}

There's no corresponding mutex_destroy, so I'm just destroying it
myself...
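
i.e., something like this in my teardown path (field name illustrative):

	mutex_destroy(&ep->rdlhead.lock);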

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Mathieu Desnoyers @ 2013-03-14 13:18 UTC
  To: Eric Wong
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

* Eric Wong (normalperson@yhbt.net) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > Ported to the Linux kernel from Userspace RCU library, at commit
> > 108a92e5b97ee91b2b902dba2dd2e78aab42f420.
> > 
> > Ref: http://git.lttng.org/userspace-rcu.git
> > 
> > It is provided as a starting point only. Test cases should be ported
> > from Userspace RCU to kernel space and thoroughly run on a wide range of
> > architectures before considering this port production-ready.
> 
> Thanks, this seems to work.  Will post an early epoll patch in a minute.
> 
> Minor comments below.
> 
> > +/*
> > + * Load data from shared memory.
> > + */
> > +#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)
> 
> When iterating through the queue by dequeueing, I needed a way
> to get the tail at the start of the iteration and use that as
> a sentinel while iterating, so I access the tail like this:
> 
> 	struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> 
> I hope this is supported... it seems to work :)

Ideally it would be good if users could try using the exposed APIs to do
these things, or if it's really needed, maybe it's a sign that we need
to extend the API.

> 
> Unlike most queue users, I need to stop iteration to prevent the same
> item from appearing in the events returned by epoll_wait; since a
> dequeued item may appear back in the wfcqueue immediately.

I think your use-case is very similar to our urcu call_rcu
implementation. I would recommend to use wfcq in the following way:

When you want to dequeue, define, on the stack:

struct wfcq_head qhead;
struct wfcq_tail qtail;
struct wfcq_node *node, *n;
enum wfcq_ret ret;

wfcq_init(&qhead, &qtail);

/*
 * Then use splice to move the entire source queue into the local queue.
 * Don't forget to grab the appropriate mutexes for epoll_q here.
 */
ret = __wfcq_splice_blocking(&qhead, &qtail, epoll_q_head, epoll_q_tail);

switch (ret) {
case WFCQ_RET_SRC_EMPTY:
        return -ENOENT;         /* No events to handle */
case WFCQ_RET_DEST_EMPTY:
case WFCQ_RET_DEST_NON_EMPTY:
        break;
}

/*
 * From this point, you can release the epoll_q lock and simply iterate
 * on the local queue using __wfcq_for_each_blocking() or
 * __wfcq_for_each_blocking_safe() if you need to free the nodes at
 * the same time.
 */
__wfcq_for_each_blocking_safe(&qhead, &qtail, node, n) {
    ...
}

The advantage of using splice() over dequeue() is that you will reduce
the amount of interactions between concurrent enqueue and dequeue
operations on the head and tail of the same queue.

> 
> > +struct wfcq_head {
> > +	struct wfcq_node node;
> > +	struct mutex lock;
> > +};
> 
> I'm not using this lock at all since I already have ep->mtx (which also
> protects the ep->rbr).  Perhaps it should not be included; normal linked
> list and most data structures I see in the kernel do not provide their
> own locks, either.

Good point. The Linux kernel convention is to leave locks outside of
those structures whenever possible, so users can pick the right lock for
their needs. I will remove the lock, along with all the "helpers" that
use it.

> 
> > +static inline void wfcq_init(struct wfcq_head *head,
> > +		struct wfcq_tail *tail)
> > +{
> > +	/* Set queue head and tail */
> > +	wfcq_node_init(&head->node);
> > +	tail->p = &head->node;
> > +	mutex_init(&head->lock);
> > +}
> 
> There's no corresponding mutex_destroy, so I'm just destroying it
> myself...

Since I remove it, it will become a non-issue.

After thinking about it, I plan to disable preemption around the xchg
and store within __wfcq_append. This is a luxury we have at kernel level
that we don't have in user-space, so it might be good to use it. This
will allow me to remove the 10ms sleep from the adaptive busy-wait in
___wfcq_busy_wait(), and replace it with a simple busy-wait. This will
eliminate those possible latency peaks from the dequeue/splice/iteration
side. It's a shame to have the possibility of a 10ms sleep due to
preemption within a two-instruction window. preempt_disable() will fix
this, while adding very low overhead to the enqueue path in preemptible
kernels.
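
Roughly, the enqueue fast-path would become (untested sketch, original
comments elided):

static inline bool __wfcq_append(struct wfcq_head *head,
		struct wfcq_tail *tail,
		struct wfcq_node *new_head,
		struct wfcq_node *new_tail)
{
	struct wfcq_node *old_tail;

	/* Close the preemption window between the xchg and the store. */
	preempt_disable();
	old_tail = xchg(&tail->p, new_tail);
	CMM_STORE_SHARED(old_tail->next, new_head);
	preempt_enable();

	return old_tail != &head->node;
}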

I'll send a new patch version soon,

Thanks for your comments!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Eric Wong @ 2013-03-14 19:07 UTC
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> * Eric Wong (normalperson@yhbt.net) wrote:
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > > +/*
> > > + * Load data from shared memory.
> > > + */
> > > +#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)
> > 
> > When iterating through the queue by dequeueing, I needed a way
> > to get the tail at the start of the iteration and use that as
> > a sentinel while iterating, so I access the tail like this:
> > 
> > 	struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > 
> > I hope this is supported... it seems to work :)
> 
> Ideally it would be good if users could try using the exposed APIs to do
> these things, or if it's really needed, maybe it's a sign that we need
> to extend the API.

Right.  If I can use splice, I will not need this.  more comments below
on splice...

> > Unlike most queue users, I need to stop iteration to prevent the same
> > item from appearing in the events returned by epoll_wait; since a
> > dequeued item may appear back in the wfcqueue immediately.
> 
> I think your use-case is very similar to our urcu call_rcu
> implementation. I would recommend to use wfcq in the following way:
> 
> When you want to dequeue, define, on the stack:
> 
> struct wfcq_head qhead;
> struct wfcq_tail qtail;
> struct wfcq_node *node, *n;
> enum wfcq_ret ret;
> 
> wfcq_init(&qhead, &qtail);
> 
> /*
>  * Then use splice to move the entire source queue into the local queue.
>  * Don't forget to grab the appropriate mutexes for epoll_q here.
>  */
> ret = __wfcq_splice_blocking(&qhead, &qtail, epoll_q_head, epoll_q_tail);
> 
> switch (ret) {
> case WFCQ_RET_SRC_EMPTY:
>         return -ENOENT;         /* No events to handle */
> case WFCQ_RET_DEST_EMPTY:
> case WFCQ_RET_DEST_NON_EMPTY:
>         break;
> }
> 
> /*
>  * From this point, you can release the epoll_q lock and simply iterate
>  * on the local queue using __wfcq_for_each_blocking() or
>  * __wfcq_for_each_blocking_safe() if you need to free the nodes at
>  * the same time.
>  */
> __wfcq_for_each_blocking_safe(&qhead, &qtail, node, n) {
>     ...
> }
> 
> The advantage of using splice() over dequeue() is that you will reduce
> the amount of interactions between concurrent enqueue and dequeue
> operations on the head and tail of the same queue.

I wanted to use splice here originally, but epoll_wait(2) may not
consume the entire queue, as it's limited to maxevents specified by the
user calling epoll_wait.

With unconsumed elements, I need to preserve ordering of the queue to
avoid starvation.  So I would either need to:

a) splice the unconsumed portion back to the head of the shared queue.
   I'm not sure if this is possible while elements are being enqueued...
   Using a mutex for splicing back to unconsumed elements is OK, and
   probably required anyways since we need to keep EPOLLONESHOT
   unmasking synced with dequeue.

b) preserve the unconsumed spliced queue across epoll_wait invocations
   but that requires checking both queues for event availability...

> I'll send a new patch version soon,
> 
> Thanks for your comments!

Thank you for this data structure!  I will update my patch tomorrow or
the day after and do more testing.

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Mathieu Desnoyers @ 2013-03-14 19:48 UTC
  To: Eric Wong
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

* Eric Wong (normalperson@yhbt.net) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > * Eric Wong (normalperson@yhbt.net) wrote:
> > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > > > +/*
> > > > + * Load data from shared memory.
> > > > + */
> > > > +#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)
> > > 
> > > When iterating through the queue by dequeueing, I needed a way
> > > to get the tail at the start of the iteration and use that as
> > > a sentinel while iterating, so I access the tail like this:
> > > 
> > > 	struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > > 
> > > I hope this is supported... it seems to work :)
> > 
> > Ideally it would be good if users could try using the exposed APIs to do
> > these things, or if it's really needed, maybe it's a sign that we need
> > to extend the API.
> 
> Right.  If I can use splice, I will not need this.  more comments below
> on splice...
> 
> > > Unlike most queue users, I need to stop iteration to prevent the same
> > > item from appearing in the events returned by epoll_wait; since a
> > > dequeued item may appear back in the wfcqueue immediately.
> > 
> > I think your use-case is very similar to our urcu call_rcu
> > implementation. I would recommend to use wfcq in the following way:
> > 
> > When you want to dequeue, define, on the stack:
> > 
> > struct wfcq_head qhead;
> > struct wfcq_tail qtail;
> > struct wfcq_node *node, *n;
> > enum wfcq_ret ret;
> > 
> > wfcq_init(&qhead, &qtail);
> > 
> > /*
> >  * Then use splice to move the entire source queue into the local queue.
> >  * Don't forget to grab the appropriate mutexes for epoll_q here.
> >  */
> > ret = __wfcq_splice_blocking(&qhead, &qtail, epoll_q_head, epoll_q_tail);
> > 
> > switch (ret) {
> > case WFCQ_RET_SRC_EMPTY:
> >         return -ENOENT;         /* No events to handle */
> > case WFCQ_RET_DEST_EMPTY:
> > case WFCQ_RET_DEST_NON_EMPTY:
> >         break;
> > }
> > 
> > /*
> >  * From this point, you can release the epoll_q lock and simply iterate
> >  * on the local queue using __wfcq_for_each_blocking() or
> >  * __wfcq_for_each_blocking_safe() if you need to free the nodes at
> >  * the same time.
> >  */
> > __wfcq_for_each_blocking_safe(&qhead, &qtail, node, n) {
> >     ...
> > }
> > 
> > The advantage of using splice() over dequeue() is that you will reduce
> > the amount of interactions between concurrent enqueue and dequeue
> > operations on the head and tail of the same queue.
> 
> I wanted to use splice here originally, but epoll_wait(2) may not
> consume the entire queue, as it's limited to maxevents specified by the
> user calling epoll_wait.

I see,

> 
> With unconsumed elements, I need to preserve ordering of the queue to
> avoid starvation.  So I would either need to:
> 
> a) splice the unconsumed portion back to the head of the shared queue.
>    I'm not sure if this is possible while elements are being enqueued...

That would be a double-ended queue. I haven't thought this problem
through yet.

>    Using a mutex for splicing back to unconsumed elements is OK, and
>    probably required anyways since we need to keep EPOLLONESHOT
>    unmasking synced with dequeue.
> 
> b) preserve the unconsumed spliced queue across epoll_wait invocations
>    but that requires checking both queues for event availability...

I think b) should be preferred over a).

Basically, instead of having the "unconsumed element" queue on the stack
as I suggested, you might want to keep it across epoll_wait invocations.

Before you start digging in the unconsumed element queue, you just start
by splicing the content of the shared queue into the tail of unconsumed
queue. Then, you simply dequeue the unconsumed queue elements until you
either reach the end or the max nr.

You should note that with this scheme, you'll have to dequeue items from
the unconsumed queue rather than just iterating over the elements. The
nice side of consuming all the elements (in the temp queue on the local
stack) is that you don't care about dequeuing, since the entire queue
will vanish. However, in this case, since you care about keeping the
queue after a partial iteration, you need to dequeue from it.

And yes, this approach involves checking both queues for event
availability. Hopefully none of this will be too much of an issue
performance-wise.

Another approach could be to let you work directly on the shared queue:

I could possibly implement a

void __wfcq_snapshot(struct wfcq_head *head,
                struct wfcq_tail *tail);

That would save a tail snapshot that would then be used to stop
iteration, dequeue and splice at the location of the tail snapshot. And

void __wfcq_snapshot_reset(struct wfcq_head *head,
                struct wfcq_tail *tail);

would set the tail snapshot pointer back to NULL.

This would require a few extra checks, but nothing very expensive I
expect.

Thoughts ?

> 
> > I'll send a new patch version soon,
> > 
> > Thanks for your comments!
> 
> Thank you for this data structure!  I will update my patch tomorrow or
> the day after and do more testing.

I look forward to see the next patch!

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Eric Wong @ 2013-03-14 21:32 UTC
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> * Eric Wong (normalperson@yhbt.net) wrote:
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > > The advantage of using splice() over dequeue() is that you will reduce
> > > the amount of interactions between concurrent enqueue and dequeue
> > > operations on the head and tail of the same queue.
> > 
> > I wanted to use splice here originally, but epoll_wait(2) may not
> > consume the entire queue, as it's limited to maxevents specified by the
> > user calling epoll_wait.
> 
> I see,
> 
> > 
> > With unconsumed elements, I need to preserve ordering of the queue to
> > avoid starvation.  So I would either need to:
> > 
> > a) splice the unconsumed portion back to the head of the shared queue.
> >    I'm not sure if this is possible while elements are being enqueued...
> 
> That would be a double-ended queue. I haven't thought this problem
> through yet.
> 
> >    Using a mutex for splicing back to unconsumed elements is OK, and
> >    probably required anyways since we need to keep EPOLLONESHOT
> >    unmasking synced with dequeue.
> > 
> > b) preserve the unconsumed spliced queue across epoll_wait invocations
> >    but that requires checking both queues for event availability...
> 
> I think b) should be preferred over a).
> 
> Basically, instead of having the "unconsumed element" queue on the stack
> as I suggested, you might want to keep it across epoll_wait invocations.
> 
> Before you start digging in the unconsumed element queue, you just start
> by splicing the content of the shared queue into the tail of unconsumed
> queue. Then, you simply dequeue the unconsumed queue elements until you
> either reach the end or the max nr.
> 
> You should note that with this scheme, you'll have to dequeue items from
> the unconsumed queue rather than just iterating over the elements. The
> nice side of consuming all the elements (in the temp queue on the local
> stack) is that you don't care about dequeuing, since the entire queue
> will vanish. However, in this case, since you care about keeping the
> queue after a partial iteration, you need to dequeue from it.
> 
> And yes, this approach involves checking both queues for event
> availability. Hopefully none of this will be too much of an issue
> performance-wise.

Right.  I will try this; I don't think the check will be too expensive.

When dequeuing from the unconsumed queue, perhaps there should be a
"dequeue_local" function which omits the normal barriers required
for the shared queue.

With a splice and without needing barriers for iteration, this sounds good.

> Another approach could be to let you work directly on the shared queue:
> 
> I could possibly implement a
> 
> void __wfcq_snapshot(struct wfcq_head *head,
>                 struct wfcq_tail *tail);
> 
> That would save a tail snapshot that would then be used to stop
> iteration, dequeue and splice at the location of the tail snapshot. And
> 
> void __wfcq_snapshot_reset(struct wfcq_head *head,
>                 struct wfcq_tail *tail);
> 
> would set the tail snapshot pointer back to NULL.
> 
> This would require a few extra checks, but nothing very expensive I
> expect.
> 
> Thoughts ?

I'm not sure I follow, would using it be something like this?

	snapshot
	iterate (read-only, no dequeue)
	splice(discard_head, discard_tail, shared_head, iter_stop_point)
	snapshot_reset

I also use an atomic state variable to prevent an item from being queued
twice, and I unset that while iterating+dequeueing.

If I change the state to IDLE while iterating and ep_poll_callback sets
the state to READY on an item before it is spliced out, that would cause
the event to be lost when it's spliced out.

So I think dequeue/state IDLE while iterating is required for epoll_wait,
I'll try the private queue.

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Peter Hurley @ 2013-03-15  0:49 UTC
  To: Mathieu Desnoyers
  Cc: Eric Wong, Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

On Mon, 2013-03-11 at 17:36 -0400, Mathieu Desnoyers wrote:
> +/*
> + * Do not put head and tail on the same cache-line if concurrent
> + * enqueue/dequeue are expected from many CPUs. This eliminates
> + * false-sharing between enqueue and dequeue.
> + */
> +struct wfcq_head {
> +	struct wfcq_node node;
> +	struct mutex lock;
> +};
> +
> +struct wfcq_tail {
> +	struct wfcq_node *p;
> +};


If you want to force separate cachelines for SMP, this would be

struct wfcq_head {
	struct wfcq_node node;
	struct mutex lock;
} ____cacheline_aligned_in_smp;

struct wfcq_tail {
	struct wfcq_node *p;
} ____cacheline_aligned_in_smp;


* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Mathieu Desnoyers @ 2013-03-15  2:08 UTC
  To: Peter Hurley
  Cc: Eric Wong, Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

* Peter Hurley (peter@hurleysoftware.com) wrote:
> On Mon, 2013-03-11 at 17:36 -0400, Mathieu Desnoyers wrote:
> > +/*
> > + * Do not put head and tail on the same cache-line if concurrent
> > + * enqueue/dequeue are expected from many CPUs. This eliminates
> > + * false-sharing between enqueue and dequeue.
> > + */
> > +struct wfcq_head {
> > +	struct wfcq_node node;
> > +	struct mutex lock;
> > +};
> > +
> > +struct wfcq_tail {
> > +	struct wfcq_node *p;
> > +};
> 
> 
> If you want to force separate cachelines for SMP, this would be
> 
> struct wfcq_head {
> 	struct wfcq_node node;
> 	struct mutex lock;
> } ____cacheline_aligned_in_smp;
> 
> struct wfcq_tail {
> 	struct wfcq_node *p;
> } ____cacheline_aligned_in_smp;

Well, the thing is: I don't want to force it. The queue head and tail
can be used in a few ways:

1) tail used by frequent enqueue on one CPU, head used for frequent
   dequeues on another CPU. In this case, we want head/tail on different
   cache lines.
2) same scenario as 1), but head and tail are placed in per-cpu data
   of the two CPUs. We don't need to align each structure explicitly.
3) tail and head used locally, e.g. on the stack, as splice destination
   followed by iteration or dequeue from the same CPU. We don't want to
   waste precious memory space in this case, so no alignment.

So as you see, only (1) actually requires explicit alignment of head and
tail.
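
In that case, the user can force the alignment at the declaration site,
e.g. (struct name illustrative):

struct my_queue {
	struct wfcq_head head ____cacheline_aligned_in_smp;
	struct wfcq_tail tail ____cacheline_aligned_in_smp;
};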

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Mathieu Desnoyers @ 2013-03-15  2:38 UTC
  To: Eric Wong
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

* Eric Wong (normalperson@yhbt.net) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > * Eric Wong (normalperson@yhbt.net) wrote:
> > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > > > The advantage of using splice() over dequeue() is that you will reduce
> > > > the amount of interactions between concurrent enqueue and dequeue
> > > > operations on the head and tail of the same queue.
> > > 
> > > I wanted to use splice here originally, but epoll_wait(2) may not
> > > consume the entire queue, as it's limited to maxevents specified by the
> > > user calling epoll_wait.
> > 
> > I see,
> > 
> > > 
> > > With unconsumed elements, I need to preserve ordering of the queue to
> > > avoid starvation.  So I would either need to:
> > > 
> > > a) splice the unconsumed portion back to the head of the shared queue.
> > >    I'm not sure if this is possible while elements are being enqueued...
> > 
> > That would be a double-ended queue. I haven't thought this problem
> > through yet.
> > 
> > >    Using a mutex for splicing back to unconsumed elements is OK, and
> > >    probably required anyways since we need to keep EPOLLONESHOT
> > >    unmasking synced with dequeue.
> > > 
> > > b) preserve the unconsumed spliced queue across epoll_wait invocations
> > >    but that requires checking both queues for event availability...
> > 
> > I think b) should be preferred over a).
> > 
> > Basically, instead of having the "unconsumed element" queue on the stack
> > as I suggested, you might want to keep it across epoll_wait invocations.
> > 
> > Before you start digging in the unconsumed element queue, you just start
> > by splicing the content of the shared queue into the tail of unconsumed
> > queue. Then, you simply dequeue the unconsumed queue elements until you
> > either reach the end or the max nr.
> > 
> > You should note that with this scheme, you'll have to dequeue items from
> > the unconsumed queue rather than just iterating over the elements. The
> > nice side of consuming all the elements (in the temp queue on the local
> > stack) is that you don't care about dequeuing, since the entire queue
> > will vanish. However, in this case, since you care about keeping the
> > queue after a partial iteration, you need to dequeue from it.
> > 
> > And yes, this approach involves checking both queues for event
> > availability. Hopefully none of this will be too much of an issue
> > performance-wise.
> 
> Right.  I will try this, I don't think the check will be too expensive.
> 
> When dequeuing from the unconsumed queue, perhaps there should be a
> "dequeue_local" function which omits the normal barriers required
> for the shared queue.
> 
> With a splice and without needing barriers for iteration, this sounds good.

Well actually, __wfcq_dequeue() is really not that expensive. In terms
of synchronization, here is what it typically does:

node = ___wfcq_node_sync_next(&head->node);
  -> busy wait if node->next is NULL. This check is needed even if we
     work on a "local" queue, because the O(1) splice operation does not
     walk over every node to check for NULL next pointer: this is left
     to the dequeue/iteration operations.
if ((next = CMM_LOAD_SHARED(node->next)) == NULL) {
  -> only taken if we are getting the last node of the queue. Happens
     at most once per batch.
}

head->node.next = next;
  -> a simple store.

smp_read_barrier_depends();
  -> no-op on everything but Alpha.

return node;

So my recommendation would be to be careful before trying to come up
with flavors that remove barriers if those are not actually hurting the
fast-path significantly. By dequeue fast-path, I mean what needs to be
executed for dequeue of each node.

> 
> > Another approach could be to let you work directly on the shared queue:
> > 
> > I could possibly implement a
> > 
> > void __wfcq_snapshot(struct wfcq_head *head,
> >                 struct wfcq_tail *tail);
> > 
> > That would save a tail snapshot that would then be used to stop
> > iteration, dequeue and splice at the location of the tail snapshot. And
> > 
> > void __wfcq_snapshot_reset(struct wfcq_head *head,
> >                 struct wfcq_tail *tail);
> > 
> > would set the tail snapshot pointer back to NULL.
> > 
> > This would require a few extra checks, but nothing very expensive I
> > expect.
> > 
> > Thoughts ?
> 
> I'm not sure I follow, would using it be something like this?
> 
> 	snapshot
> 	iterate (read-only, no dequeue)
> 	splice(discard_head, discard_tail, shared_head, iter_stop_point)
> 	snapshot_reset

My idea was more like:

snapshot
__wfcq_dequeue until returns NULL or reach max
snapshot_reset
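
In code, it could look like this (hypothetical sketch: the snapshot API
is not implemented yet, and the epoll field names are only
illustrative):

	__wfcq_snapshot(&ep->rdlhead, &ep->rdltail);
	for (i = 0; i < maxevents; i++) {
		node = __wfcq_dequeue_blocking(&ep->rdlhead, &ep->rdltail);
		if (!node)	/* reached the snapshot, or queue empty */
			break;
		/* ... hand the event to user space ... */
	}
	__wfcq_snapshot_reset(&ep->rdlhead, &ep->rdltail);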

But I would really prefer the splice approach unless there is a very
significant performance gain to do otherwise, because I don't want to
clutter the API with various ways of performing the same thing
needlessly.

> 
> I also use an atomic state variable to prevent an item from being queued
> twice, and I unset that while iterating+dequeueing.
> 
> I change the state to IDLE while iterating and ep_poll_callback sets
> state READY on any item before being spliced out, that would cause the
> event to be lost when it's spliced out.
> 
> So I think dequeue/state IDLE while iterating is required for epoll_wait,
> I'll try the private queue.

Yep. Just to summarize the private queue approach:

within your epoll queue structure:

struct wfcq_tail queue_tail;       /* enqueue fast-path */

/* Below only touched by splice and dequeue */
struct wfcq_head queue_head;
struct wfcq_tail unconsumed_tail;
struct wfcq_head unconsumed_head ____cacheline_aligned;  /* dequeue fast-path */

In fact, what we really, really care about is that enqueue and dequeue
fast-paths don't touch the same cache-line. For the other fields, we
don't really care.

Then, when you need to handle the batch:

/* Hold dequeue mutex */

(void) __wfcq_splice_blocking(&epoll_queue->unconsumed_head,
        &epoll_queue->unconsumed_tail,
        &epoll_queue->queue_head,
        &epoll_queue->queue_tail);
/* You don't care about the return value here. */

for (i = 0; i < max_ev; i++) {
        struct wfcq_node *node;

        node = __wfcq_dequeue_blocking(&epoll_queue->unconsumed_head,
                        &epoll_queue->unconsumed_tail);
        if (!node)
                break;
        /* Deal with node, free node's container */
}

/* Release dequeue mutex */

You don't need to do an extra check for empty queues, because these
checks are embedded into splice and dequeue operations.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Eric Wong @ 2013-03-16 22:02 UTC
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

Eric Wong <normalperson@yhbt.net> wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > * Eric Wong (normalperson@yhbt.net) wrote:
> > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > > > +/*
> > > > + * Load data from shared memory.
> > > > + */
> > > > +#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)
> > > 
> > > When iterating through the queue by dequeueing, I needed a way
> > > to get the tail at the start of the iteration and use that as
> > > a sentinel while iterating, so I access the tail like this:
> > > 
> > > 	struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > > 
> > > I hope this is supported... it seems to work :)
> > 
> > Ideally it would be good if users could try using the exposed APIs to do
> > these things, or if it's really needed, maybe it's a sign that we need
> > to extend the API.
> 
> Right.  If I can use splice, I will not need this.  more comments below
> on splice...

Even with splice, I think I need to see the main tail at the start of
iteration to maintain compatibility (for weird apps that might care).

Consider this scenario:

  1) main.queue has 20 events

  2) epoll_wait(maxevents=16) called by user

  3) splice all 20 events into unconsumed.queue, main.queue is empty

  4) put_user + dequeue on 16 events from unconsumed.queue
     # unconsumed.queue has 4 left at this point

  5) main.queue gets several more events enqueued at any point after 3.

  6) epoll_wait(maxevents=16) called by user again

  7) put_user + dequeue on 4 remaining items in unconsumed.queue

     We can safely return 4 events back to the user at this point.

     However, this might break compatibility for existing users.  I'm
     not sure if there's any weird apps which know/expect the event
     count they'll get from epoll_wait, but maybe there is one...

  8) We could perform a splice off main.queue to fill the remaining
     slots the user requested, but we do not know if the things we
     splice from main.queue at this point were just dequeued in 7.

     If we loaded the main.queue.tail before 7, we could safely splice
     into unconsumed.queue and know when to stop when repeating the
     put_user + dequeue loop.

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
From: Mathieu Desnoyers @ 2013-03-17  2:03 UTC
  To: Eric Wong
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

* Eric Wong (normalperson@yhbt.net) wrote:
> Eric Wong <normalperson@yhbt.net> wrote:
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > > * Eric Wong (normalperson@yhbt.net) wrote:
> > > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > > > > +/*
> > > > > + * Load data from shared memory.
> > > > > + */
> > > > > +#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)
> > > > 
> > > > When iterating through the queue by dequeueing, I needed a way
> > > > to get the tail at the start of the iteration and use that as
> > > > a sentinel while iterating, so I access the tail like this:
> > > > 
> > > > 	struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > > > 
> > > > I hope this is supported... it seems to work :)
> > > 
> > > Ideally it would be good if users could try using the exposed APIs to do
> > > these things, or if it's really needed, maybe it's a sign that we need
> > > to extend the API.
> > 
> > Right.  If I can use splice, I will not need this.  more comments below
> > on splice...
> 
> Even with splice, I think I need to see the main tail at the start of
> iteration to maintain compatibility (for weird apps that might care).

Thanks for providing this detailed scenario. I think there is an
important aspect in the use of splice I suggested on which we are not
fully understanding each other. I will annotate your scenario below with
clarifications:

> 
> Consider this scenario:
> 
>   1) main.queue has 20 events
> 
>   2) epoll_wait(maxevents=16) called by user
> 
>   3) splice all 20 events into unconsumed.queue, main.queue is empty
> 
>   4) put_user + dequeue on 16 events from unconsumed.queue
>      # unconsumed.queue has 4 left at this point
> 
>   5) main.queue gets several more events enqueued at any point after 3.

Let's suppose 11 events are enqueued into main.queue after 3.

> 
>   6) epoll_wait(maxevents=16) called by user again

Before 7), here is what should be done:

6.5) splice all new events from main.queue into unconsumed.queue.
     unconsumed.queue will now contain 4 + 11 = 15 events. Note
     that splice will preserve the right order of events within
     unconsumed.queue.

> 
>   7) put_user + dequeue on 4 remaining items in unconsumed.queue
> 
>      We can safely return 4 events back to the user at this point.

Step 7) will now return 15 events from unconsumed.queue.

> 
>      However, this might break compatibility for existing users.  I'm
>      not sure if there's any weird apps which know/expect the event
>      count they'll get from epoll_wait, but maybe there is one...

With the new step 6.5), I don't think the behavior will change compared
to what is already there.

> 
>   8) We could perform a splice off main.queue to fill the remaining
>      slots the user requested, but we do not know if the things we
>      splice from main.queue at this point were just dequeued in 7.
> 
>      If we loaded the main.queue.tail before 7, we could safely splice
>      into unconsumed.queue and know when to stop when repeating the
>      put_user + dequeue loop.

We can achieve the same thing by doing step 6.5) at the beginning of
epoll_wait(). It's important to do it at the beginning of epoll_wait for
the reason you discuss in 8): if you wait until you notice that
unconsumed.queue is empty before refilling it from main.queue, you won't
be able to tell whether the events in main.queue were added after the
first event was dequeued.

Step 6.5) should be performed each time upon entry into epoll_wait(). It
does not matter whether unconsumed.queue happens to have enough events to
fill the maxevents request (and you don't want to iterate over
unconsumed.queue needlessly to count them): you can just do an O(1)
splice from main.queue into unconsumed.queue, and your original
semantics should be preserved.
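
As a minimal sketch of this entry-time refill (the field names
main_h/main_t and uncon_h/uncon_t and the __wfcq_dequeue() signature
are assumptions, not taken from the actual epoll patches):

#include <linux/wfcqueue.h>

/* Illustrative container; the real epoll structures differ. */
struct ep_queues_sketch {
	struct wfcq_head main_h, uncon_h;
	struct wfcq_tail main_t, uncon_t;
};

/* The caller is assumed to hold the mutex serializing dequeue/splice. */
static int ep_send_events_sketch(struct ep_queues_sketch *ep,
				 int maxevents)
{
	struct wfcq_node *node;
	int sent = 0;

	/*
	 * Step 6.5: O(1) splice on entry.  Leftover unconsumed events
	 * keep their place, new arrivals are appended after them, and
	 * anything enqueued into main.queue from here on is left for
	 * the next call.
	 */
	__wfcq_splice(&ep->uncon_h, &ep->uncon_t,
		      &ep->main_h, &ep->main_t);

	/* Step 7: put_user + dequeue, bounded by maxevents. */
	while (sent < maxevents &&
	       (node = __wfcq_dequeue(&ep->uncon_h, &ep->uncon_t))) {
		/* ... copy the event wrapped by 'node' to userspace ... */
		sent++;
	}
	return sent;
}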

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
  2013-03-17  2:03         ` Mathieu Desnoyers
@ 2013-03-18 10:37           ` Eric Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2013-03-18 10:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> Thanks for providing this detailed scenario. I think there is an
> important aspect of the splice usage I suggested on which we are not
> fully understanding each other. I will annotate your scenario below with
> clarifications:

Ah yes, I somehow thought splice would overwrite its destination.
I've confirmed everything works as expected and have an updated
RFC coming.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] wfcqueue: functions for local append and enqueue
  2013-03-11 21:36 [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation Mathieu Desnoyers
  2013-03-14  4:22 ` Eric Wong
  2013-03-15  0:49 ` Peter Hurley
@ 2013-03-21 11:43 ` Eric Wong
  2013-03-22  2:01   ` Mathieu Desnoyers
  2013-03-29  8:10 ` [PATCH] wfcqueue: add function for unsynchronized prepend Eric Wong
  2013-04-11 21:23 ` [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation Eric Wong
  4 siblings, 1 reply; 25+ messages in thread
From: Eric Wong @ 2013-03-21 11:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

With level-triggered epoll, append/enqueue operations to the
local/locked queues increase performance by avoiding unnecessary atomic
operations and barriers.  These are necessary to avoid performance
regressions when looping through ep_send_events and appending many
items to a queue.

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
---
  Benchmark for this coming with updated epoll patches.

 include/linux/wfcqueue.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
index 9464a0c..7eb2aaa 100644
--- a/include/linux/wfcqueue.h
+++ b/include/linux/wfcqueue.h
@@ -205,6 +205,49 @@ static inline bool wfcq_enqueue(struct wfcq_head *head,
 }
 
 /*
+ * __wfcq_append_local: append one local queue to another local queue
+ *
+ * No memory barriers are issued.  Mutual exclusion is the responsibility
+ * of the caller.
+ *
+ * Returns false if the queue was empty prior to adding the node.
+ * Returns true otherwise.
+ */
+static inline bool __wfcq_append_local(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *new_head,
+		struct wfcq_node *new_tail)
+{
+	struct wfcq_node *old_tail;
+
+	old_tail = tail->p;
+	tail->p = new_tail;
+	old_tail->next = new_head;
+
+	/*
+	 * Return false if queue was empty prior to adding the node,
+	 * else return true.
+	 */
+	return old_tail != &head->node;
+}
+
+/*
+ * wfcq_enqueue_local: enqueue a node into a local wait-free queue
+ *
+ * No memory barriers are issued.  Mutual exclusion is the responsibility
+ * of the caller.
+ *
+ * Returns false if the queue was empty prior to adding the node.
+ * Returns true otherwise.
+ */
+static inline bool wfcq_enqueue_local(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *new_tail)
+{
+	return __wfcq_append_local(head, tail, new_tail, new_tail);
+}
+
+/*
  * ___wfcq_busy_wait: busy-wait.
  */
 static inline void ___wfcq_busy_wait(void)
-- 
1.8.2.rc3.2.g89ce8d6
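
For illustration, a minimal usage sketch; the surrounding structure,
its lock, and the wfcq_node_init() name are assumptions, not anything
from the epoll patches:

#include <linux/mutex.h>
#include <linux/wfcqueue.h>

/* A queue protected by a mutex the caller already holds, so the
 * barrier-free local enqueue suffices. */
struct locked_queue_sketch {
	struct mutex lock;
	struct wfcq_head head;
	struct wfcq_tail tail;
};

static void lq_add_sketch(struct locked_queue_sketch *q,
			  struct wfcq_node *node)
{
	lockdep_assert_held(&q->lock);
	wfcq_node_init(node);	/* node init helper name assumed */
	wfcq_enqueue_local(&q->head, &q->tail, node);
}

Compared to wfcq_enqueue(), this saves one atomic operation and its
barriers on every item appended in a loop such as ep_send_events.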


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] wfcqueue: functions for local append and enqueue
  2013-03-21 11:43 ` [PATCH] wfcqueue: functions for local append and enqueue Eric Wong
@ 2013-03-22  2:01   ` Mathieu Desnoyers
  2013-03-22 10:31     ` Eric Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Mathieu Desnoyers @ 2013-03-22  2:01 UTC (permalink / raw)
  To: Eric Wong
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

* Eric Wong (normalperson@yhbt.net) wrote:
> With level-triggered epoll, append/enqueue operations to the
> local/locked queues increase performance by avoiding unnecessary atomic
> operations and barriers.  These are necessary to avoid performance
> regressions when looping through ep_send_events and appending many
> items to a queue.

Sounds like a good idea,

> 
> Signed-off-by: Eric Wong <normalperson@yhbt.net>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> ---
>   Benchmark for this coming with updated epoll patches.
> 
>  include/linux/wfcqueue.h | 43 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
> index 9464a0c..7eb2aaa 100644
> --- a/include/linux/wfcqueue.h
> +++ b/include/linux/wfcqueue.h
> @@ -205,6 +205,49 @@ static inline bool wfcq_enqueue(struct wfcq_head *head,
>  }
>  
>  /*
> + * __wfcq_append_local: append one local queue to another local queue
> + *
> + * No memory barriers are issued.  Mutual exclusion is the responsibility
> + * of the caller.
> + *
> + * Returns false if the queue was empty prior to adding the node.
> + * Returns true otherwise.
> + */
> +static inline bool __wfcq_append_local(struct wfcq_head *head,

Following the rest of the header, we could use:

___wfcq_append() for this function,

> +		struct wfcq_tail *tail,
> +		struct wfcq_node *new_head,
> +		struct wfcq_node *new_tail)
> +{
> +	struct wfcq_node *old_tail;
> +
> +	old_tail = tail->p;
> +	tail->p = new_tail;
> +	old_tail->next = new_head;
> +
> +	/*
> +	 * Return false if queue was empty prior to adding the node,
> +	 * else return true.
> +	 */
> +	return old_tail != &head->node;
> +}
> +
> +/*
> + * wfcq_enqueue_local: enqueue a node into a local wait-free queue
> + *
> + * No memory barriers are issued.  Mutual exclusion is the responsibility
> + * of the caller.
> + *
> + * Returns false if the queue was empty prior to adding the node.
> + * Returns true otherwise.
> + */
> +static inline bool wfcq_enqueue_local(struct wfcq_head *head,

and:

__wfcq_enqueue()

we should also update the "Synchronization table" at the beginning of
the file accordingly.

Thoughts?

Thanks,

Mathieu

> +		struct wfcq_tail *tail,
> +		struct wfcq_node *new_tail)
> +{
> +	return __wfcq_append_local(head, tail, new_tail, new_tail);
> +}
> +
> +/*
>   * ___wfcq_busy_wait: busy-wait.
>   */
>  static inline void ___wfcq_busy_wait(void)
> -- 
> 1.8.2.rc3.2.g89ce8d6
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] wfcqueue: functions for local append and enqueue
  2013-03-22  2:01   ` Mathieu Desnoyers
@ 2013-03-22 10:31     ` Eric Wong
  2013-03-23 19:07       ` [PATCH v2] " Eric Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Wong @ 2013-03-22 10:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> * Eric Wong (normalperson@yhbt.net) wrote:
> >  /*
> > + * __wfcq_append_local: append one local queue to another local queue
> > + *
> > + * No memory barriers are issued.  Mutual exclusion is the responsibility
> > + * of the caller.
> > + *
> > + * Returns false if the queue was empty prior to adding the node.
> > + * Returns true otherwise.
> > + */
> > +static inline bool __wfcq_append_local(struct wfcq_head *head,
> 
> Following the rest of the header, we could use:
> 
> ___wfcq_append() for this function,

OK.

However, I think ___ is a little confusing (but I'm easily confused :x).
I didn't even realize some functions in wfcqueue.h had ___ (vs __) until
just now(!)

But I will resend later tonight/tomorrow with ___ for consistency with
the existing API unless I've convinced you "_local" is easier :)

On a related note about underscores, I totally missed the existence of
____cacheline_aligned_in_smp in my first RFC even though I had seen
the __ version for .data around.

> and:
> 
> __wfcq_enqueue()
> 
> we should also update the "Synchronization table" at the beginning of
> the file accordingly.

Will update.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2] wfcqueue: functions for local append and enqueue
  2013-03-22 10:31     ` Eric Wong
@ 2013-03-23 19:07       ` Eric Wong
  2013-03-23 19:43         ` Mathieu Desnoyers
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Wong @ 2013-03-23 19:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

With level-triggered epoll, append/enqueue operations to the
local/locked queues increase performance by avoiding unnecessary atomic
operations and barriers.  These are necessary to avoid performance
regressions when looping through ep_send_events and appending many
items to a local queue.

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
---
 I noticed the original __wfcq_append function was not in the
 synchronization table, so I left out ___wfcq_append from the table
 as well.

 include/linux/wfcqueue.h | 59 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
index 9464a0c..800b1dd 100644
--- a/include/linux/wfcqueue.h
+++ b/include/linux/wfcqueue.h
@@ -55,14 +55,16 @@
  * [4] __wfcq_splice (source queue)
  * [5] __wfcq_first
  * [6] __wfcq_next
+ * [7] __wfcq_enqueue
  *
- *     [1] [2] [3] [4] [5] [6]
- * [1]  -   -   -   -   -   -
- * [2]  -   -   -   -   -   -
- * [3]  -   -   X   X   X   X
- * [4]  -   -   X   -   X   X
- * [5]  -   -   X   X   -   -
- * [6]  -   -   X   X   -   -
+ *     [1] [2] [3] [4] [5] [6] [7]
+ * [1]  -   -   -   -   -   -   X
+ * [2]  -   -   -   -   -   -   X
+ * [3]  -   -   X   X   X   X   X
+ * [4]  -   -   X   -   X   X   X
+ * [5]  -   -   X   X   -   -   X
+ * [6]  -   -   X   X   -   -   X
+ * [7]  X   X   X   X   X   X   X
  *
  * Besides locking, mutual exclusion of dequeue, splice and iteration
  * can be ensured by performing all of those operations from a single
@@ -205,6 +207,49 @@ static inline bool wfcq_enqueue(struct wfcq_head *head,
 }
 
 /*
+ * ___wfcq_append: append one local queue to another local queue
+ *
+ * No memory barriers are issued.  Mutual exclusion is the responsibility
+ * of the caller.
+ *
+ * Returns false if the queue was empty prior to adding the node.
+ * Returns true otherwise.
+ */
+static inline bool ___wfcq_append(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *new_head,
+		struct wfcq_node *new_tail)
+{
+	struct wfcq_node *old_tail;
+
+	old_tail = tail->p;
+	tail->p = new_tail;
+	old_tail->next = new_head;
+
+	/*
+	 * Return false if queue was empty prior to adding the node,
+	 * else return true.
+	 */
+	return old_tail != &head->node;
+}
+
+/*
+ * __wfcq_enqueue: enqueue a node into a local queue
+ *
+ * No memory barriers are issued.  Mutual exclusion is the responsibility
+ * of the caller.
+ *
+ * Returns false if the queue was empty prior to adding the node.
+ * Returns true otherwise.
+ */
+static inline bool __wfcq_enqueue(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *new_tail)
+{
+	return ___wfcq_append(head, tail, new_tail, new_tail);
+}
+
+/*
  * ___wfcq_busy_wait: busy-wait.
  */
 static inline void ___wfcq_busy_wait(void)
-- 
1.8.2.rc3.2.geae6cf5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] wfcqueue: functions for local append and enqueue
  2013-03-23 19:07       ` [PATCH v2] " Eric Wong
@ 2013-03-23 19:43         ` Mathieu Desnoyers
  2013-03-23 20:42           ` [PATCH v3] " Eric Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Mathieu Desnoyers @ 2013-03-23 19:43 UTC (permalink / raw)
  To: Eric Wong
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

* Eric Wong (normalperson@yhbt.net) wrote:
> With level-triggered epoll, append/enqueue operations to the
> local/locked queues increase performance by avoiding unnecessary atomic
> operations and barriers.  These are necessary to avoid performance
> regressions when looping through ep_send_events and appending many
> items to a local queue.
> 
> Signed-off-by: Eric Wong <normalperson@yhbt.net>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> ---
>  I noticed the original __wfcq_append function was not in the
>  synchronization table, so I left out ___wfcq_append from the table
>  as well.
> 
>  include/linux/wfcqueue.h | 59 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 52 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
> index 9464a0c..800b1dd 100644
> --- a/include/linux/wfcqueue.h
> +++ b/include/linux/wfcqueue.h
> @@ -55,14 +55,16 @@
>   * [4] __wfcq_splice (source queue)
>   * [5] __wfcq_first
>   * [6] __wfcq_next
> + * [7] __wfcq_enqueue
>   *
> - *     [1] [2] [3] [4] [5] [6]
> - * [1]  -   -   -   -   -   -
> - * [2]  -   -   -   -   -   -
> - * [3]  -   -   X   X   X   X
> - * [4]  -   -   X   -   X   X
> - * [5]  -   -   X   X   -   -
> - * [6]  -   -   X   X   -   -
> + *     [1] [2] [3] [4] [5] [6] [7]
> + * [1]  -   -   -   -   -   -   X
> + * [2]  -   -   -   -   -   -   X
> + * [3]  -   -   X   X   X   X   X
> + * [4]  -   -   X   -   X   X   X
> + * [5]  -   -   X   X   -   -   X
> + * [6]  -   -   X   X   -   -   X
> + * [7]  X   X   X   X   X   X   X
>   *
>   * Besides locking, mutual exclusion of dequeue, splice and iteration
>   * can be ensured by performing all of those operations from a single
> @@ -205,6 +207,49 @@ static inline bool wfcq_enqueue(struct wfcq_head *head,
>  }
>  
>  /*
> + * ___wfcq_append: append one local queue to another local queue
> + *
> + * No memory barriers are issued.  Mutual exclusion is the responsibility
> + * of the caller.
> + *
> + * Returns false if the queue was empty prior to adding the node.
> + * Returns true otherwise.
> + */

__wfcq_append() and ___wfcq_append() are meant to be private to
wfcqueue.h. Therefore, the comment above should be removed, since it is
not part of the API.

I notice that I should have used ___wfcq_append() for the original
append function for consistency (other private helpers in this header
are prefixed with ___).

So maybe we should rename __wfcq_append to ___wfcq_append (making it
clear that it is a private helper), and introduce your helper as
___wfcq_append_local() (I don't care about having "local" in there since
it is not part of the exposed API).

> +static inline bool ___wfcq_append(struct wfcq_head *head,
> +		struct wfcq_tail *tail,
> +		struct wfcq_node *new_head,
> +		struct wfcq_node *new_tail)
> +{
> +	struct wfcq_node *old_tail;
> +
> +	old_tail = tail->p;
> +	tail->p = new_tail;
> +	old_tail->next = new_head;
> +
> +	/*
> +	 * Return false if queue was empty prior to adding the node,
> +	 * else return true.
> +	 */
> +	return old_tail != &head->node;
> +}
> +
> +/*
> + * __wfcq_enqueue: enqueue a node into a local queue

The concept of "local queue" is not clearly defined.

Perhaps it would be clearer to state:

 * __wfcq_enqueue: enqueue a node into a queue, requiring mutual exclusion.

Thoughts?

Thanks,

Mathieu


> + *
> + * No memory barriers are issued.  Mutual exclusion is the responsibility
> + * of the caller.
> + *
> + * Returns false if the queue was empty prior to adding the node.
> + * Returns true otherwise.
> + */
> +static inline bool __wfcq_enqueue(struct wfcq_head *head,
> +		struct wfcq_tail *tail,
> +		struct wfcq_node *new_tail)
> +{
> +	return ___wfcq_append(head, tail, new_tail, new_tail);
> +}
> +
> +/*
>   * ___wfcq_busy_wait: busy-wait.
>   */
>  static inline void ___wfcq_busy_wait(void)
> -- 
> 1.8.2.rc3.2.geae6cf5
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3] wfcqueue: functions for local append and enqueue
  2013-03-23 19:43         ` Mathieu Desnoyers
@ 2013-03-23 20:42           ` Eric Wong
  2013-03-23 22:10             ` Mathieu Desnoyers
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Wong @ 2013-03-23 20:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> * Eric Wong (normalperson@yhbt.net) wrote:
> >  /*
> > + * ___wfcq_append: append one local queue to another local queue
> > + *

> __wfcq_append() and ___wfcq_append() are meant to be private to
> wfcqueue.h. Therefore, the comment above should be removed, since it is
> not part of the API.
> 
> I notice that I should have used ___wfcq_append() for the original
> append function for consistency (other private helpers in this header
> are prefixed with ___).
> 
> So maybe we should rename __wfcq_append to ___wfcq_append (making it
> clear that it is a private helper), and introduce your helper as
> ___wfcq_append_local() (I don't care about having "local" in there since
> it is not part of the exposed API).

Thanks for the explanation, I've squashed those renames into my patch
below and removed the comment.

> > +/*
> > + * __wfcq_enqueue: enqueue a node into a local queue
> 
> The concept of "local queue" is not clearly defined.
> 
> Perhaps it would be clearer to state:
> 
>  * __wfcq_enqueue: enqueue a node into a queue, requiring mutual exclusion.

Sounds good to me.  Updated patch below:

-------------------------------8<-----------------------------
From: Eric Wong <normalperson@yhbt.net>
Date: Sat, 23 Mar 2013 19:07:26 +0000
Subject: [PATCH] wfcqueue: functions for local append and enqueue

With level-triggered epoll, append/enqueue operations to the
local/locked queues increase performance by avoiding unnecessary atomic
operations and barriers.  These are necessary to avoid performance
regressions when looping through ep_send_events and appending many items
to a local queue where the caller already manages mutual exclusion.

Changes since v1 and v2:
* Function renaming and documentation updates
* Rename the existing private __wfcq_append to ___wfcq_append

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
---
 include/linux/wfcqueue.h | 56 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
index 9464a0c..a452ab9 100644
--- a/include/linux/wfcqueue.h
+++ b/include/linux/wfcqueue.h
@@ -55,14 +55,16 @@
  * [4] __wfcq_splice (source queue)
  * [5] __wfcq_first
  * [6] __wfcq_next
+ * [7] __wfcq_enqueue
  *
- *     [1] [2] [3] [4] [5] [6]
- * [1]  -   -   -   -   -   -
- * [2]  -   -   -   -   -   -
- * [3]  -   -   X   X   X   X
- * [4]  -   -   X   -   X   X
- * [5]  -   -   X   X   -   -
- * [6]  -   -   X   X   -   -
+ *     [1] [2] [3] [4] [5] [6] [7]
+ * [1]  -   -   -   -   -   -   X
+ * [2]  -   -   -   -   -   -   X
+ * [3]  -   -   X   X   X   X   X
+ * [4]  -   -   X   -   X   X   X
+ * [5]  -   -   X   X   -   -   X
+ * [6]  -   -   X   X   -   -   X
+ * [7]  X   X   X   X   X   X   X
  *
  * Besides locking, mutual exclusion of dequeue, splice and iteration
  * can be ensured by performing all of those operations from a single
@@ -147,7 +149,7 @@ static inline bool wfcq_empty(struct wfcq_head *head,
 		&& CMM_LOAD_SHARED(tail->p) == &head->node;
 }
 
-static inline bool __wfcq_append(struct wfcq_head *head,
+static inline bool ___wfcq_append(struct wfcq_head *head,
 		struct wfcq_tail *tail,
 		struct wfcq_node *new_head,
 		struct wfcq_node *new_tail)
@@ -201,7 +203,41 @@ static inline bool wfcq_enqueue(struct wfcq_head *head,
 		struct wfcq_tail *tail,
 		struct wfcq_node *new_tail)
 {
-	return __wfcq_append(head, tail, new_tail, new_tail);
+	return ___wfcq_append(head, tail, new_tail, new_tail);
+}
+
+static inline bool ___wfcq_append_local(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *new_head,
+		struct wfcq_node *new_tail)
+{
+	struct wfcq_node *old_tail;
+
+	old_tail = tail->p;
+	tail->p = new_tail;
+	old_tail->next = new_head;
+
+	/*
+	 * Return false if queue was empty prior to adding the node,
+	 * else return true.
+	 */
+	return old_tail != &head->node;
+}
+
+/*
+ * __wfcq_enqueue: enqueue a node into a queue, requiring mutual exclusion.
+ *
+ * No memory barriers are issued.  Mutual exclusion is the responsibility
+ * of the caller.
+ *
+ * Returns false if the queue was empty prior to adding the node.
+ * Returns true otherwise.
+ */
+static inline bool __wfcq_enqueue(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *new_tail)
+{
+	return ___wfcq_append_local(head, tail, new_tail, new_tail);
 }
 
 /*
@@ -398,7 +434,7 @@ static inline enum wfcq_ret __wfcq_splice(
 	 * Append the spliced content of src_q into dest_q. Does not
 	 * require mutual exclusion on dest_q (wait-free).
 	 */
-	if (__wfcq_append(dest_q_head, dest_q_tail, head, tail))
+	if (___wfcq_append(dest_q_head, dest_q_tail, head, tail))
 		return WFCQ_RET_DEST_NON_EMPTY;
 	else
 		return WFCQ_RET_DEST_EMPTY;
-- 
1.8.2.rc3.2.geae6cf5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3] wfcqueue: functions for local append and enqueue
  2013-03-23 20:42           ` [PATCH v3] " Eric Wong
@ 2013-03-23 22:10             ` Mathieu Desnoyers
  0 siblings, 0 replies; 25+ messages in thread
From: Mathieu Desnoyers @ 2013-03-23 22:10 UTC (permalink / raw)
  To: Eric Wong
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

* Eric Wong (normalperson@yhbt.net) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > * Eric Wong (normalperson@yhbt.net) wrote:
> > >  /*
> > > + * ___wfcq_append: append one local queue to another local queue
> > > + *
> 
> > __wfcq_append() and ___wfcq_append() are meant to be private to
> > wfcqueue.h. Therefore, the comment above should be removed, since it is
> > not part of the API.
> > 
> > I notice that I should have used ___wfcq_append() for the original
> > append function for consistency (other private helpers in this header
> > are prefixed with ___).
> > 
> > So maybe we should rename __wfcq_append to ___wfcq_append (making it
> > clear that it is a private helper), and introduce your helper as
> > ___wfcq_append_local() (I don't care about having "local" in there since
> > it is not part of the exposed API).
> 
> Thanks for the explanation, I've squashed those renames into my patch
> below and removed the comment.
> 
> > > +/*
> > > + * __wfcq_enqueue: enqueue a node into a local queue
> > 
> > The concept of "local queue" is not clearly defined.
> > 
> > Perhaps it would be clearer to state:
> > 
> >  * __wfcq_enqueue: enqueue a node into a queue, requiring mutual exclusion.
> 
> Sounds good to me.  Updated patch below:
> 
> -------------------------------8<-----------------------------
> From: Eric Wong <normalperson@yhbt.net>
> Date: Sat, 23 Mar 2013 19:07:26 +0000
> Subject: [PATCH] wfcqueue: functions for local append and enqueue
> 
> With level-triggered epoll, append/enqueue operations to the
> local/locked queues increase performance by avoiding unnecessary atomic
> operations and barriers.  These are necessary to avoid performance
> regressions when looping through ep_send_events and appending many items
> to a local queue where the caller already manages mutual exclusion.

Looks good to me. Thanks!

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Changes since v1 and v2:
> * Function renaming and documentation updates
> * Rename the existing private __wfcq_append to ___wfcq_append
> 
> Signed-off-by: Eric Wong <normalperson@yhbt.net>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> ---
>  include/linux/wfcqueue.h | 56 +++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 46 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
> index 9464a0c..a452ab9 100644
> --- a/include/linux/wfcqueue.h
> +++ b/include/linux/wfcqueue.h
> @@ -55,14 +55,16 @@
>   * [4] __wfcq_splice (source queue)
>   * [5] __wfcq_first
>   * [6] __wfcq_next
> + * [7] __wfcq_enqueue
>   *
> - *     [1] [2] [3] [4] [5] [6]
> - * [1]  -   -   -   -   -   -
> - * [2]  -   -   -   -   -   -
> - * [3]  -   -   X   X   X   X
> - * [4]  -   -   X   -   X   X
> - * [5]  -   -   X   X   -   -
> - * [6]  -   -   X   X   -   -
> + *     [1] [2] [3] [4] [5] [6] [7]
> + * [1]  -   -   -   -   -   -   X
> + * [2]  -   -   -   -   -   -   X
> + * [3]  -   -   X   X   X   X   X
> + * [4]  -   -   X   -   X   X   X
> + * [5]  -   -   X   X   -   -   X
> + * [6]  -   -   X   X   -   -   X
> + * [7]  X   X   X   X   X   X   X
>   *
>   * Besides locking, mutual exclusion of dequeue, splice and iteration
>   * can be ensured by performing all of those operations from a single
> @@ -147,7 +149,7 @@ static inline bool wfcq_empty(struct wfcq_head *head,
>  		&& CMM_LOAD_SHARED(tail->p) == &head->node;
>  }
>  
> -static inline bool __wfcq_append(struct wfcq_head *head,
> +static inline bool ___wfcq_append(struct wfcq_head *head,
>  		struct wfcq_tail *tail,
>  		struct wfcq_node *new_head,
>  		struct wfcq_node *new_tail)
> @@ -201,7 +203,41 @@ static inline bool wfcq_enqueue(struct wfcq_head *head,
>  		struct wfcq_tail *tail,
>  		struct wfcq_node *new_tail)
>  {
> -	return __wfcq_append(head, tail, new_tail, new_tail);
> +	return ___wfcq_append(head, tail, new_tail, new_tail);
> +}
> +
> +static inline bool ___wfcq_append_local(struct wfcq_head *head,
> +		struct wfcq_tail *tail,
> +		struct wfcq_node *new_head,
> +		struct wfcq_node *new_tail)
> +{
> +	struct wfcq_node *old_tail;
> +
> +	old_tail = tail->p;
> +	tail->p = new_tail;
> +	old_tail->next = new_head;
> +
> +	/*
> +	 * Return false if queue was empty prior to adding the node,
> +	 * else return true.
> +	 */
> +	return old_tail != &head->node;
> +}
> +
> +/*
> + * __wfcq_enqueue: enqueue a node into a queue, requiring mutual exclusion.
> + *
> + * No memory barriers are issued.  Mutual exclusion is the responsibility
> + * of the caller.
> + *
> + * Returns false if the queue was empty prior to adding the node.
> + * Returns true otherwise.
> + */
> +static inline bool __wfcq_enqueue(struct wfcq_head *head,
> +		struct wfcq_tail *tail,
> +		struct wfcq_node *new_tail)
> +{
> +	return ___wfcq_append_local(head, tail, new_tail, new_tail);
>  }
>  
>  /*
> @@ -398,7 +434,7 @@ static inline enum wfcq_ret __wfcq_splice(
>  	 * Append the spliced content of src_q into dest_q. Does not
>  	 * require mutual exclusion on dest_q (wait-free).
>  	 */
> -	if (__wfcq_append(dest_q_head, dest_q_tail, head, tail))
> +	if (___wfcq_append(dest_q_head, dest_q_tail, head, tail))
>  		return WFCQ_RET_DEST_NON_EMPTY;
>  	else
>  		return WFCQ_RET_DEST_EMPTY;
> -- 
> 1.8.2.rc3.2.geae6cf5
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] wfcqueue: add function for unsynchronized prepend
  2013-03-11 21:36 [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2013-03-21 11:43 ` [PATCH] wfcqueue: functions for local append and enqueue Eric Wong
@ 2013-03-29  8:10 ` Eric Wong
  2013-04-02 13:05   ` Mathieu Desnoyers
  2013-04-11 21:23 ` [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation Eric Wong
  4 siblings, 1 reply; 25+ messages in thread
From: Eric Wong @ 2013-03-29  8:10 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: linux-kernel

In some situations, it is necessary to prepend a node to a queue.
For epoll, this is necessary for two rare conditions:

* when the user triggers -EFAULT
* when reinjecting elements from the ovflist (implemented as a stack)

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 This is on top of my other patch to implement __wfcq_enqueue

 include/linux/wfcqueue.h | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
index a452ab9..4cb8f22 100644
--- a/include/linux/wfcqueue.h
+++ b/include/linux/wfcqueue.h
@@ -56,15 +56,17 @@
  * [5] __wfcq_first
  * [6] __wfcq_next
  * [7] __wfcq_enqueue
+ * [8] __wfcq_prepend
  *
- *     [1] [2] [3] [4] [5] [6] [7]
- * [1]  -   -   -   -   -   -   X
- * [2]  -   -   -   -   -   -   X
- * [3]  -   -   X   X   X   X   X
- * [4]  -   -   X   -   X   X   X
- * [5]  -   -   X   X   -   -   X
- * [6]  -   -   X   X   -   -   X
- * [7]  X   X   X   X   X   X   X
+ *     [1] [2] [3] [4] [5] [6] [7] [8]
+ * [1]  -   -   -   -   -   -   X   X
+ * [2]  -   -   -   -   -   -   X   X
+ * [3]  -   -   X   X   X   X   X   X
+ * [4]  -   -   X   -   X   X   X   X
+ * [5]  -   -   X   X   -   -   X   X
+ * [6]  -   -   X   X   -   -   X   X
+ * [7]  X   X   X   X   X   X   X   X
+ * [8]  X   X   X   X   X   X   X   X
  *
  * Besides locking, mutual exclusion of dequeue, splice and iteration
  * can be ensured by performing all of those operations from a single
@@ -441,6 +443,23 @@ static inline enum wfcq_ret __wfcq_splice(
 }
 
 /*
+ * __wfcq_prepend: prepend a node into a queue, requiring mutual exclusion.
+ *
+ * No memory barriers are issued.  Mutual exclusion is the responsibility
+ * of the caller.
+ */
+static inline void __wfcq_prepend(struct wfcq_head *head,
+				struct wfcq_tail *tail, struct wfcq_node *node)
+{
+	node->next = head->node.next;
+	head->node.next = node;
+
+	/* if the queue was empty before, it is no longer empty now */
+	if (tail->p == &head->node)
+		tail->p = node;
+}
+
+/*
  * __wfcq_for_each: Iterate over all nodes in a queue,
  * without dequeuing them.
  * @head: head of the queue (struct wfcq_head pointer).
-- 
Eric Wong
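
For illustration, a hedged sketch of the -EFAULT case; apart from
__wfcq_prepend() itself, the function shape and names are assumptions:

#include <linux/eventpoll.h>
#include <linux/uaccess.h>
#include <linux/wfcqueue.h>

/*
 * If copying an event to userspace faults, push its node back to the
 * head of the queue so the event is not lost and stays first in line
 * for the next epoll_wait().
 */
static int ep_send_one_sketch(struct wfcq_head *head,
			      struct wfcq_tail *tail,
			      struct wfcq_node *node,
			      struct epoll_event __user *uevent,
			      const struct epoll_event *kevent)
{
	if (__put_user(kevent->events, &uevent->events) ||
	    __put_user(kevent->data, &uevent->data)) {
		__wfcq_prepend(head, tail, node);
		return -EFAULT;
	}
	return 0;
}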


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] wfcqueue: add function for unsynchronized prepend
  2013-03-29  8:10 ` [PATCH] wfcqueue: add function for unsynchronized prepend Eric Wong
@ 2013-04-02 13:05   ` Mathieu Desnoyers
  2013-04-02 21:15     ` Eric Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Mathieu Desnoyers @ 2013-04-02 13:05 UTC (permalink / raw)
  To: Eric Wong; +Cc: linux-kernel, Paul E. McKenney, Lai Jiangshan

* Eric Wong (normalperson@yhbt.net) wrote:
> In some situations, it is necessary to prepend a node to a queue.
> For epoll, this is necessary for two rare conditions:
> 
> * when the user triggers -EFAULT
> * when reinjecting elements from the ovflist (implemented as a stack)

This approach makes sense.

In terms of API naming, I wonder if "prepend" is the right counterpart
for "enqueue". Maybe "enqueue_first" or "enqueue_head" would be better
suited?

Currently, we have an "append" function used internally, but it's not
exposed by the API.

Just for fun, I tried making your "prepend" wait-free (thus not
requiring a mutex), but it's really not obvious, because of its impact
on the splice and dequeue-last-node operations.

Thoughts?

Thanks,

Mathieu

> 
> Signed-off-by: Eric Wong <normalperson@yhbt.net>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> ---
>  This is on top of my other patch to implement __wfcq_enqueue
> 
>  include/linux/wfcqueue.h | 35 +++++++++++++++++++++++++++--------
>  1 file changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
> index a452ab9..4cb8f22 100644
> --- a/include/linux/wfcqueue.h
> +++ b/include/linux/wfcqueue.h
> @@ -56,15 +56,17 @@
>   * [5] __wfcq_first
>   * [6] __wfcq_next
>   * [7] __wfcq_enqueue
> + * [8] __wfcq_prepend
>   *
> - *     [1] [2] [3] [4] [5] [6] [7]
> - * [1]  -   -   -   -   -   -   X
> - * [2]  -   -   -   -   -   -   X
> - * [3]  -   -   X   X   X   X   X
> - * [4]  -   -   X   -   X   X   X
> - * [5]  -   -   X   X   -   -   X
> - * [6]  -   -   X   X   -   -   X
> - * [7]  X   X   X   X   X   X   X
> + *     [1] [2] [3] [4] [5] [6] [7] [8]
> + * [1]  -   -   -   -   -   -   X   X
> + * [2]  -   -   -   -   -   -   X   X
> + * [3]  -   -   X   X   X   X   X   X
> + * [4]  -   -   X   -   X   X   X   X
> + * [5]  -   -   X   X   -   -   X   X
> + * [6]  -   -   X   X   -   -   X   X
> + * [7]  X   X   X   X   X   X   X   X
> + * [8]  X   X   X   X   X   X   X   X
>   *
>   * Besides locking, mutual exclusion of dequeue, splice and iteration
>   * can be ensured by performing all of those operations from a single
> @@ -441,6 +443,23 @@ static inline enum wfcq_ret __wfcq_splice(
>  }
>  
>  /*
> + * __wfcq_prepend: prepend a node into a queue, requiring mutual exclusion.
> + *
> + * No memory barriers are issued.  Mutual exclusion is the responsibility
> + * of the caller.
> + */
> +static inline void __wfcq_prepend(struct wfcq_head *head,
> +				struct wfcq_tail *tail, struct wfcq_node *node)
> +{
> +	node->next = head->node.next;
> +	head->node.next = node;
> +
> +	/* if the queue was empty before, it is no longer empty now */
> +	if (tail->p == &head->node)
> +		tail->p = node;
> +}
> +
> +/*
>   * __wfcq_for_each: Iterate over all nodes in a queue,
>   * without dequeuing them.
>   * @head: head of the queue (struct wfcq_head pointer).
> -- 
> Eric Wong
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] wfcqueue: add function for unsynchronized prepend
  2013-04-02 13:05   ` Mathieu Desnoyers
@ 2013-04-02 21:15     ` Eric Wong
  2013-04-06 21:42       ` [RFC PATCH] wfcqueue: implement __wfcq_enqueue_head() Mathieu Desnoyers
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Wong @ 2013-04-02 21:15 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: linux-kernel, Paul E. McKenney, Lai Jiangshan

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> * Eric Wong (normalperson@yhbt.net) wrote:
> > In some situations, it is necessary to prepend a node to a queue.
> > For epoll, this is necessary for two rare conditions:
> > 
> > * when the user triggers -EFAULT
> > * when reinjecting elements from the ovflist (implemented as a stack)
> 
> This approach makes sense.
> 
> In terms of API naming, I wonder if "prepend" is the right counterpart
> for "enqueue". Maybe "enqueue_first" or "enqueue_head" would be better
> suited?

I'm not sure; I'll go with whatever name you decide is best :)

Thanks for looking at my patch; I'll post an updated version
once a name is decided.

-- 
Eric "isuckatnamingthings" Wong

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH] wfcqueue: implement __wfcq_enqueue_head()
  2013-04-02 21:15     ` Eric Wong
@ 2013-04-06 21:42       ` Mathieu Desnoyers
  0 siblings, 0 replies; 25+ messages in thread
From: Mathieu Desnoyers @ 2013-04-06 21:42 UTC (permalink / raw)
  To: Eric Wong; +Cc: linux-kernel, Paul E. McKenney, Lai Jiangshan

Implement enqueue-to-head. It can run concurrently with enqueue and
splice to queue, but requires a mutex against all other queue
operations.

Useful for special cases where nodes need to be enqueued at the head
of a queue.

This patch is only compile-tested.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/linux/wfcqueue.h |   54 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 47 insertions(+), 7 deletions(-)

Index: linux/include/linux/wfcqueue.h
===================================================================
--- linux.orig/include/linux/wfcqueue.h
+++ linux/include/linux/wfcqueue.h
@@ -55,14 +55,16 @@
  * [4] __wfcq_splice (source queue)
  * [5] __wfcq_first
  * [6] __wfcq_next
+ * [7] __wfcq_enqueue_head
  *
- *     [1] [2] [3] [4] [5] [6]
- * [1]  -   -   -   -   -   -
- * [2]  -   -   -   -   -   -
- * [3]  -   -   X   X   X   X
- * [4]  -   -   X   -   X   X
- * [5]  -   -   X   X   -   -
- * [6]  -   -   X   X   -   -
+ *     [1] [2] [3] [4] [5] [6] [7]
+ * [1]  -   -   -   -   -   -   -
+ * [2]  -   -   -   -   -   -   X
+ * [3]  -   -   X   X   X   X   X
+ * [4]  -   -   X   -   X   X   X
+ * [5]  -   -   X   X   -   -   X
+ * [6]  -   -   X   X   -   -   X
+ * [7]  -   X   X   X   X   X   X
  *
  * Besides locking, mutual exclusion of dequeue, splice and iteration
  * can be ensured by performing all of those operations from a single
@@ -230,6 +232,44 @@ ___wfcq_node_sync_next(struct wfcq_node
 }
 
 /*
+ * __wfcq_enqueue_head: prepend a node into a queue.
+ *
+ * No memory barriers are issued. Mutual exclusion is the responsibility
+ * of the caller.
+ *
+ * Returns false if the queue was empty prior to adding the node.
+ * Returns true otherwise.
+ */
+static inline bool __wfcq_enqueue_head(struct wfcq_head *head,
+		struct wfcq_tail *tail,
+		struct wfcq_node *node)
+{
+	bool not_empty = false;
+
+	/*
+	 * Move tail if queue was empty. Tail pointer is the
+	 * linearization point of enqueuers.
+	 */
+	if (cmpxchg(&tail->p, &head->node, node) != &head->node) {
+		not_empty = true;
+		/*
+		 * Queue was non-empty. We need to wait for
+		 * head->node.next to become non-NULL, because a
+		 * concurrent wfcq_append may be updating it.
+		 */
+		node->next = ___wfcq_node_sync_next(&head->node);
+	}
+	/*
+	 * From this point, we know that wfcq_append cannot touch
+	 * head->node.next, either because we successfully moved tail->p
+	 * to node, or because we waited for head->node.next to become
+	 * non-NULL. It is therefore safe to update it.
+	 */
+	head->node.next = node;
+	return not_empty;
+}
+
+/*
  * __wfcq_first: get first node of a queue, without dequeuing.
  *
  * Content written into the node before enqueue is guaranteed to be


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
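
As a usage sketch for the ovflist reinjection case mentioned earlier
in the thread (the names are assumptions; per the table above, the
caller must exclude every operation except concurrent enqueue):

#include <linux/wfcqueue.h>

/*
 * Drain a LIFO overflow stack, reinjecting each node at the queue
 * head.  The stack reversed the arrival order, so prepending the
 * popped nodes one by one restores the original order.
 */
static void reinject_stack_sketch(struct wfcq_head *head,
				  struct wfcq_tail *tail,
				  struct wfcq_node *stack_top)
{
	struct wfcq_node *node, *next;

	for (node = stack_top; node; node = next) {
		next = node->next;	/* save the stack link */
		wfcq_node_init(node);	/* name assumed; resets ->next */
		__wfcq_enqueue_head(head, tail, node);
	}
}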

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
  2013-03-11 21:36 [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2013-03-29  8:10 ` [PATCH] wfcqueue: add function for unsynchronized prepend Eric Wong
@ 2013-04-11 21:23 ` Eric Wong
  2013-04-11 22:44   ` Mathieu Desnoyers
  4 siblings, 1 reply; 25+ messages in thread
From: Eric Wong @ 2013-04-11 21:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> It is provided as a starting point only. Test cases should be ported
> from Userspace RCU to kernel space and thoroughly ran on a wide range of
> architectures before considering this port production-ready.

Hi Mathieu, any progress on this front?  I'm not sure how to approach
test cases in the kernel; perhaps taking hints from locking_selftest()
would be a start.

Fwiw, I've been running wfcqueue+epoll for a few weeks now on my
personal machines without problems :)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
  2013-04-11 21:23 ` [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation Eric Wong
@ 2013-04-11 22:44   ` Mathieu Desnoyers
  0 siblings, 0 replies; 25+ messages in thread
From: Mathieu Desnoyers @ 2013-04-11 22:44 UTC (permalink / raw)
  To: Eric Wong
  Cc: Lai Jiangshan, Paul E. McKenney, Stephen Hemminger,
	Davide Libenzi, linux-kernel

* Eric Wong (normalperson@yhbt.net) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > It is provided as a starting point only. Test cases should be ported
> > from Userspace RCU to kernel space and thoroughly ran on a wide range of
> > architectures before considering this port production-ready.
> 
> Hi Mathieu, any progress on this front?

Nope, I've been busy on other things.

> I'm not sure how to approach
> test cases in the kernel; perhaps taking hints from locking_selftest()
> would be a start.

I guess creating a wfcq_selftest() derived from the Userspace RCU
test_urcu_wfcq.c might be a good start.
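
As a starting point, a minimal single-threaded smoke test along those
lines could look like this (the init and dequeue helper names are
assumed from the Userspace RCU API; a real port of test_urcu_wfcq.c
would add concurrent enqueuer and dequeuer threads):

#include <linux/bug.h>
#include <linux/wfcqueue.h>

static struct wfcq_node nodes[3];

static void wfcq_selftest_sketch(void)
{
	struct wfcq_head head;
	struct wfcq_tail tail;
	int i;

	wfcq_init(&head, &tail);
	for (i = 0; i < 3; i++) {
		wfcq_node_init(&nodes[i]);
		wfcq_enqueue(&head, &tail, &nodes[i]);
	}
	/* FIFO order must be preserved. */
	for (i = 0; i < 3; i++)
		WARN_ON(__wfcq_dequeue(&head, &tail) != &nodes[i]);
	WARN_ON(!wfcq_empty(&head, &tail));
}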

> Fwiw, I've been running wfcqueue+epoll for a few weeks now on my
> personal machines without problems :)

Great!

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread

Thread overview: 25+ messages
2013-03-11 21:36 [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation Mathieu Desnoyers
2013-03-14  4:22 ` Eric Wong
2013-03-14 13:18   ` Mathieu Desnoyers
2013-03-14 19:07     ` Eric Wong
2013-03-14 19:48       ` Mathieu Desnoyers
2013-03-14 21:32         ` Eric Wong
2013-03-15  2:38           ` Mathieu Desnoyers
2013-03-16 22:02       ` Eric Wong
2013-03-17  2:03         ` Mathieu Desnoyers
2013-03-18 10:37           ` Eric Wong
2013-03-15  0:49 ` Peter Hurley
2013-03-15  2:08   ` Mathieu Desnoyers
2013-03-21 11:43 ` [PATCH] wfcqueue: functions for local append and enqueue Eric Wong
2013-03-22  2:01   ` Mathieu Desnoyers
2013-03-22 10:31     ` Eric Wong
2013-03-23 19:07       ` [PATCH v2] " Eric Wong
2013-03-23 19:43         ` Mathieu Desnoyers
2013-03-23 20:42           ` [PATCH v3] " Eric Wong
2013-03-23 22:10             ` Mathieu Desnoyers
2013-03-29  8:10 ` [PATCH] wfcqueue: add function for unsynchronized prepend Eric Wong
2013-04-02 13:05   ` Mathieu Desnoyers
2013-04-02 21:15     ` Eric Wong
2013-04-06 21:42       ` [RFC PATCH] wfcqueue: implement __wfcq_enqueue_head() Mathieu Desnoyers
2013-04-11 21:23 ` [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation Eric Wong
2013-04-11 22:44   ` Mathieu Desnoyers
