* [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
@ 2013-03-21 11:52 Eric Wong
From: Eric Wong @ 2013-03-21 11:52 UTC
  To: linux-kernel
  Cc: Davide Libenzi, Al Viro, Andrew Morton, Mathieu Desnoyers, linux-fsdevel

This is still not a proper commit; I've lightly tested this.

Replace the spinlock-protected linked list ready list with wfcqueue.

This improves performance under heavy, multi-threaded workloads with
multiple threads calling epoll_wait.

Using my multi-threaded EPOLLONESHOT microbenchmark, performance is
nearly doubled:

$ eponeshotmt -t 4 -w 4 -f 10 -c 1000000

Before:
real    0m 9.58s
user    0m 1.22s
sys     0m 37.08s

After:
real    0m 5.00s
user    0m 1.07s
sys     0m 18.92s

ref: http://yhbt.net/eponeshotmt.c

Unfortunately, there are still regressions for the common,
single threaded, Level Trigger use case.

Things changed/removed:

* ep->ovflist - is no longer needed, the main ready list continues
  to be appended to while we iterate through the transaction list
  (see the sketch after this list)

* ep_scan_ready_list - not enough generic code between users
  anymore to warrant this.  ep_poll_readyevents_proc (used for
  poll) is read-only, using __wfcq_for_each, while
  ep_send_events (used for epoll_wait) dequeues and needs
  __wfcq_for_each_safe.

* ep->lock renamed to ep->wqlock; this only protects waitqueues now
  (we use trylock to further avoid redundant wakeups)

* EPOLL_CTL_DEL/close() on a ready file will not immediately release
  epitem memory, epoll_wait() must be called since there's no way to
  delete a ready item from wfcqueue in O(1) time.  In practice this
  should not be a problem; any valid app using epoll must call
  epoll_wait occasionally.  Unfreed epitems still count against
  max_user_watches to protect against local DoS.  This should be the
  only possibly-noticeable change (in case there's an app that blindly
  adds/deletes things from the rbtree but never calls epoll_wait)

Changes since v1:
* fixed memory leak with pre-existing zombies in ep_free
* updated to use the latest wfcqueue.h APIs
* switched to using __wfcq_splice and a global transaction list
  (this is like the old txlist in ep_scan_ready_list)
* redundant wakeups avoided in ep_notify_waiters:
  - only attempt a wakeup when an item is enqueued the first time
  - use spin_trylock_irqsave when attempting notification, since a
    failed lock means either another task is already waking, or
    ep_poll is already running and will check anyway upon releasing
    wqlock.
* explicitly cache-aligned rdltail in SMP
* added ep_item_state for atomically reading epi->state with barrier
  (avoids WARN_ON in ep_send_events)
* reverted epi->nwait removal, it was not necessary
  sizeof(epitem) is still <= 128 bytes on 64-bit machines

Changes since v2:
* epi->state is no longer atomic, we only cmpxchg in ep_poll_callback
  now and rely on implicit barriers in other places for reading
  (see the sketch after this list).
* intermediate EP_STATE_DEQUEUE removed, this (with xchg) caused too
  much overhead in the ep_send_events loop and could not eliminate
  starvation dangers from improper EPOLLET usage (the original code
  had this problem, too, the window is just a few cycles larger, now).
* minor code cleanups

Lightly-tested-by: Eric Wong <normalperson@yhbt.net>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---

 fs/eventpoll.c | 615 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 276 insertions(+), 339 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a81f0ea..e039555 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -40,6 +40,7 @@
 #include <linux/atomic.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/wfcqueue.h>
 
 /*
  * LOCKING:
@@ -47,15 +48,13 @@
  *
  * 1) epmutex (mutex)
  * 2) ep->mtx (mutex)
- * 3) ep->lock (spinlock)
+ * 3) ep->wqlock (spinlock)
  *
- * The acquire order is the one listed above, from 1 to 3.
- * We need a spinlock (ep->lock) because we manipulate objects
- * from inside the poll callback, that might be triggered from
- * a wake_up() that in turn might be called from IRQ context.
- * So we can't sleep inside the poll callback and hence we need
- * a spinlock. During the event transfer loop (from kernel to
- * user space) we could end up sleeping due a copy_to_user(), so
+ * The acquire order is the one listed above, from 1 to 2.
+ *
+ * We only have a spinlock (ep->wqlock) to manipulate the waitqueues.
+ * During the event transfer loop (from kernel to user space)
+ * we could end up sleeping due to a copy_to_user(), so
  * we need a lock that will allow us to sleep. This lock is a
  * mutex (ep->mtx). It is acquired during the event transfer loop,
  * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
@@ -82,8 +81,8 @@
  * of epoll file descriptors, we use the current recursion depth as
  * the lockdep subkey.
  * It is possible to drop the "ep->mtx" and to use the global
- * mutex "epmutex" (together with "ep->lock") to have it working,
- * but having "ep->mtx" will make the interface more scalable.
+ * mutex "epmutex" to have it working, but having "ep->mtx" will
+ * make the interface more scalable.
  * Events that require holding "epmutex" are very rare, while for
  * normal operations the epoll private "ep->mtx" will guarantee
  * a better scalability.
@@ -126,6 +125,21 @@ struct nested_calls {
 };
 
 /*
+ * epitem state transitions               callers
+ * ---------------------------------------------------------------------------
+ * EP_STATE_IDLE    -> EP_STATE_READY     ep_poll_callback, ep_enqueue_process
+ * EP_STATE_READY   -> EP_STATE_IDLE      ep_send_events
+ * EP_STATE_IDLE    -> kmem_cache_free    ep_remove
+ * EP_STATE_READY   -> EP_STATE_ZOMBIE    ep_remove
+ * EP_STATE_ZOMBIE  -> kmem_cache_free    ep_send_events
+ */
+enum epoll_item_state {
+	EP_STATE_ZOMBIE = -1,
+	EP_STATE_IDLE = 0,
+	EP_STATE_READY = 1
+};
+
+/*
  * Each file descriptor added to the eventpoll interface will
  * have an entry of this type linked to the "rbr" RB tree.
  * Avoid increasing the size of this struct, there can be many thousands
@@ -136,13 +150,7 @@ struct epitem {
 	struct rb_node rbn;
 
 	/* List header used to link this structure to the eventpoll ready list */
-	struct list_head rdllink;
-
-	/*
-	 * Works together "struct eventpoll"->ovflist in keeping the
-	 * single linked chain of items.
-	 */
-	struct epitem *next;
+	struct wfcq_node rdllink;
 
 	/* The file descriptor information this item refers to */
 	struct epoll_filefd ffd;
@@ -150,6 +158,9 @@ struct epitem {
 	/* Number of active wait queue attached to poll operations */
 	int nwait;
 
+	/* state of this item */
+	enum epoll_item_state state;
+
 	/* List containing poll wait queues */
 	struct list_head pwqlist;
 
@@ -172,37 +183,38 @@ struct epitem {
  * interface.
  */
 struct eventpoll {
-	/* Protect the access to this structure */
-	spinlock_t lock;
-
 	/*
 	 * This mutex is used to ensure that files are not removed
 	 * while epoll is using them. This is held during the event
 	 * collection loop, the file cleanup path, the epoll file exit
 	 * code and the ctl operations.
+	 * This also protects rdlhead, txlhead, and txltail.
 	 */
 	struct mutex mtx;
 
+	/* head of the enqueue ready list */
+	struct wfcq_head rdlhead;
+
+	/*
+	 * transactional ready list, these are only accessed when ep->mtx
+	 * is held.  We splice into these (from rdlhead) and work on this list
+	 */
+	struct wfcq_head txlhead;
+	struct wfcq_tail txltail;
+
+	/* Protect the access to wait queues */
+	spinlock_t wqlock;
+
 	/* Wait queue used by sys_epoll_wait() */
 	wait_queue_head_t wq;
 
 	/* Wait queue used by file->poll() */
 	wait_queue_head_t poll_wait;
 
-	/* List of ready file descriptors */
-	struct list_head rdllist;
-
 	/* RB tree root used to store monitored fd structs */
 	struct rb_root rbr;
 
-	/*
-	 * This is a single linked list that chains all the "struct epitem" that
-	 * happened while transferring ready events to userspace w/out
-	 * holding ->lock.
-	 */
-	struct epitem *ovflist;
-
-	/* wakeup_source used when ep_scan_ready_list is running */
+	/* wakeup_source used when ep_send_events is running */
 	struct wakeup_source *ws;
 
 	/* The user that created the eventpoll descriptor */
@@ -213,6 +225,8 @@ struct eventpoll {
 	/* used to optimize loop detection check */
 	int visited;
 	struct list_head visited_list_link;
+
+	struct wfcq_tail rdltail ____cacheline_aligned_in_smp;
 };
 
 /* Wait structure used by the poll hooks */
@@ -239,12 +253,6 @@ struct ep_pqueue {
 	struct epitem *epi;
 };
 
-/* Used by the ep_send_events() function as callback private data */
-struct ep_send_events_data {
-	int maxevents;
-	struct epoll_event __user *events;
-};
-
 /*
  * Configuration options available inside /proc/sys/fs/epoll/
  */
@@ -347,6 +355,12 @@ static inline struct epitem *ep_item_from_epqueue(poll_table *p)
 	return container_of(p, struct ep_pqueue, pt)->epi;
 }
 
+/* Get the "struct epitem" from a wfcq node */
+static inline struct epitem *ep_item_from_node(struct wfcq_node *p)
+{
+	return container_of(p, struct epitem, rdllink);
+}
+
 /* Tells if the epoll_ctl(2) operation needs an event copy from userspace */
 static inline int ep_op_has_event(int op)
 {
@@ -360,17 +374,30 @@ static void ep_nested_calls_init(struct nested_calls *ncalls)
 	spin_lock_init(&ncalls->lock);
 }
 
+/* splice newly queued ready items onto our transaction list */
+static inline enum wfcq_ret ep_ready_prepare(struct eventpoll *ep)
+{
+	lockdep_assert_held(&ep->mtx);
+
+	return __wfcq_splice(&ep->txlhead, &ep->txltail,
+				&ep->rdlhead, &ep->rdltail);
+}
+
 /**
  * ep_events_available - Checks if ready events might be available.
+ *                       This must be called with ep->mtx held
  *
  * @ep: Pointer to the eventpoll context.
  *
- * Returns: Returns a value different than zero if ready events are available,
- *          or zero otherwise.
+ * Returns: Returns true if ready events are available, or false otherwise.
  */
-static inline int ep_events_available(struct eventpoll *ep)
+static inline bool ep_events_available(struct eventpoll *ep)
 {
-	return !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
+	enum wfcq_ret ret = ep_ready_prepare(ep);
+
+	if (ret == WFCQ_RET_SRC_EMPTY)
+		return !wfcq_empty(&ep->txlhead, &ep->txltail);
+	return true;
 }
 
 /**
@@ -507,6 +534,34 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
 	put_cpu();
 }
 
+/* try to wake up tasks in epoll_(p)wait, or poll() users of this epfd */
+static void ep_notify_waiters(struct eventpoll *ep)
+{
+	unsigned long flags;
+	int pwake;
+
+	/*
+	 * If we can't get the wqlock, this means either:
+	 * ep_poll is working on it and will recheck the ready list when
+	 * it releases wqlock anyways
+	 * ...Or another task is running this function and another wakeup
+	 * would be redundant
+	 */
+	if (spin_trylock_irqsave(&ep->wqlock, flags)) {
+		/* Notify epoll_wait tasks */
+		if (waitqueue_active(&ep->wq))
+			wake_up_locked(&ep->wq);
+
+		pwake = waitqueue_active(&ep->poll_wait);
+
+		spin_unlock_irqrestore(&ep->wqlock, flags);
+
+		/* Notify the ->poll() wait list (may be outside lock) */
+		if (pwake)
+			ep_poll_safewake(&ep->poll_wait);
+	}
+}
+
 static void ep_remove_wait_queue(struct eppoll_entry *pwq)
 {
 	wait_queue_head_t *whead;
@@ -570,104 +625,11 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
 	rcu_read_unlock();
 }
 
-/**
- * ep_scan_ready_list - Scans the ready list in a way that makes possible for
- *                      the scan code, to call f_op->poll(). Also allows for
- *                      O(NumReady) performance.
- *
- * @ep: Pointer to the epoll private data structure.
- * @sproc: Pointer to the scan callback.
- * @priv: Private opaque data passed to the @sproc callback.
- * @depth: The current depth of recursive f_op->poll calls.
- *
- * Returns: The same integer error code returned by the @sproc callback.
- */
-static int ep_scan_ready_list(struct eventpoll *ep,
-			      int (*sproc)(struct eventpoll *,
-					   struct list_head *, void *),
-			      void *priv,
-			      int depth)
+static noinline void ep_reap(struct eventpoll *ep, struct epitem *epi)
 {
-	int error, pwake = 0;
-	unsigned long flags;
-	struct epitem *epi, *nepi;
-	LIST_HEAD(txlist);
-
-	/*
-	 * We need to lock this because we could be hit by
-	 * eventpoll_release_file() and epoll_ctl().
-	 */
-	mutex_lock_nested(&ep->mtx, depth);
-
-	/*
-	 * Steal the ready list, and re-init the original one to the
-	 * empty list. Also, set ep->ovflist to NULL so that events
-	 * happening while looping w/out locks, are not lost. We cannot
-	 * have the poll callback to queue directly on ep->rdllist,
-	 * because we want the "sproc" callback to be able to do it
-	 * in a lockless way.
-	 */
-	spin_lock_irqsave(&ep->lock, flags);
-	list_splice_init(&ep->rdllist, &txlist);
-	ep->ovflist = NULL;
-	spin_unlock_irqrestore(&ep->lock, flags);
-
-	/*
-	 * Now call the callback function.
-	 */
-	error = (*sproc)(ep, &txlist, priv);
-
-	spin_lock_irqsave(&ep->lock, flags);
-	/*
-	 * During the time we spent inside the "sproc" callback, some
-	 * other events might have been queued by the poll callback.
-	 * We re-insert them inside the main ready-list here.
-	 */
-	for (nepi = ep->ovflist; (epi = nepi) != NULL;
-	     nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
-		/*
-		 * We need to check if the item is already in the list.
-		 * During the "sproc" callback execution time, items are
-		 * queued into ->ovflist but the "txlist" might already
-		 * contain them, and the list_splice() below takes care of them.
-		 */
-		if (!ep_is_linked(&epi->rdllink)) {
-			list_add_tail(&epi->rdllink, &ep->rdllist);
-			ep_pm_stay_awake(epi);
-		}
-	}
-	/*
-	 * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
-	 * releasing the lock, events will be queued in the normal way inside
-	 * ep->rdllist.
-	 */
-	ep->ovflist = EP_UNACTIVE_PTR;
-
-	/*
-	 * Quickly re-inject items left on "txlist".
-	 */
-	list_splice(&txlist, &ep->rdllist);
-	__pm_relax(ep->ws);
-
-	if (!list_empty(&ep->rdllist)) {
-		/*
-		 * Wake up (if active) both the eventpoll wait list and
-		 * the ->poll() wait list (delayed after we release the lock).
-		 */
-		if (waitqueue_active(&ep->wq))
-			wake_up_locked(&ep->wq);
-		if (waitqueue_active(&ep->poll_wait))
-			pwake++;
-	}
-	spin_unlock_irqrestore(&ep->lock, flags);
-
-	mutex_unlock(&ep->mtx);
-
-	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
-
-	return error;
+	wakeup_source_unregister(ep_wakeup_source(epi));
+	kmem_cache_free(epi_cache, epi);
+	atomic_long_dec(&ep->user->epoll_watches);
 }
 
 /*
@@ -676,17 +638,9 @@ static int ep_scan_ready_list(struct eventpoll *ep,
  */
 static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 {
-	unsigned long flags;
 	struct file *file = epi->ffd.file;
 
-	/*
-	 * Removes poll wait queue hooks. We _have_ to do this without holding
-	 * the "ep->lock" otherwise a deadlock might occur. This because of the
-	 * sequence of the lock acquisition. Here we do "ep->lock" then the wait
-	 * queue head lock when unregistering the wait queue. The wakeup callback
-	 * will run by holding the wait queue head lock and will call our callback
-	 * that will try to get "ep->lock".
-	 */
+	/* Removes poll wait queue hooks, epi->state cannot change after this */
 	ep_unregister_pollwait(ep, epi);
 
 	/* Remove the current item from the list of epoll hooks */
@@ -697,17 +651,12 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 
 	rb_erase(&epi->rbn, &ep->rbr);
 
-	spin_lock_irqsave(&ep->lock, flags);
-	if (ep_is_linked(&epi->rdllink))
-		list_del_init(&epi->rdllink);
-	spin_unlock_irqrestore(&ep->lock, flags);
-
-	wakeup_source_unregister(ep_wakeup_source(epi));
-
-	/* At this point it is safe to free the eventpoll item */
-	kmem_cache_free(epi_cache, epi);
-
-	atomic_long_dec(&ep->user->epoll_watches);
+	/* rely on ep_send_events to cleanup if we got enqueued */
+	if (epi->state == EP_STATE_READY)
+		epi->state = EP_STATE_ZOMBIE;
+	else
+		/* At this point it is safe to free the eventpoll item */
+		ep_reap(ep, epi);
 
 	return 0;
 }
@@ -716,6 +665,7 @@ static void ep_free(struct eventpoll *ep)
 {
 	struct rb_node *rbp;
 	struct epitem *epi;
+	struct wfcq_node *node, *n;
 
 	/* We need to release all tasks waiting for these file */
 	if (waitqueue_active(&ep->poll_wait))
@@ -740,17 +690,27 @@ static void ep_free(struct eventpoll *ep)
 		ep_unregister_pollwait(ep, epi);
 	}
 
+	/* We only lock ep->mtx to prevent a lockdep warning */
+	mutex_lock(&ep->mtx);
+
+	/* reap zombies, these are no longer in ep->rbr */
+	ep_ready_prepare(ep);
+	__wfcq_for_each_safe(&ep->txlhead, &ep->txltail, node, n) {
+		epi = ep_item_from_node(node);
+		__wfcq_dequeue(&ep->txlhead, &ep->txltail);
+		if (epi->state == EP_STATE_ZOMBIE)
+			ep_reap(ep, epi);
+	}
+
 	/*
 	 * Walks through the whole tree by freeing each "struct epitem". At this
 	 * point we are sure no poll callbacks will be lingering around, and also by
 	 * holding "epmutex" we can be sure that no file cleanup code will hit
-	 * us during this operation. So we can avoid the lock on "ep->lock".
-	 * We do not need to lock ep->mtx, either, we only do it to prevent
-	 * a lockdep warning.
+	 * us during this operation.
 	 */
-	mutex_lock(&ep->mtx);
 	while ((rbp = rb_first(&ep->rbr)) != NULL) {
 		epi = rb_entry(rbp, struct epitem, rbn);
+		epi->state = EP_STATE_IDLE; /* prevent new zombies */
 		ep_remove(ep, epi);
 	}
 	mutex_unlock(&ep->mtx);
@@ -779,34 +739,50 @@ static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
 	return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
 }
 
-static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
-			       void *priv)
+/* used by our own poll() implementation */
+static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
 {
-	struct epitem *epi, *tmp;
+	struct eventpoll *ep = priv;
+	struct epitem *epi;
+	struct wfcq_node *node;
+	int ret = 0;
 	poll_table pt;
 
 	init_poll_funcptr(&pt, NULL);
 
-	list_for_each_entry_safe(epi, tmp, head, rdllink) {
-		if (ep_item_poll(epi, &pt))
-			return POLLIN | POLLRDNORM;
-		else {
+	/*
+	 * this iteration should be safe because:
+	 * 1) ep->mtx is locked, preventing dequeue and epi invalidation
+	 * 2) iteration stops at the tail as it was when we started
+	 */
+	mutex_lock_nested(&ep->mtx, call_nests + 1);
+	ep_ready_prepare(ep);
+	__wfcq_for_each(&ep->txlhead, &ep->txltail, node) {
+		epi = ep_item_from_node(node);
+
+		/* zombie item, let ep_send_events cleanup */
+		if (epi->state == EP_STATE_ZOMBIE)
+			continue;
+
+		pt._key = epi->event.events;
+
+		if (ep_item_poll(epi, &pt)) {
+			ret = POLLIN | POLLRDNORM;
+			break;
+		} else {
 			/*
 			 * Item has been dropped into the ready list by the poll
 			 * callback, but it's not actually ready, as far as
-			 * caller requested events goes. We can remove it here.
+			 * caller requested events goes.
+			 * We cannot remove it here, ep_send_events will
+			 * recheck and remove if possible
 			 */
 			__pm_relax(ep_wakeup_source(epi));
-			list_del_init(&epi->rdllink);
 		}
 	}
+	mutex_unlock(&ep->mtx);
 
-	return 0;
-}
-
-static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
-{
-	return ep_scan_ready_list(priv, ep_read_events_proc, NULL, call_nests + 1);
+	return ret;
 }
 
 static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
@@ -913,13 +889,13 @@ static int ep_alloc(struct eventpoll **pep)
 	if (unlikely(!ep))
 		goto free_uid;
 
-	spin_lock_init(&ep->lock);
+	spin_lock_init(&ep->wqlock);
 	mutex_init(&ep->mtx);
 	init_waitqueue_head(&ep->wq);
 	init_waitqueue_head(&ep->poll_wait);
-	INIT_LIST_HEAD(&ep->rdllist);
+	wfcq_init(&ep->rdlhead, &ep->rdltail);
+	wfcq_init(&ep->txlhead, &ep->txltail);
 	ep->rbr = RB_ROOT;
-	ep->ovflist = EP_UNACTIVE_PTR;
 	ep->user = user;
 
 	*pep = ep;
@@ -960,6 +936,14 @@ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
 	return epir;
 }
 
+/* returns true if successfully marked ready, false if it was already ready */
+static inline bool ep_mark_ready(struct epitem *epi)
+{
+	/* We only want to go from idle -> ready */
+	return (cmpxchg(&epi->state, EP_STATE_IDLE, EP_STATE_READY)
+	        == EP_STATE_IDLE);
+}
+
 /*
  * This is the callback that is passed to the wait queue wakeup
  * mechanism. It is called by the stored file descriptors when they
@@ -967,8 +951,6 @@ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
  */
 static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
-	int pwake = 0;
-	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
 
@@ -983,7 +965,8 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 		list_del_init(&wait->task_list);
 	}
 
-	spin_lock_irqsave(&ep->lock, flags);
+	/* pairs with smp_mb in ep_modify */
+	smp_rmb();
 
 	/*
 	 * If the event mask does not contain any poll(2) event, we consider the
@@ -992,7 +975,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * until the next EPOLL_CTL_MOD will be issued.
 	 */
 	if (!(epi->event.events & ~EP_PRIVATE_BITS))
-		goto out_unlock;
+		return 1;
 
 	/*
 	 * Check the events coming with the callback. At this stage, not
@@ -1001,52 +984,14 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * test for "key" != NULL before the event match test.
 	 */
 	if (key && !((unsigned long) key & epi->event.events))
-		goto out_unlock;
-
-	/*
-	 * If we are transferring events to userspace, we can hold no locks
-	 * (because we're accessing user memory, and because of linux f_op->poll()
-	 * semantics). All the events that happen during that period of time are
-	 * chained in ep->ovflist and requeued later on.
-	 */
-	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
-		if (epi->next == EP_UNACTIVE_PTR) {
-			epi->next = ep->ovflist;
-			ep->ovflist = epi;
-			if (epi->ws) {
-				/*
-				 * Activate ep->ws since epi->ws may get
-				 * deactivated at any time.
-				 */
-				__pm_stay_awake(ep->ws);
-			}
+		return 1;
 
-		}
-		goto out_unlock;
-	}
-
-	/* If this file is already in the ready list we exit soon */
-	if (!ep_is_linked(&epi->rdllink)) {
-		list_add_tail(&epi->rdllink, &ep->rdllist);
+	if (ep_mark_ready(epi)) {
+		wfcq_enqueue(&ep->rdlhead, &ep->rdltail, &epi->rdllink);
 		ep_pm_stay_awake_rcu(epi);
+		ep_notify_waiters(ep);
 	}
 
-	/*
-	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
-	 * wait list.
-	 */
-	if (waitqueue_active(&ep->wq))
-		wake_up_locked(&ep->wq);
-	if (waitqueue_active(&ep->poll_wait))
-		pwake++;
-
-out_unlock:
-	spin_unlock_irqrestore(&ep->lock, flags);
-
-	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
-
 	return 1;
 }
 
@@ -1224,14 +1169,23 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi)
 	wakeup_source_unregister(ws);
 }
 
+/* enqueue an epitem in process context via epoll_ctl */
+static void ep_enqueue_process(struct eventpoll *ep, struct epitem *epi)
+{
+	if (ep_mark_ready(epi)) {
+		wfcq_enqueue(&ep->rdlhead, &ep->rdltail, &epi->rdllink);
+		ep_pm_stay_awake(epi);
+		ep_notify_waiters(ep);
+	}
+}
+
 /*
  * Must be called with "mtx" held.
  */
 static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 		     struct file *tfile, int fd)
 {
-	int error, revents, pwake = 0;
-	unsigned long flags;
+	int error, revents;
 	long user_watches;
 	struct epitem *epi;
 	struct ep_pqueue epq;
@@ -1243,14 +1197,14 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 		return -ENOMEM;
 
 	/* Item initialization follow here ... */
-	INIT_LIST_HEAD(&epi->rdllink);
+	wfcq_node_init(&epi->rdllink);
 	INIT_LIST_HEAD(&epi->fllink);
 	INIT_LIST_HEAD(&epi->pwqlist);
 	epi->ep = ep;
+	epi->state = EP_STATE_IDLE;
 	ep_set_ffd(&epi->ffd, tfile, fd);
-	epi->event = *event;
 	epi->nwait = 0;
-	epi->next = EP_UNACTIVE_PTR;
+	epi->event = *event;
 	if (epi->event.events & EPOLLWAKEUP) {
 		error = ep_create_wakeup_source(epi);
 		if (error)
@@ -1297,28 +1251,11 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 	if (reverse_path_check())
 		goto error_remove_epi;
 
-	/* We have to drop the new item inside our item list to keep track of it */
-	spin_lock_irqsave(&ep->lock, flags);
-
-	/* If the file is already "ready" we drop it inside the ready list */
-	if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
-		list_add_tail(&epi->rdllink, &ep->rdllist);
-		ep_pm_stay_awake(epi);
-
-		/* Notify waiting tasks that events are available */
-		if (waitqueue_active(&ep->wq))
-			wake_up_locked(&ep->wq);
-		if (waitqueue_active(&ep->poll_wait))
-			pwake++;
-	}
-
-	spin_unlock_irqrestore(&ep->lock, flags);
-
 	atomic_long_inc(&ep->user->epoll_watches);
 
-	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
+	/* If the file is already "ready" we drop it inside the ready list */
+	if (revents & event->events)
+		ep_enqueue_process(ep, epi);
 
 	return 0;
 
@@ -1331,18 +1268,26 @@ error_remove_epi:
 	rb_erase(&epi->rbn, &ep->rbr);
 
 error_unregister:
+	/* epi->state cannot be set to EP_STATE_READY after this: */
 	ep_unregister_pollwait(ep, epi);
 
 	/*
+	 * ep_poll_callback may've fired during ep_unregister_pollwait
+	 * and set epi->state via cmpxchg, ensure we see it
+	 */
+	smp_rmb();
+
+	/*
+	 * Rely on ep_send_events to free epi if we got enqueued.
 	 * We need to do this because an event could have been arrived on some
-	 * allocated wait queue. Note that we don't care about the ep->ovflist
-	 * list, since that is used/cleaned only inside a section bound by "mtx".
-	 * And ep_insert() is called with "mtx" held.
+	 * allocated wait queue.
 	 */
-	spin_lock_irqsave(&ep->lock, flags);
-	if (ep_is_linked(&epi->rdllink))
-		list_del_init(&epi->rdllink);
-	spin_unlock_irqrestore(&ep->lock, flags);
+	if (epi->state == EP_STATE_READY) {
+		/* count the zombie against user watches to prevent OOM DoS */
+		atomic_long_inc(&ep->user->epoll_watches);
+		epi->state = EP_STATE_ZOMBIE;
+		return error;
+	}
 
 	wakeup_source_unregister(ep_wakeup_source(epi));
 
@@ -1358,7 +1303,6 @@ error_create_wakeup_source:
  */
 static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_event *event)
 {
-	int pwake = 0;
 	unsigned int revents;
 	poll_table pt;
 
@@ -1384,9 +1328,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
 	 * 1) Flush epi changes above to other CPUs.  This ensures
 	 *    we do not miss events from ep_poll_callback if an
 	 *    event occurs immediately after we call f_op->poll().
-	 *    We need this because we did not take ep->lock while
-	 *    changing epi above (but ep_poll_callback does take
-	 *    ep->lock).
+	 *    This pairs with the first smp_rmb in ep_poll_callback
 	 *
 	 * 2) We also need to ensure we do not miss _past_ events
 	 *    when calling f_op->poll().  This barrier also
@@ -1408,49 +1350,42 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
 	 * If the item is "hot" and it is not registered inside the ready
 	 * list, push it inside.
 	 */
-	if (revents & event->events) {
-		spin_lock_irq(&ep->lock);
-		if (!ep_is_linked(&epi->rdllink)) {
-			list_add_tail(&epi->rdllink, &ep->rdllist);
-			ep_pm_stay_awake(epi);
-
-			/* Notify waiting tasks that events are available */
-			if (waitqueue_active(&ep->wq))
-				wake_up_locked(&ep->wq);
-			if (waitqueue_active(&ep->poll_wait))
-				pwake++;
-		}
-		spin_unlock_irq(&ep->lock);
-	}
-
-	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
+	if (revents & event->events)
+		ep_enqueue_process(ep, epi);
 
 	return 0;
 }
 
-static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
-			       void *priv)
+static int ep_send_events(struct eventpoll *ep,
+			  struct epoll_event __user *uevent, int maxevents)
 {
-	struct ep_send_events_data *esed = priv;
-	int eventcnt;
+	int eventcnt = 0;
 	unsigned int revents;
 	struct epitem *epi;
-	struct epoll_event __user *uevent;
 	struct wakeup_source *ws;
+	struct wfcq_node *node, *n;
+	enum epoll_item_state state;
 	poll_table pt;
 
 	init_poll_funcptr(&pt, NULL);
 
 	/*
-	 * We can loop without lock because we are passed a task private list.
-	 * Items cannot vanish during the loop because ep_scan_ready_list() is
-	 * holding "mtx" during this call.
+	 * Items cannot vanish during the loop because we are holding
+	 * "mtx" during this call.
 	 */
-	for (eventcnt = 0, uevent = esed->events;
-	     !list_empty(head) && eventcnt < esed->maxevents;) {
-		epi = list_first_entry(head, struct epitem, rdllink);
+	__wfcq_for_each_safe(&ep->txlhead, &ep->txltail, node, n) {
+		epi = ep_item_from_node(node);
+		__wfcq_dequeue(&ep->txlhead, &ep->txltail);
+
+		/* no barrier needed, splicing txl should be enough */
+		state = epi->state;
+
+		if (state == EP_STATE_ZOMBIE) {
+			ep_reap(ep, epi);
+			continue;
+		}
+		WARN_ON(state != EP_STATE_READY);
+		wfcq_node_init(&epi->rdllink);
 
 		/*
 		 * Activate ep->ws before deactivating epi->ws to prevent
@@ -1459,7 +1394,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
 		 *
 		 * This could be rearranged to delay the deactivation of epi->ws
 		 * instead, but then epi->ws would temporarily be out of sync
-		 * with ep_is_linked().
+		 * with epi->state.
 		 */
 		ws = ep_wakeup_source(epi);
 		if (ws) {
@@ -1468,25 +1403,29 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
 			__pm_relax(ws);
 		}
 
-		list_del_init(&epi->rdllink);
-
 		revents = ep_item_poll(epi, &pt);
 
 		/*
 		 * If the event mask intersect the caller-requested one,
-		 * deliver the event to userspace. Again, ep_scan_ready_list()
+		 * deliver the event to userspace. Again, ep_poll()
 		 * is holding "mtx", so no operations coming from userspace
 		 * can change the item.
 		 */
 		if (revents) {
 			if (__put_user(revents, &uevent->events) ||
 			    __put_user(epi->event.data, &uevent->data)) {
-				list_add(&epi->rdllink, head);
+				wfcq_enqueue(&ep->txlhead, &ep->txltail,
+							&epi->rdllink);
 				ep_pm_stay_awake(epi);
-				return eventcnt ? eventcnt : -EFAULT;
+				if (!eventcnt)
+					eventcnt = -EFAULT;
+				break;
 			}
-			eventcnt++;
+
 			uevent++;
+			if (++eventcnt == maxevents)
+				n = NULL; /* stop iteration */
+
 			if (epi->event.events & EPOLLONESHOT)
 				epi->event.events &= EP_PRIVATE_BITS;
 			else if (!(epi->event.events & EPOLLET)) {
@@ -1495,32 +1434,25 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
 				 * Trigger mode, we need to insert back inside
 				 * the ready list, so that the next call to
 				 * epoll_wait() will check again the events
-				 * availability. At this point, no one can insert
-				 * into ep->rdllist besides us. The epoll_ctl()
-				 * callers are locked out by
-				 * ep_scan_ready_list() holding "mtx" and the
-				 * poll callback will queue them in ep->ovflist.
+				 * availability.
 				 */
-				list_add_tail(&epi->rdllink, &ep->rdllist);
+				wfcq_enqueue(&ep->rdlhead, &ep->rdltail,
+							&epi->rdllink);
 				ep_pm_stay_awake(epi);
+				continue;
 			}
 		}
+
+		/*
+		 * reset item state for EPOLLONESHOT and EPOLLET
+		 * no barrier here, rely on ep->mtx release for write barrier
+		 */
+		epi->state = EP_STATE_IDLE;
 	}
 
 	return eventcnt;
 }
 
-static int ep_send_events(struct eventpoll *ep,
-			  struct epoll_event __user *events, int maxevents)
-{
-	struct ep_send_events_data esed;
-
-	esed.maxevents = maxevents;
-	esed.events = events;
-
-	return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0);
-}
-
 static inline struct timespec ep_set_mstimeout(long ms)
 {
 	struct timespec now, ts = {
@@ -1552,11 +1484,12 @@ static inline struct timespec ep_set_mstimeout(long ms)
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		   int maxevents, long timeout)
 {
-	int res = 0, eavail, timed_out = 0;
+	int res = 0;
 	unsigned long flags;
 	long slack = 0;
 	wait_queue_t wait;
 	ktime_t expires, *to = NULL;
+	bool eavail;
 
 	if (timeout > 0) {
 		struct timespec end_time = ep_set_mstimeout(timeout);
@@ -1564,27 +1497,23 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		slack = select_estimate_accuracy(&end_time);
 		to = &expires;
 		*to = timespec_to_ktime(end_time);
-	} else if (timeout == 0) {
-		/*
-		 * Avoid the unnecessary trip to the wait queue loop, if the
-		 * caller specified a non blocking operation.
-		 */
-		timed_out = 1;
-		spin_lock_irqsave(&ep->lock, flags);
-		goto check_events;
 	}
 
-fetch_events:
-	spin_lock_irqsave(&ep->lock, flags);
+	mutex_lock(&ep->mtx);
+	eavail = ep_events_available(ep);
+
+	if (!eavail && timeout) {
+wait_queue_loop:
 
-	if (!ep_events_available(ep)) {
 		/*
 		 * We don't have any available event to return to the caller.
 		 * We need to sleep here, and we will be wake up by
 		 * ep_poll_callback() when events will become available.
 		 */
 		init_waitqueue_entry(&wait, current);
+		spin_lock_irqsave(&ep->wqlock, flags);
 		__add_wait_queue_exclusive(&ep->wq, &wait);
+		spin_unlock_irqrestore(&ep->wqlock, flags);
 
 		for (;;) {
 			/*
@@ -1593,37 +1522,45 @@ fetch_events:
 			 * to TASK_INTERRUPTIBLE before doing the checks.
 			 */
 			set_current_state(TASK_INTERRUPTIBLE);
-			if (ep_events_available(ep) || timed_out)
+			eavail = ep_events_available(ep);
+			if (eavail || !timeout)
 				break;
 			if (signal_pending(current)) {
 				res = -EINTR;
 				break;
 			}
+			mutex_unlock(&ep->mtx);
 
-			spin_unlock_irqrestore(&ep->lock, flags);
 			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
-				timed_out = 1;
+				timeout = 0;
 
-			spin_lock_irqsave(&ep->lock, flags);
+			mutex_lock(&ep->mtx);
 		}
+
+		spin_lock_irqsave(&ep->wqlock, flags);
 		__remove_wait_queue(&ep->wq, &wait);
+		spin_unlock_irqrestore(&ep->wqlock, flags);
 
 		set_current_state(TASK_RUNNING);
 	}
-check_events:
-	/* Is it worth to try to dig for events ? */
-	eavail = ep_events_available(ep);
-
-	spin_unlock_irqrestore(&ep->lock, flags);
 
 	/*
 	 * Try to transfer events to user space. In case we get 0 events and
 	 * there's still timeout left over, we go trying again in search of
 	 * more luck.
 	 */
-	if (!res && eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
-		goto fetch_events;
+	if (!res) {
+		res = ep_send_events(ep, events, maxevents);
+		if (!res && timeout)
+			goto wait_queue_loop;
+	}
+
+	eavail = ep_events_available(ep);
+	mutex_unlock(&ep->mtx);
+
+	/* we may not have transferred everything, wake up others */
+	if (eavail)
+		ep_notify_waiters(ep);
 
 	return res;
 }
-- 
Eric Wong


* [RFC v3 2/2] epoll: use local wfcq functions for Level Trigger
From: Eric Wong @ 2013-03-21 11:54 UTC
  To: linux-kernel
  Cc: Davide Libenzi, Al Viro, Andrew Morton, Mathieu Desnoyers, linux-fsdevel

Using Davide's epwbench to test, this mostly fixes a performance
regression with level trigger in a single-threaded workload.
Single-threaded level trigger is probably the most common method
of using epoll, so performance is important.

Before (without wfcq at all):
AVG: 5.448400
SIG: 0.003056

Before (with wfcq):
AVG: 7.735375
SIG: 0.000064

Currently (with wfcq local):
AVG: 5.532024
SIG: 0.000244

The margin of error for my tests is pretty high under KVM.  I'm getting
wildly varying (~0.10s) differences between runs on the same boot.
However, the difference between using plain wfcq and these new _local
functions is large and outside the margin of error.

ref: http://www.xmailserver.org/epwbench.c

Somewhat-tested-by: Eric Wong <normalperson@yhbt.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/eventpoll.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e039555..1e04175 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1356,7 +1356,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
 	return 0;
 }
 
-static int ep_send_events(struct eventpoll *ep,
+static int ep_send_events(struct eventpoll *ep, bool *eavail,
 			  struct epoll_event __user *uevent, int maxevents)
 {
 	int eventcnt = 0;
@@ -1366,7 +1366,10 @@ static int ep_send_events(struct eventpoll *ep,
 	struct wfcq_node *node, *n;
 	enum epoll_item_state state;
 	poll_table pt;
+	struct wfcq_head lthead;
+	struct wfcq_tail lttail;
 
+	wfcq_init(&lthead, &lttail);
 	init_poll_funcptr(&pt, NULL);
 
 	/*
@@ -1414,7 +1417,7 @@ static int ep_send_events(struct eventpoll *ep,
 		if (revents) {
 			if (__put_user(revents, &uevent->events) ||
 			    __put_user(epi->event.data, &uevent->data)) {
-				wfcq_enqueue(&ep->txlhead, &ep->txltail,
+				wfcq_enqueue_local(&ep->txlhead, &ep->txltail,
 							&epi->rdllink);
 				ep_pm_stay_awake(epi);
 				if (!eventcnt)
@@ -1436,7 +1439,7 @@ static int ep_send_events(struct eventpoll *ep,
 				 * epoll_wait() will check again the events
 				 * availability.
 				 */
-				wfcq_enqueue(&ep->rdlhead, &ep->rdltail,
+				wfcq_enqueue_local(&lthead, &lttail,
 							&epi->rdllink);
 				ep_pm_stay_awake(epi);
 				continue;
@@ -1450,6 +1453,14 @@ static int ep_send_events(struct eventpoll *ep,
 		epi->state = EP_STATE_IDLE;
 	}
 
+	/* grab any events we got while copying */
+	*eavail = ep_events_available(ep);
+
+	/* requeue level-triggered items */
+	if (__wfcq_splice(&ep->txlhead, &ep->txltail, &lthead, &lttail)
+	    != WFCQ_RET_SRC_EMPTY)
+		*eavail = true;
+
 	return eventcnt;
 }
 
@@ -1550,12 +1561,11 @@ wait_queue_loop:
 	 * more luck.
 	 */
 	if (!res) {
-		res = ep_send_events(ep, events, maxevents);
+		res = ep_send_events(ep, &eavail, events, maxevents);
 		if (!res && timeout)
 			goto wait_queue_loop;
 	}
 
-	eavail = ep_events_available(ep);
 	mutex_unlock(&ep->mtx);
 
 	/* we may not have transferred everything, wake up others */
-- 
Eric Wong


* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
From: Eric Wong @ 2013-03-21 12:01 UTC
  To: linux-kernel
  Cc: Davide Libenzi, Al Viro, Andrew Morton, Mathieu Desnoyers, linux-fsdevel

Eric Wong <normalperson@yhbt.net> wrote:
> This is still not a proper commit; I've lightly tested this.

Btw, full series here (already sent to LKML and some in -mm)
http://yhbt.net/epoll-wfcqueue-v3.8.3-20130321.mbox
(should apply cleanly to all 3.8/3.9 kernels)


* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
From: Arve Hjønnevåg @ 2013-03-22  1:39 UTC
  To: Eric Wong
  Cc: linux-kernel, Davide Libenzi, Al Viro, Andrew Morton,
	Mathieu Desnoyers, linux-fsdevel

On Thu, Mar 21, 2013 at 4:52 AM, Eric Wong <normalperson@yhbt.net> wrote:
> This is still not a proper commit; I've lightly tested this.
>
> Replace the spinlock-protected linked list ready list with wfcqueue.
>
> This improves performance under heavy, multi-threaded workloads with
> multiple threads calling epoll_wait.
>
> Using my multi-threaded EPOLLONESHOT microbenchmark, performance is
> nearly doubled:
>
> $ eponeshotmt -t 4 -w 4 -f 10 -c 1000000
>
> Before:
> real    0m 9.58s
> user    0m 1.22s
> sys     0m 37.08s
>
> After:
> real    0m 5.00s
> user    0m 1.07s
> sys     0m 18.92s
>
> ref: http://yhbt.net/eponeshotmt.c
>
> Unfortunately, there are still regressions for the common,
> single threaded, Level Trigger use case.
>
> Things changed/removed:
>
> * ep->ovflist - is no longer needed, the main ready list continues
>   to be appended to while we iterate through the transaction list.
>
> * ep_scan_ready_list - not enough generic code between users
>   anymore to warrant this.  ep_poll_readyevents_proc (used for
>   poll) is read-only, using __wfcq_for_each, while
>   ep_send_events (used for epoll_wait) dequeues and needs
>   __wfcq_for_each_safe.
>
> * ep->lock renamed to ep->wqlock; this only protects waitqueues now
>   (we use trylock to further avoid redundant wakeups)
>
> * EPOLL_CTL_DEL/close() on a ready file will not immediately release
>   epitem memory, epoll_wait() must be called since there's no way to
>   delete a ready item from wfcqueue in O(1) time.  In practice this
>   should not be a problem; any valid app using epoll must call
>   epoll_wait occasionally.  Unfreed epitems still count against
>   max_user_watches to protect against local DoS.  This should be the
>   only possibly-noticeable change (in case there's an app that blindly
>   adds/deletes things from the rbtree but never calls epoll_wait)
>
> Changes since v1:
> * fixed memory leak with pre-existing zombies in ep_free
> * updated to use the latest wfcqueue.h APIs
> * switched to using __wfcq_splice and a global transaction list
>   (this is like the old txlist in ep_scan_ready_list)
> * redundant wakeups avoided in ep_notify_waiters:
>   - only attempt a wakeup when an item is enqueued the first time
>   - use spin_trylock_irqsave when attempting notification, since a
>     failed lock means either another task is already waking, or
>     ep_poll is already running and will check anyway upon releasing
>     wqlock.
> * explicitly cache-aligned rdltail in SMP
> * added ep_item_state for atomically reading epi->state with barrier
>   (avoids WARN_ON in ep_send_events)
> * reverted epi->nwait removal, it was not necessary
>   sizeof(epitem) is still <= 128 bytes on 64-bit machines
>
> Changes since v2:
> * epi->state is no longer atomic, we only cmpxchg in ep_poll_callback
>   now and rely on implicit barriers in other places for reading.
> * intermediate EP_STATE_DEQUEUE removed, this (with xchg) caused too
>   much overhead in the ep_send_events loop and could not eliminate
>   starvation dangers from improper EPOLLET usage (the original code
>   had this problem, too, the window is just a few cycles larger, now).
> * minor code cleanups
>
> Lightly-tested-by: Eric Wong <normalperson@yhbt.net>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> Cc: Al Viro <viro@ZenIV.linux.org.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> ---
>
...
> @@ -967,8 +951,6 @@ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
>   */
>  static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
>  {
> -       int pwake = 0;
> -       unsigned long flags;
>         struct epitem *epi = ep_item_from_wait(wait);
>         struct eventpoll *ep = epi->ep;
>
> @@ -983,7 +965,8 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>                 list_del_init(&wait->task_list);
>         }
>
> -       spin_lock_irqsave(&ep->lock, flags);
> +       /* pairs with smp_mb in ep_modify */
> +       smp_rmb();
>
>         /*
>          * If the event mask does not contain any poll(2) event, we consider the
> @@ -992,7 +975,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>          * until the next EPOLL_CTL_MOD will be issued.
>          */
>         if (!(epi->event.events & ~EP_PRIVATE_BITS))
> -               goto out_unlock;
> +               return 1;
>
>         /*
>          * Check the events coming with the callback. At this stage, not
> @@ -1001,52 +984,14 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>          * test for "key" != NULL before the event match test.
>          */
>         if (key && !((unsigned long) key & epi->event.events))
> -               goto out_unlock;
> -
> -       /*
> -        * If we are transferring events to userspace, we can hold no locks
> -        * (because we're accessing user memory, and because of linux f_op->poll()
> -        * semantics). All the events that happen during that period of time are
> -        * chained in ep->ovflist and requeued later on.
> -        */
> -       if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
> -               if (epi->next == EP_UNACTIVE_PTR) {
> -                       epi->next = ep->ovflist;
> -                       ep->ovflist = epi;
> -                       if (epi->ws) {
> -                               /*
> -                                * Activate ep->ws since epi->ws may get
> -                                * deactivated at any time.
> -                                */
> -                               __pm_stay_awake(ep->ws);
> -                       }
> +               return 1;
>
> -               }
> -               goto out_unlock;
> -       }
> -
> -       /* If this file is already in the ready list we exit soon */
> -       if (!ep_is_linked(&epi->rdllink)) {
> -               list_add_tail(&epi->rdllink, &ep->rdllist);
> +       if (ep_mark_ready(epi)) {
> +               wfcq_enqueue(&ep->rdlhead, &ep->rdltail, &epi->rdllink);
>                 ep_pm_stay_awake_rcu(epi);
> +               ep_notify_waiters(ep);
>         }
>
> -       /*
> -        * Wake up ( if active ) both the eventpoll wait list and the ->poll()
> -        * wait list.
> -        */
> -       if (waitqueue_active(&ep->wq))
> -               wake_up_locked(&ep->wq);
> -       if (waitqueue_active(&ep->poll_wait))
> -               pwake++;
> -
> -out_unlock:
> -       spin_unlock_irqrestore(&ep->lock, flags);
> -
> -       /* We have to call this outside the lock */
> -       if (pwake)
> -               ep_poll_safewake(&ep->poll_wait);
> -
>         return 1;
>  }
>
...
> -static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
> -                              void *priv)
> +static int ep_send_events(struct eventpoll *ep,
> +                         struct epoll_event __user *uevent, int maxevents)
>  {
> -       struct ep_send_events_data *esed = priv;
> -       int eventcnt;
> +       int eventcnt = 0;
>         unsigned int revents;
>         struct epitem *epi;
> -       struct epoll_event __user *uevent;
>         struct wakeup_source *ws;
> +       struct wfcq_node *node, *n;
> +       enum epoll_item_state state;
>         poll_table pt;
>
>         init_poll_funcptr(&pt, NULL);
>
>         /*
> -        * We can loop without lock because we are passed a task private list.
> -        * Items cannot vanish during the loop because ep_scan_ready_list() is
> -        * holding "mtx" during this call.
> +        * Items cannot vanish during the loop because we are holding
> +        * "mtx" during this call.
>          */
> -       for (eventcnt = 0, uevent = esed->events;
> -            !list_empty(head) && eventcnt < esed->maxevents;) {
> -               epi = list_first_entry(head, struct epitem, rdllink);
> +       __wfcq_for_each_safe(&ep->txlhead, &ep->txltail, node, n) {
> +               epi = ep_item_from_node(node);
> +               __wfcq_dequeue(&ep->txlhead, &ep->txltail);
> +
> +               /* no barrier needed, splicing txl should be enough */
> +               state = epi->state;
> +
> +               if (state == EP_STATE_ZOMBIE) {
> +                       ep_reap(ep, epi);
> +                       continue;
> +               }
> +               WARN_ON(state != EP_STATE_READY);
> +               wfcq_node_init(&epi->rdllink);
>
>                 /*
>                  * Activate ep->ws before deactivating epi->ws to prevent

Does anything deactivate ep->ws now?

> @@ -1459,7 +1394,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
>                  *
>                  * This could be rearranged to delay the deactivation of epi->ws
>                  * instead, but then epi->ws would temporarily be out of sync
> -                * with ep_is_linked().
> +                * with epi->state.
>                  */
>                 ws = ep_wakeup_source(epi);
>                 if (ws) {
> @@ -1468,25 +1403,29 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
>                         __pm_relax(ws);
>                 }
>
> -               list_del_init(&epi->rdllink);
> -
>                 revents = ep_item_poll(epi, &pt);
>
>                 /*
>                  * If the event mask intersect the caller-requested one,
> -                * deliver the event to userspace. Again, ep_scan_ready_list()
> +                * deliver the event to userspace. Again, ep_poll()
>                  * is holding "mtx", so no operations coming from userspace
>                  * can change the item.
>                  */
>                 if (revents) {
>                         if (__put_user(revents, &uevent->events) ||
>                             __put_user(epi->event.data, &uevent->data)) {
> -                               list_add(&epi->rdllink, head);
> +                               wfcq_enqueue(&ep->txlhead, &ep->txltail,
> +                                                       &epi->rdllink);
>                                 ep_pm_stay_awake(epi);
> -                               return eventcnt ? eventcnt : -EFAULT;
> +                               if (!eventcnt)
> +                                       eventcnt = -EFAULT;
> +                               break;
>                         }
> -                       eventcnt++;
> +
>                         uevent++;
> +                       if (++eventcnt == maxevents)
> +                               n = NULL; /* stop iteration */
> +
>                         if (epi->event.events & EPOLLONESHOT)
>                                 epi->event.events &= EP_PRIVATE_BITS;
>                         else if (!(epi->event.events & EPOLLET)) {
> @@ -1495,32 +1434,25 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
>                                  * Trigger mode, we need to insert back inside
>                                  * the ready list, so that the next call to
>                                  * epoll_wait() will check again the events
> -                                * availability. At this point, no one can insert
> -                                * into ep->rdllist besides us. The epoll_ctl()
> -                                * callers are locked out by
> -                                * ep_scan_ready_list() holding "mtx" and the
> -                                * poll callback will queue them in ep->ovflist.
> +                                * availability.
>                                  */
> -                               list_add_tail(&epi->rdllink, &ep->rdllist);
> +                               wfcq_enqueue(&ep->rdlhead, &ep->rdltail,
> +                                                       &epi->rdllink);
>                                 ep_pm_stay_awake(epi);
> +                               continue;
>                         }
>                 }
> +
> +               /*
> +                * reset item state for EPOLLONESHOT and EPOLLET
> +                * no barrier here, rely on ep->mtx release for write barrier
> +                */

What happens if ep_poll_callback runs before you set epi->state below?
It used to queue on ep->ovflist and call __pm_stay_awake on ep->ws,
but now it does not appear to do anything.

> +               epi->state = EP_STATE_IDLE;
>         }
>
>         return eventcnt;
>  }
>
...

-- 
Arve Hjønnevåg


* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
From: Eric Wong @ 2013-03-22  3:24 UTC
  To: Arve Hjønnevåg
  Cc: linux-kernel, Davide Libenzi, Al Viro, Andrew Morton,
	Mathieu Desnoyers, linux-fsdevel

Arve Hjønnevåg <arve@android.com> wrote:
> On Thu, Mar 21, 2013 at 4:52 AM, Eric Wong <normalperson@yhbt.net> wrote:
> > Changes since v2:
> > * epi->state is no longer atomic, we only cmpxchg in ep_poll_callback
> >   now and rely on implicit barriers in other places for reading.
> > * intermediate EP_STATE_DEQUEUE removed, this (with xchg) caused too
> >   much overhead in the ep_send_events loop and could not eliminate
> >   starvation dangers from improper EPOLLET usage (the original code
> >   had this problem, too, the window is just a few cycles larger, now).
> > * minor code cleanups

> >                 /*
> >                  * Activate ep->ws before deactivating epi->ws to prevent
> 
> Does anything deactivate ep->ws now?

Oops, I left that out when I killed ep_scan_ready_list.
But I think we need a different approach to wakeup sources in
this series...

> > +               /*
> > +                * reset item state for EPOLLONESHOT and EPOLLET
> > +                * no barrier here, rely on ep->mtx release for write barrier
> > +                */
> 
> What happens if ep_poll_callback runs before you set epi->state below?
> It used to queue on ep->ovflist and call __pm_stay_awake on ep->ws,
> but now it does not appear to do anything.
> 
> > +               epi->state = EP_STATE_IDLE;
> >         }
> >
> >         return eventcnt;
> >  }
> >

With EPOLLET and improper usage (not hitting EAGAIN), the event now
has a larger window to be lost (as mentioned in my changelog).

As for correct __pm_stay_awake/__pm_relax handling, perhaps adding
an atomic counter to struct eventpoll (or each epitem) will work?

If we go with atomic counter in struct eventpoll, is per-epitem
wakeup_source still necessary?  We have space in epitem now, but
maybe one day we might need it.

Thanks for looking at this patch.

Btw, I'm curious; which applications use EPOLLWAKEUP?

My epoll work is focused on network servers with thousands of clients,
and I don't think any of them use (or have use for) EPOLLWAKEUP.
But I will keep EPOLLWAKEUP users in mind when working on epoll :)


* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
  2013-03-22  3:24     ` Eric Wong
@ 2013-03-22  4:07       ` Arve Hjønnevåg
  -1 siblings, 0 replies; 15+ messages in thread
From: Arve Hjønnevåg @ 2013-03-22  4:07 UTC (permalink / raw)
  To: Eric Wong
  Cc: linux-kernel, Davide Libenzi, Al Viro, Andrew Morton,
	Mathieu Desnoyers, linux-fsdevel

On Thu, Mar 21, 2013 at 8:24 PM, Eric Wong <normalperson@yhbt.net> wrote:
> Arve Hjønnevåg <arve@android.com> wrote:
>> On Thu, Mar 21, 2013 at 4:52 AM, Eric Wong <normalperson@yhbt.net> wrote:
>> > Changes since v2:
>> > * epi->state is no longer atomic, we only cmpxchg in ep_poll_callback
>> >   now and rely on implicit barriers in other places for reading.
>> > * intermediate EP_STATE_DEQUEUE removed, this (with xchg) caused too
>> >   much overhead in the ep_send_events loop and could not eliminate
>> >   starvation dangers from improper EPOLLET usage (the original code
>> >   had this problem, too, the window is just a few cycles larger, now).
>> > * minor code cleanups
>
>> >                 /*
>> >                  * Activate ep->ws before deactivating epi->ws to prevent
>>
>> Does anything deactivate ep->ws now?
>
> Oops, I left that out when I killed ep_scan_ready_list.
> But I think we need a different approach to wakeup sources in
> this series...
>
>> > +               /*
>> > +                * reset item state for EPOLLONESHOT and EPOLLET
>> > +                * no barrier here, rely on ep->mtx release for write barrier
>> > +                */
>>
>> What happens if ep_poll_callback runs before you set epi->state below?
>> It used to queue on ep->ovflist and call __pm_stay_awake on ep->ws,
>> but now it does not appear to do anything.
>>
>> > +               epi->state = EP_STATE_IDLE;
>> >         }
>> >
>> >         return eventcnt;
>> >  }
>> >
>
> With EPOLLET and improper usage (not hitting EAGAIN), the event now
> has a larger window to be lost (as mentioned in my changelog).
>

What about the case where EPOLLET is not set? The old code did not
drop events in that case.

> As far as correct __pm_stay_awake/__pm_relax handling, perhaps adding
> an atomic counter to struct eventpoll (or each epitem) will work?
>

The wakeup_source should stay in sync with the epoll state. I don't
think any additional state is needed.

> If we go with an atomic counter in struct eventpoll, is a per-epitem
> wakeup_source still necessary?  We have space in epitem now, but
> maybe one day we might need it.
>

The wakeup_source per epitem is useful for accounting reasons. If
suspend fails, it is useful to know which device caused it.

> Thanks for looking at this patch.
>
> Btw, I'm curious: which applications use EPOLLWAKEUP?
>
> My epoll work is focused on network servers with thousands of clients,
> and I don't think any of them use (or have use for) EPOLLWAKEUP.
> But I will keep EPOLLWAKEUP users in mind when working on epoll :)

EPOLLWAKEUP is only needed on systems that use suspend. I don't know
if it is currently in use, but it is intended to at least replace the
evdev wakelock in the Android kernel; user-space needs to be updated
before we can drop that patch.

-- 
Arve Hjønnevåg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
  2013-03-22  4:07       ` Arve Hjønnevåg
  (?)
@ 2013-03-22 10:31       ` Eric Wong
  2013-03-22 19:24         ` Eric Wong
  2013-03-22 22:18           ` Arve Hjønnevåg
  -1 siblings, 2 replies; 15+ messages in thread
From: Eric Wong @ 2013-03-22 10:31 UTC (permalink / raw)
  To: Arve Hjønnevåg
  Cc: linux-kernel, Davide Libenzi, Al Viro, Andrew Morton,
	Mathieu Desnoyers, linux-fsdevel

Arve Hjønnevåg <arve@android.com> wrote:
> On Thu, Mar 21, 2013 at 8:24 PM, Eric Wong <normalperson@yhbt.net> wrote:
> >
> > With EPOLLET and improper usage (not hitting EAGAIN), the event now
> > has a larger window to be lost (as mentioned in my changelog).
> >
> 
> What about the case where EPOLLET is not set? The old code did not
> drop events in that case.

Nothing is dropped; if the event wasn't on the ready list before,
ep_poll_callback may still append to the ready list while __put_user
is running.

If the event was on the ready list:

1) It does not matter for EPOLLONESHOT; it'll get masked out and
   discarded by subsequent ep_send_events calls until ep_modify
   reenables it (userspace sketch below).  Since ep_modify and
   ep_send_events both take ep->mtx, there's no conflict.

2) Level Trigger - event stays ready, so nothing is dropped.
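
For reference, the userspace side of 1) as a minimal sketch
(ep_modify is what EPOLL_CTL_MOD ends up calling; error handling
omitted):

	struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT };

	ev.data.fd = fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

	/* after epoll_wait reports fd once, the item stays masked
	 * out until userspace re-arms it: */
	epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);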

> > As far as correct __pm_stay_awake/__pm_relax handling, perhaps adding
> > an atomic counter to struct eventpoll (or each epitem) will work?
> 
> The wakeup_source should stay in sync with the epoll state. I don't
> think any additional state is needed.

The problem is that epi->state is not set atomically in ep_send_events.

Having atomic operations in the loop hurts performance (early versions
of this patch did that, and hurt the single-threaded case).

Maybe I'll only set epi->state atomically if epi->ws is used...
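
Something like this at the bottom of the ep_send_events loop, perhaps
(untested sketch):

	ws = ep_wakeup_source(epi);
	if (ws) {
		__pm_relax(ws);
		/* full barrier: ep_poll_callback must see the
		 * relaxed ws before it can see EP_STATE_IDLE */
		xchg(&epi->state, EP_STATE_IDLE);
	} else {
		/* no ws; rely on ep->mtx release as before */
		epi->state = EP_STATE_IDLE;
	}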

> > If we go with an atomic counter in struct eventpoll, is a per-epitem
> > wakeup_source still necessary?  We have space in epitem now, but
> > maybe one day we might need it.
> >
> 
> The wakeup_source per epitem is useful for accounting reasons. If
> suspend fails, it is useful to know which device caused it.

OK.  I'll keep epitem->ws.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
  2013-03-22 10:31       ` Eric Wong
@ 2013-03-22 19:24         ` Eric Wong
  2013-03-22 22:27           ` Arve Hjønnevåg
  2013-03-22 22:18           ` Arve Hjønnevåg
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Wong @ 2013-03-22 19:24 UTC (permalink / raw)
  To: Arve Hjønnevåg
  Cc: linux-kernel, Davide Libenzi, Al Viro, Andrew Morton,
	Mathieu Desnoyers, linux-fsdevel

Eric Wong <normalperson@yhbt.net> wrote:
> Arve Hjønnevåg <arve@android.com> wrote:
> > On Thu, Mar 21, 2013 at 8:24 PM, Eric Wong <normalperson@yhbt.net> wrote:
> > >
> > > With EPOLLET and improper usage (not hitting EAGAIN), the event now
> > > has a larger window to be lost (as mentioned in my changelog).
> > >
> > 
> > What about the case where EPOLLET is not set? The old code did not
> > drop events in that case.
> 
> Nothing is dropped; if the event wasn't on the ready list before,
> ep_poll_callback may still append to the ready list while __put_user
> is running.
>
> If the event was on the ready list:
>
> 1) It does not matter for EPOLLONESHOT; it'll get masked out and
>    discarded by subsequent ep_send_events calls until ep_modify
>    reenables it.  Since ep_modify and ep_send_events both take
>    ep->mtx, there's no conflict.
> 
> 2) Level Trigger - event stays ready, so nothing is dropped.
> 
> > > As far as correct __pm_stay_awake/__pm_relax handling, perhaps adding
> > > an atomic counter to struct eventpoll (or each epitem) will work?
> > 
> > The wakeup_source should stay in sync with the epoll state. I don't
> > think any additional state is needed.
> 
> The problem is that epi->state is not set atomically in ep_send_events.
> 
> Having atomic operations in the loop hurts performance (early versions
> of this patch did that, and hurt the single-threaded case).
> 
> Maybe I'll only set epi->state atomically if epi->ws is used...
> 
> > If we go with an atomic counter in struct eventpoll, is a per-epitem
> > wakeup_source still necessary?  We have space in epitem now, but
> > maybe one day we might need it.
> > >
> > 
> > The wakeup_source per epitem is useful for accounting reasons. If
> > suspend fails, it is useful to know which device caused it.
> 
> OK.  I'll keep epitem->ws.

Perhaps just using epitem->ws and removing ep->ws can work.

I think the following change to keep wakeup_source in sync with
epi->state is sufficient to prevent suspend.

But I'm not familiar with suspend.  Is it possible to suspend while
a) spinning on a lock?
b) holding a spinlock?

Since we avoid spinlocks in the main ep_poll_callback path, maybe the
chance of entering suspend is reduced anyway, since we may activate
the ws sooner.

What do you think?

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e04175..531ad46 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -214,9 +214,6 @@ struct eventpoll {
 	/* RB tree root used to store monitored fd structs */
 	struct rb_root rbr;
 
-	/* wakeup_source used when ep_send_events is running */
-	struct wakeup_source *ws;
-
 	/* The user that created the eventpoll descriptor */
 	struct user_struct *user;
 
@@ -718,7 +715,6 @@ static void ep_free(struct eventpoll *ep)
 	mutex_unlock(&epmutex);
 	mutex_destroy(&ep->mtx);
 	free_uid(ep->user);
-	wakeup_source_unregister(ep->ws);
 	kfree(ep);
 }
 
@@ -1137,12 +1133,6 @@ static int ep_create_wakeup_source(struct epitem *epi)
 	const char *name;
 	struct wakeup_source *ws;
 
-	if (!epi->ep->ws) {
-		epi->ep->ws = wakeup_source_register("eventpoll");
-		if (!epi->ep->ws)
-			return -ENOMEM;
-	}
-
 	name = epi->ffd.file->f_path.dentry->d_name.name;
 	ws = wakeup_source_register(name);
 
@@ -1390,22 +1380,6 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
 		WARN_ON(state != EP_STATE_READY);
 		wfcq_node_init(&epi->rdllink);
 
-		/*
-		 * Activate ep->ws before deactivating epi->ws to prevent
-		 * triggering auto-suspend here (in case we reactive epi->ws
-		 * below).
-		 *
-		 * This could be rearranged to delay the deactivation of epi->ws
-		 * instead, but then epi->ws would temporarily be out of sync
-		 * with epi->state.
-		 */
-		ws = ep_wakeup_source(epi);
-		if (ws) {
-			if (ws->active)
-				__pm_stay_awake(ep->ws);
-			__pm_relax(ws);
-		}
-
 		revents = ep_item_poll(epi, &pt);
 
 		/*
@@ -1419,7 +1393,6 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
 			    __put_user(epi->event.data, &uevent->data)) {
 				wfcq_enqueue_local(&ep->txlhead, &ep->txltail,
 							&epi->rdllink);
-				ep_pm_stay_awake(epi);
 				if (!eventcnt)
 					eventcnt = -EFAULT;
 				break;
@@ -1441,13 +1414,34 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
 				 */
 				wfcq_enqueue_local(&lthead, &lttail,
 							&epi->rdllink);
-				ep_pm_stay_awake(epi);
 				continue;
 			}
 		}
 
 		/*
-		 * reset item state for EPOLLONESHOT and EPOLLET
+		 * Deactivate the wakeup source before marking it idle.
+		 * The barrier implied by the spinlock in __pm_relax ensures
+		 * any ep_poll_callback callers running will see the
+		 * deactivated ws before epi->state == EP_STATE_IDLE.
+		 *
+		 * For EPOLLET, the event may still be merged into the one
+		 * that is currently on its way into userspace, but it has
+		 * always been the responsibility of userspace to trigger
+		 * EAGAIN on the file before it expects the item to appear
+		 * again in epoll_wait.
+		 *
+		 * Level Trigger never gets here, so the ws remains active.
+		 *
+		 * EPOLLONESHOT will either be dropped by ep_poll_callback
+		 * or dropped the next time ep_send_events is called, so the
+		 * ws is irrelevant until it is hit by ep_modify
+		 */
+		ws = ep_wakeup_source(epi);
+		if (ws)
+			__pm_relax(ws);
+
+		/*
+		 * reset item state for EPOLLONESHOT and EPOLLET.
 		 * no barrier here, rely on ep->mtx release for write barrier
 		 */
 		epi->state = EP_STATE_IDLE;

-- 
Eric Wong

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
  2013-03-22 10:31       ` Eric Wong
@ 2013-03-22 22:18           ` Arve Hjønnevåg
  2013-03-22 22:18           ` Arve Hjønnevåg
  1 sibling, 0 replies; 15+ messages in thread
From: Arve Hjønnevåg @ 2013-03-22 22:18 UTC (permalink / raw)
  To: Eric Wong
  Cc: linux-kernel, Davide Libenzi, Al Viro, Andrew Morton,
	Mathieu Desnoyers, linux-fsdevel

On Fri, Mar 22, 2013 at 3:31 AM, Eric Wong <normalperson@yhbt.net> wrote:
> Arve Hjønnevåg <arve@android.com> wrote:
>> On Thu, Mar 21, 2013 at 8:24 PM, Eric Wong <normalperson@yhbt.net> wrote:
>> >
>> > With EPOLLET and improper usage (not hitting EAGAIN), the event now
>> > has a larger window to be lost (as mentioned in my changelog).
>> >
>>
>> What about the case where EPOLLET is not set? The old code did not
>> drop events in that case.
>
> Nothing is dropped; if the event wasn't on the ready list before,
> ep_poll_callback may still append to the ready list while __put_user
> is running.
>
> If the event was on the ready list:
>
> 1) It does not matter for EPOLLONESHOT; it'll get masked out and
>    discarded by subsequent ep_send_events calls until ep_modify
>    reenables it.  Since ep_modify and ep_send_events both take
>    ep->mtx, there's no conflict.
>
> 2) Level Trigger - event stays ready, so nothing is dropped.
>

At some point the level-triggered event has to get cleared. As far as
I can tell, your new code will drop new events that occur between
"revents = ep_item_poll(epi, &pt);" and "epi->state = EP_STATE_IDLE;"
in that case.
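
That is (my reading of the interleaving, annotated):

	/* ep_send_events */
	revents = ep_item_poll(epi, &pt);	/* -> 0, nothing to report */

		/* ep_poll_callback fires here: it sees
		 * state == EP_STATE_READY and queues nothing,
		 * assuming delivery is still in progress */

	/* ep_send_events */
	epi->state = EP_STATE_IDLE;		/* the new event is lost */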

-- 
Arve Hjønnevåg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
  2013-03-22 19:24         ` Eric Wong
@ 2013-03-22 22:27           ` Arve Hjønnevåg
  0 siblings, 0 replies; 15+ messages in thread
From: Arve Hjønnevåg @ 2013-03-22 22:27 UTC (permalink / raw)
  To: Eric Wong
  Cc: linux-kernel, Davide Libenzi, Al Viro, Andrew Morton,
	Mathieu Desnoyers, linux-fsdevel

On Fri, Mar 22, 2013 at 12:24 PM, Eric Wong <normalperson@yhbt.net> wrote:
...
> Perhaps just using epitem->ws and removing ep->ws can work.
>
> I think the following change to keep wakeup_source in sync with
> epi->state is sufficient to prevent suspend.
>
> But I'm not familiar with suspend.  Is it possible to suspend while
> a) spinning on a lock?
> b) holding a spinlock?
>

If your code cannot be preempted, suspend will not complete. However,
if you drop the last wakeup_source that was active, you may trigger a
new suspend attempt, so it is best to avoid this.
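
(The usual way to avoid that is an overlapping hand-off, sketched here
with hypothetical old_ws/new_ws:

	__pm_stay_awake(new_ws);	/* take the new reference first */
	__pm_relax(old_ws);		/* only then drop the old one */

so the number of active wakeup sources never drops to zero in between.)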

> Since we avoid spinlocks in the main ep_poll_callback path, maybe the
> chance of entering suspend is reduced anyways since we may activate
> the ws sooner.
>
> What do you think?
>

I think you should make sure ep_poll_callback does not return without
an active wakeup_source if EPOLLWAKEUP is set.
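
That is, roughly (a sketch; if I read your patch right,
ep_pm_stay_awake_rcu is already a no-op when the item has no ws):

	/* in ep_poll_callback: keep a wakeup_source active before
	 * returning, even when the item was already EP_STATE_READY
	 * and nothing gets re-queued */
	ep_pm_stay_awake_rcu(epi);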

> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 1e04175..531ad46 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -214,9 +214,6 @@ struct eventpoll {
>         /* RB tree root used to store monitored fd structs */
>         struct rb_root rbr;
>
> -       /* wakeup_source used when ep_send_events is running */
> -       struct wakeup_source *ws;
> -
>         /* The user that created the eventpoll descriptor */
>         struct user_struct *user;
>
> @@ -718,7 +715,6 @@ static void ep_free(struct eventpoll *ep)
>         mutex_unlock(&epmutex);
>         mutex_destroy(&ep->mtx);
>         free_uid(ep->user);
> -       wakeup_source_unregister(ep->ws);
>         kfree(ep);
>  }
>
> @@ -1137,12 +1133,6 @@ static int ep_create_wakeup_source(struct epitem *epi)
>         const char *name;
>         struct wakeup_source *ws;
>
> -       if (!epi->ep->ws) {
> -               epi->ep->ws = wakeup_source_register("eventpoll");
> -               if (!epi->ep->ws)
> -                       return -ENOMEM;
> -       }
> -
>         name = epi->ffd.file->f_path.dentry->d_name.name;
>         ws = wakeup_source_register(name);
>
> @@ -1390,22 +1380,6 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
>                 WARN_ON(state != EP_STATE_READY);
>                 wfcq_node_init(&epi->rdllink);
>
> -               /*
> -                * Activate ep->ws before deactivating epi->ws to prevent
> -                * triggering auto-suspend here (in case we reactive epi->ws
> -                * below).
> -                *
> -                * This could be rearranged to delay the deactivation of epi->ws
> -                * instead, but then epi->ws would temporarily be out of sync
> -                * with epi->state.
> -                */
> -               ws = ep_wakeup_source(epi);
> -               if (ws) {
> -                       if (ws->active)
> -                               __pm_stay_awake(ep->ws);
> -                       __pm_relax(ws);
> -               }
> -
>                 revents = ep_item_poll(epi, &pt);
>
>                 /*
> @@ -1419,7 +1393,6 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
>                             __put_user(epi->event.data, &uevent->data)) {
>                                 wfcq_enqueue_local(&ep->txlhead, &ep->txltail,
>                                                         &epi->rdllink);
> -                               ep_pm_stay_awake(epi);
>                                 if (!eventcnt)
>                                         eventcnt = -EFAULT;
>                                 break;
> @@ -1441,13 +1414,34 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
>                                  */
>                                 wfcq_enqueue_local(&lthead, &lttail,
>                                                         &epi->rdllink);
> -                               ep_pm_stay_awake(epi);
>                                 continue;
>                         }
>                 }
>
>                 /*
> -                * reset item state for EPOLLONESHOT and EPOLLET
> +                * Deactivate the wakeup source before marking it idle.
> +                * The barrier implied by the spinlock in __pm_relax ensures
> +                * any ep_poll_callback callers running will see the
> +                * deactivated ws before epi->state == EP_STATE_IDLE.
> +                *
> +                * For EPOLLET, the event may still be merged into the one
> +                * that is currently on its way into userspace, but it has
> +                * always been the responsibility of userspace to trigger
> +                * EAGAIN on the file before it expects the item to appear
> +                * again in epoll_wait.
> +                *
> +                * Level Trigger never gets here, so the ws remains active.

I don't think this statement is true.

> +                *
> +                * EPOLLONESHOT will either be dropped by ep_poll_callback
> +                * or dropped the next time ep_send_events is called, so the
> +                * ws is irrelevant until it is hit by ep_modify
> +                */
> +               ws = ep_wakeup_source(epi);
> +               if (ws)
> +                       __pm_relax(ws);
> +
> +               /*
> +                * reset item state for EPOLLONESHOT and EPOLLET.
>                  * no barrier here, rely on ep->mtx release for write barrier
>                  */
>                 epi->state = EP_STATE_IDLE;
>
> --
> Eric Wong



-- 
Arve Hjønnevåg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
  2013-03-22 22:18           ` Arve Hjønnevåg
  (?)
@ 2013-03-22 22:54           ` Eric Wong
  2013-03-23 10:16             ` Eric Wong
  -1 siblings, 1 reply; 15+ messages in thread
From: Eric Wong @ 2013-03-22 22:54 UTC (permalink / raw)
  To: Arve Hjønnevåg
  Cc: linux-kernel, Davide Libenzi, Al Viro, Andrew Morton,
	Mathieu Desnoyers, linux-fsdevel

Arve Hjønnevåg <arve@android.com> wrote:
> On Fri, Mar 22, 2013 at 3:31 AM, Eric Wong <normalperson@yhbt.net> wrote:
> > Arve Hjønnevåg <arve@android.com> wrote:
> >> On Thu, Mar 21, 2013 at 8:24 PM, Eric Wong <normalperson@yhbt.net> wrote:
> >> >
> >> > With EPOLLET and improper usage (not hitting EAGAIN), the event now
> >> > has a larger window to be lost (as mentioned in my changelog).
> >> >
> >>
> >> What about the case where EPOLLET is not set? The old code did not
> >> drop events in that case.
> >
> > Nothing is dropped; if the event wasn't on the ready list before,
> > ep_poll_callback may still append to the ready list while __put_user
> > is running.
> >
> > If the event was on the ready list:
> >
> > 1) It does not matter for EPOLLONESHOT; it'll get masked out and
> >    discarded by subsequent ep_send_events calls until ep_modify
> >    reenables it.  Since ep_modify and ep_send_events both take
> >    ep->mtx, there's no conflict.
> >
> > 2) Level Trigger - event stays ready, so nothing is dropped.
> >
> 
> At some point the level-triggered event has to get cleared. As far as
> I can tell, your new code will drop new events that occur between
> "revents = ep_item_poll(epi, &pt);" and "epi->state = EP_STATE_IDLE;"
> in that case.

Thanks for catching that; I'll need to fix it.  Maybe reintroduce
EP_STATE_DEQUEUE, but just for the (LT && !revents) case.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue
  2013-03-22 22:54           ` Eric Wong
@ 2013-03-23 10:16             ` Eric Wong
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Wong @ 2013-03-23 10:16 UTC (permalink / raw)
  To: Arve Hjønnevåg
  Cc: linux-kernel, Davide Libenzi, Al Viro, Andrew Morton,
	Mathieu Desnoyers, linux-fsdevel

Eric Wong <normalperson@yhbt.net> wrote:
> Arve Hjønnevåg <arve@android.com> wrote:
> > 
> > At some point the level-triggered event has to get cleared. As far as
> > I can tell, your new code will drop new events that occur between
> > "revents = ep_item_poll(epi, &pt);" and "epi->state = EP_STATE_IDLE;"
> > in that case.
> 
> Thanks for catching that; I'll need to fix it.  Maybe reintroduce
> EP_STATE_DEQUEUE, but just for the (LT && !revents) case.

I reintroduced ovflist (with less locking) instead.
I think the problem with !revents you pointed out affects non-LT as well.

Will post an updated series (including wfcq changes and some other
cleanups/fixes) this weekend.  Here's a work-in-progress on top of
my original [RFC v3 2/2].  Comments greatly appreciated, thanks.

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e04175..c3b2ad8 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -152,6 +152,12 @@ struct epitem {
 	/* List header used to link this structure to the eventpoll ready list */
 	struct wfcq_node rdllink;
 
+	/*
+	 * Works together with "struct eventpoll"->ovflist in keeping the
+	 * singly-linked chain of items.
+	 */
+	struct epitem *next;
+
 	/* The file descriptor information this item refers to */
 	struct epoll_filefd ffd;
 
@@ -226,6 +232,16 @@ struct eventpoll {
 	int visited;
 	struct list_head visited_list_link;
 
+	/* this only protects ovflist */
+	spinlock_t ovflock;
+
+	/*
+	 * This is a singly-linked list that chains all the "struct epitem"
+	 * that became ready while we were transferring events to userspace
+	 * without holding a lock.
+	 */
+	struct epitem *ovflist;
+
 	struct wfcq_tail rdltail ____cacheline_aligned_in_smp;
 };
 
@@ -890,12 +906,14 @@ static int ep_alloc(struct eventpoll **pep)
 		goto free_uid;
 
 	spin_lock_init(&ep->wqlock);
+	spin_lock_init(&ep->ovflock);
 	mutex_init(&ep->mtx);
 	init_waitqueue_head(&ep->wq);
 	init_waitqueue_head(&ep->poll_wait);
 	wfcq_init(&ep->rdlhead, &ep->rdltail);
 	wfcq_init(&ep->txlhead, &ep->txltail);
 	ep->rbr = RB_ROOT;
+	ep->ovflist = EP_UNACTIVE_PTR;
 	ep->user = user;
 
 	*pep = ep;
@@ -953,6 +971,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 {
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	unsigned long flags;
 
 	if ((unsigned long)key & POLLFREE) {
 		RCU_INIT_POINTER(ep_pwq_from_wait(wait)->whead, NULL);
@@ -990,7 +1009,31 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 		wfcq_enqueue(&ep->rdlhead, &ep->rdltail, &epi->rdllink);
 		ep_pm_stay_awake_rcu(epi);
 		ep_notify_waiters(ep);
+		return 1;
+	}
+
+	/*
+	 * If we are transferring events to userspace, we can hold no locks
+	 * (because we're accessing user memory, and because of linux f_op->poll()
+	 * semantics). All the events that happen during that period of time are
+	 * chained in ep->ovflist and requeued later on.
+	 */
+	spin_lock_irqsave(&ep->ovflock, flags);
+	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
+		if (epi->next == EP_UNACTIVE_PTR) {
+			epi->next = ep->ovflist;
+			ep->ovflist = epi;
+			if (epi->ws) {
+				/*
+				 * Activate ep->ws since epi->ws may get
+				 * deactivated at any time.
+				 */
+				__pm_stay_awake(ep->ws);
+			}
+		}
+		/* no need to wake up waiters, ep_send_events is already running */
 	}
+	spin_unlock_irqrestore(&ep->ovflock, flags);
 
 	return 1;
 }
@@ -1204,6 +1247,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 	epi->state = EP_STATE_IDLE;
 	ep_set_ffd(&epi->ffd, tfile, fd);
 	epi->nwait = 0;
+	epi->next = EP_UNACTIVE_PTR;
 	epi->event = *event;
 	if (epi->event.events & EPOLLWAKEUP) {
 		error = ep_create_wakeup_source(epi);
@@ -1356,6 +1400,61 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
 	return 0;
 }
 
+static void ep_ovf_enable(struct eventpoll *ep)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ep->ovflock, flags);
+	ep->ovflist = NULL;
+	spin_unlock_irqrestore(&ep->ovflock, flags);
+}
+
+/*
+ * Disables the ovf queue and re-injects events that went into the
+ * ovf queue into the txlist.
+ */
+static void ep_ovf_disable(struct eventpoll *ep)
+{
+	unsigned long flags;
+	struct epitem *epi, *nepi;
+
+	spin_lock_irqsave(&ep->ovflock, flags);
+	nepi = ep->ovflist;
+	ep->ovflist = EP_UNACTIVE_PTR;
+	spin_unlock_irqrestore(&ep->ovflock, flags);
+
+	/*
+	 * We can work on ovflist without the ovflock since we are certain
+	 * ep_poll_callback will not append to ovflist anymore.
+	 *
+	 * We still need ep->mtx to be held during this, though.
+	 *
+	 * During the time we spent inside the ep_send_events, some
+	 * other events might have been queued by the poll callback.
+	 * We re-insert them inside the main ready-list here.
+	 */
+	for (; (epi = nepi) != NULL;
+	     nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
+		/*
+		 * We need to check if the item is already in the list.
+		 * During the ep_send_events loop execution time, items are
+		 * queued into ->ovflist but the "txlhead/tail" might already
+		 * contain them, and the ep_mark_ready takes care of them
+		 */
+		if (!ep_mark_ready(epi))
+			continue;
+
+		wfcq_enqueue_local(&ep->txlhead, &ep->txltail, &epi->rdllink);
+		ep_pm_stay_awake(epi);
+	}
+
+	/*
+	 * We've now activated all the epi->ws we could not activate from
+	 * ep_poll_callback while ovflist was active; we may now relax ep->ws.
+	 */
+	__pm_relax(ep->ws);
+}
+
 static int ep_send_events(struct eventpoll *ep, bool *eavail,
 			  struct epoll_event __user *uevent, int maxevents)
 {
@@ -1372,6 +1471,8 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
 	wfcq_init(&lthead, &lttail);
 	init_poll_funcptr(&pt, NULL);
 
+	ep_ovf_enable(ep);
+
 	/*
 	 * Items cannot vanish during the loop because we are holding
 	 * "mtx" during this call.
@@ -1390,22 +1491,6 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
 		WARN_ON(state != EP_STATE_READY);
 		wfcq_node_init(&epi->rdllink);
 
-		/*
-		 * Activate ep->ws before deactivating epi->ws to prevent
-		 * triggering auto-suspend here (in case we reactive epi->ws
-		 * below).
-		 *
-		 * This could be rearranged to delay the deactivation of epi->ws
-		 * instead, but then epi->ws would temporarily be out of sync
-		 * with epi->state.
-		 */
-		ws = ep_wakeup_source(epi);
-		if (ws) {
-			if (ws->active)
-				__pm_stay_awake(ep->ws);
-			__pm_relax(ws);
-		}
-
 		revents = ep_item_poll(epi, &pt);
 
 		/*
@@ -1419,7 +1504,6 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
 			    __put_user(epi->event.data, &uevent->data)) {
 				wfcq_enqueue_local(&ep->txlhead, &ep->txltail,
 							&epi->rdllink);
-				ep_pm_stay_awake(epi);
 				if (!eventcnt)
 					eventcnt = -EFAULT;
 				break;
@@ -1441,19 +1525,28 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
 				 */
 				wfcq_enqueue_local(&lthead, &lttail,
 							&epi->rdllink);
-				ep_pm_stay_awake(epi);
 				continue;
 			}
 		}
 
 		/*
-		 * reset item state for EPOLLONESHOT and EPOLLET
-		 * no barrier here, rely on ep->mtx release for write barrier
+		 * Deactivate the wakeup source before marking it idle.
+		 * The barrier implied by the spinlock in __pm_relax ensures
+		 * any future ep_poll_callback callers running will see the
+		 * deactivated ws before epi->state == EP_STATE_IDLE.
 		 */
+		ws = ep_wakeup_source(epi);
+		if (ws)
+			__pm_relax(ws);
+
+		/* no barrier needed, wait until ep_ovf_disable */
 		epi->state = EP_STATE_IDLE;
 	}
 
-	/* grab any events we got while copying */
+	/* grab any (possibly redundant) events we got while copying */
+	ep_ovf_disable(ep);
+
+	/* see if we have any more items left (or very new ones) */
 	*eavail = ep_events_available(ep);
 
 	/* requeue level-triggered items */

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-03-23 10:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-21 11:52 [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue Eric Wong
2013-03-21 11:54 ` [RFC v3 2/2] epoll: use a local wfcq functions for Level Trigger Eric Wong
2013-03-21 12:01 ` [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue Eric Wong
2013-03-22  1:39 ` Arve Hjønnevåg
2013-03-22  3:24   ` Eric Wong
2013-03-22  3:24     ` Eric Wong
2013-03-22  4:07     ` Arve Hjønnevåg
2013-03-22  4:07       ` Arve Hjønnevåg
2013-03-22 10:31       ` Eric Wong
2013-03-22 19:24         ` Eric Wong
2013-03-22 22:27           ` Arve Hjønnevåg
2013-03-22 22:18         ` Arve Hjønnevåg
2013-03-22 22:18           ` Arve Hjønnevåg
2013-03-22 22:54           ` Eric Wong
2013-03-23 10:16             ` Eric Wong
