linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
@ 2018-12-03 11:02 Roman Penyaev
  2018-12-03 17:34 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Roman Penyaev @ 2018-12-03 11:02 UTC (permalink / raw)
  Cc: Roman Penyaev, Alexander Viro, Paul E. McKenney, Linus Torvalds,
	linux-fsdevel, linux-kernel

Hi all,

The goal of this patch is to reduce contention of ep_poll_callback(), which
can be called concurrently from different CPUs in case of high event rates
and many fds per epoll.  The problem can be reproduced very well by
generating events (writes to a pipe or eventfd) from many threads while a
consumer thread does the polling.  In other words, this patch increases the
bandwidth of events which can be delivered from sources to the poller by
adding poll items to the list in a lockless way.

The main change is the replacement of the spinlock with an rwlock, which is
taken on read in ep_poll_callback(), with poll items then added to the tail
of the list using the xchg atomic instruction.  The write lock is taken
everywhere else in order to stop list modifications and guarantee that list
updates are fully completed (I assume that the write side of an rwlock does
not starve; the qrwlock implementation seems to provide this guarantee).
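
Condensed, the resulting locking scheme looks roughly like this (extracted
from the patch below; error handling and the ->ovflist handling omitted):

	/* ep_poll_callback(): may run concurrently on many CPUs */
	read_lock_irqsave(&ep->lock, flags);
	if (!ep_is_linked(epi))
		/* xchg-based, safe with only the read lock held */
		list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
	read_unlock_irqrestore(&ep->lock, flags);

	/* everyone else, e.g. ep_scan_ready_list(): fully serialized */
	write_lock_irq(&ep->lock);	/* waits for all lockless adders */
	list_splice_init(&ep->rdllist, &txlist);
	write_unlock_irq(&ep->lock);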

The following are some microbenchmark results based on the test [1], which
starts threads that each generate N events.  The test ends when all
events have been successfully fetched by the poller thread:

spinlock
========

threads       run time    events per ms
-------       ---------   -------------
      8         13191ms         6064/ms
     16         30758ms         5201/ms
     32         44315ms         7220/ms

rwlock + xchg
=============

threads       run time    events per ms
-------       ---------   -------------
      8          8581ms         9323/ms
     16         13800ms        11594/ms
     32         24167ms        13240/ms

According to the results, the bandwidth of delivered events is significantly
increased and thus execution time is reduced.

This is an RFC because I did not run any benchmarks comparing the current
qrwlock and spinlock implementations (4.19 kernel), although I did not
notice any epoll performance degradation in other benchmarks.

Also, I'm not quite sure where to put this very special lockless variant
of adding an element to the list (list_add_tail_lockless() in this
patch).  Keeping it local seems safer.

[1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
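
For reference, here is a minimal user-space sketch of this kind of stress
test (an assumed shape, for illustration only; the real test is at [1]):
THREADS producer threads write to their own eventfds while a single
consumer thread fetches everything with epoll_wait():

#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>

#define THREADS 8
#define EVENTS  100000

static int efd[THREADS];

static void *producer(void *arg)
{
	uint64_t one = 1;
	int i, fd = efd[(long)arg];

	for (i = 0; i < EVENTS; i++)
		(void)write(fd, &one, sizeof(one));	/* wakes up the poller */
	return NULL;
}

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN }, out[THREADS];
	pthread_t thr[THREADS];
	int epfd = epoll_create1(0);
	unsigned long total = 0;
	long i;

	for (i = 0; i < THREADS; i++) {
		efd[i] = eventfd(0, EFD_NONBLOCK);
		ev.data.u32 = i;
		epoll_ctl(epfd, EPOLL_CTL_ADD, efd[i], &ev);
		pthread_create(&thr[i], NULL, producer, (void *)i);
	}
	/* consumer: fetch events until all writes are accounted for */
	while (total < (unsigned long)THREADS * EVENTS) {
		int n = epoll_wait(epfd, out, THREADS, -1);

		while (n-- > 0) {
			uint64_t cnt = 0;

			(void)read(efd[out[n].data.u32], &cnt, sizeof(cnt));
			total += cnt;	/* eventfd read returns the counter */
		}
	}
	for (i = 0; i < THREADS; i++)
		pthread_join(thr[i], NULL);
	return 0;
}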

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/eventpoll.c | 107 +++++++++++++++++++++++++++++++------------------
 1 file changed, 69 insertions(+), 38 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 42bbe6824b4b..89debda47aca 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -50,10 +50,10 @@
  *
  * 1) epmutex (mutex)
  * 2) ep->mtx (mutex)
- * 3) ep->wq.lock (spinlock)
+ * 3) ep->lock (rwlock)
  *
  * The acquire order is the one listed above, from 1 to 3.
- * We need a spinlock (ep->wq.lock) because we manipulate objects
+ * We need a rwlock (ep->lock) because we manipulate objects
  * from inside the poll callback, that might be triggered from
  * a wake_up() that in turn might be called from IRQ context.
  * So we can't sleep inside the poll callback and hence we need
@@ -85,7 +85,7 @@
  * of epoll file descriptors, we use the current recursion depth as
  * the lockdep subkey.
  * It is possible to drop the "ep->mtx" and to use the global
- * mutex "epmutex" (together with "ep->wq.lock") to have it working,
+ * mutex "epmutex" (together with "ep->lock") to have it working,
  * but having "ep->mtx" will make the interface more scalable.
  * Events that require holding "epmutex" are very rare, while for
  * normal operations the epoll private "ep->mtx" will guarantee
@@ -182,8 +182,6 @@ struct epitem {
  * This structure is stored inside the "private_data" member of the file
  * structure and represents the main data structure for the eventpoll
  * interface.
- *
- * Access to it is protected by the lock inside wq.
  */
 struct eventpoll {
 	/*
@@ -203,13 +201,16 @@ struct eventpoll {
 	/* List of ready file descriptors */
 	struct list_head rdllist;
 
+	/* Lock which protects rdllist and ovflist */
+	rwlock_t lock;
+
 	/* RB tree root used to store monitored fd structs */
 	struct rb_root_cached rbr;
 
 	/*
 	 * This is a single linked list that chains all the "struct epitem" that
 	 * happened while transferring ready events to userspace w/out
-	 * holding ->wq.lock.
+	 * holding ->lock.
 	 */
 	struct epitem *ovflist;
 
@@ -697,17 +698,17 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 * because we want the "sproc" callback to be able to do it
 	 * in a lockless way.
 	 */
-	spin_lock_irq(&ep->wq.lock);
+	write_lock_irq(&ep->lock);
 	list_splice_init(&ep->rdllist, &txlist);
 	ep->ovflist = NULL;
-	spin_unlock_irq(&ep->wq.lock);
+	write_unlock_irq(&ep->lock);
 
 	/*
 	 * Now call the callback function.
 	 */
 	res = (*sproc)(ep, &txlist, priv);
 
-	spin_lock_irq(&ep->wq.lock);
+	write_lock_irq(&ep->lock);
 	/*
 	 * During the time we spent inside the "sproc" callback, some
 	 * other events might have been queued by the poll callback.
@@ -722,7 +723,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 		 * contain them, and the list_splice() below takes care of them.
 		 */
 		if (!ep_is_linked(epi)) {
-			list_add_tail(&epi->rdllink, &ep->rdllist);
+			/* Reverse ->ovflist, events should be in FIFO */
+			list_add(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
 		}
 	}
@@ -745,11 +747,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 		 * the ->poll() wait list (delayed after we release the lock).
 		 */
 		if (waitqueue_active(&ep->wq))
-			wake_up_locked(&ep->wq);
+			wake_up(&ep->wq);
 		if (waitqueue_active(&ep->poll_wait))
 			pwake++;
 	}
-	spin_unlock_irq(&ep->wq.lock);
+	write_unlock_irq(&ep->lock);
 
 	if (!ep_locked)
 		mutex_unlock(&ep->mtx);
@@ -789,10 +791,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 
 	rb_erase_cached(&epi->rbn, &ep->rbr);
 
-	spin_lock_irq(&ep->wq.lock);
+	write_lock_irq(&ep->lock);
 	if (ep_is_linked(epi))
 		list_del_init(&epi->rdllink);
-	spin_unlock_irq(&ep->wq.lock);
+	write_unlock_irq(&ep->lock);
 
 	wakeup_source_unregister(ep_wakeup_source(epi));
 	/*
@@ -842,7 +844,7 @@ static void ep_free(struct eventpoll *ep)
 	 * Walks through the whole tree by freeing each "struct epitem". At this
 	 * point we are sure no poll callbacks will be lingering around, and also by
 	 * holding "epmutex" we can be sure that no file cleanup code will hit
-	 * us during this operation. So we can avoid the lock on "ep->wq.lock".
+	 * us during this operation. So we can avoid the lock on "ep->lock".
 	 * We do not need to lock ep->mtx, either, we only do it to prevent
 	 * a lockdep warning.
 	 */
@@ -1023,6 +1025,7 @@ static int ep_alloc(struct eventpoll **pep)
 		goto free_uid;
 
 	mutex_init(&ep->mtx);
+	rwlock_init(&ep->lock);
 	init_waitqueue_head(&ep->wq);
 	init_waitqueue_head(&ep->poll_wait);
 	INIT_LIST_HEAD(&ep->rdllist);
@@ -1112,10 +1115,38 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
 }
 #endif /* CONFIG_CHECKPOINT_RESTORE */
 
+/*
+ * Adds a new entry to the tail of the list in a lockless way, i.e.
+ * multiple CPUs are allowed to call this function concurrently.
+ *
+ * Beware: it is necessary to prevent any other modifications of the
+ *         existing list until all changes are completed, in other words
+ *         concurrent list_add_tail_lockless() calls should be protected
+ *         with a read lock, where write lock acts as a barrier which
+ *         makes sure all list_add_tail_lockless() calls are fully
+ *         completed.
+ */
+static inline void list_add_tail_lockless(struct list_head *new,
+					  struct list_head *head)
+{
+	struct list_head *prev;
+
+	new->next = head;
+	prev = xchg(&head->prev, new);
+	prev->next = new;
+	new->prev = prev;
+}
+
 /*
  * This is the callback that is passed to the wait queue wakeup
  * mechanism. It is called by the stored file descriptors when they
  * have events to report.
+ *
+ * This callback takes a read lock in order not to contend with concurrent
+ * events from other file descriptors, thus all modifications to ->rdllist
+ * or ->ovflist are lockless.  The read lock is paired with the write lock
+ * from ep_scan_ready_list(), which stops all list modifications and
+ * guarantees that lists won't be corrupted.
  */
 static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
@@ -1126,7 +1157,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	__poll_t pollflags = key_to_poll(key);
 	int ewake = 0;
 
-	spin_lock_irqsave(&ep->wq.lock, flags);
+	read_lock_irqsave(&ep->lock, flags);
 
 	ep_set_busy_poll_napi_id(epi);
 
@@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 */
 	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
 		if (epi->next == EP_UNACTIVE_PTR) {
-			epi->next = ep->ovflist;
-			ep->ovflist = epi;
+			/* Atomically exchange tail */
+			epi->next = xchg(&ep->ovflist, epi);
 			if (epi->ws) {
 				/*
 				 * Activate ep->ws since epi->ws may get
@@ -1172,7 +1203,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 
 	/* If this file is already in the ready list we exit soon */
 	if (!ep_is_linked(epi)) {
-		list_add_tail(&epi->rdllink, &ep->rdllist);
+		list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
 		ep_pm_stay_awake_rcu(epi);
 	}
 
@@ -1197,13 +1228,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 				break;
 			}
 		}
-		wake_up_locked(&ep->wq);
+		wake_up(&ep->wq);
 	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
 out_unlock:
-	spin_unlock_irqrestore(&ep->wq.lock, flags);
+	read_unlock_irqrestore(&ep->lock, flags);
 
 	/* We have to call this outside the lock */
 	if (pwake)
@@ -1489,7 +1520,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 		goto error_remove_epi;
 
 	/* We have to drop the new item inside our item list to keep track of it */
-	spin_lock_irq(&ep->wq.lock);
+	write_lock_irq(&ep->lock);
 
 	/* record NAPI ID of new item if present */
 	ep_set_busy_poll_napi_id(epi);
@@ -1501,12 +1532,12 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 
 		/* Notify waiting tasks that events are available */
 		if (waitqueue_active(&ep->wq))
-			wake_up_locked(&ep->wq);
+			wake_up(&ep->wq);
 		if (waitqueue_active(&ep->poll_wait))
 			pwake++;
 	}
 
-	spin_unlock_irq(&ep->wq.lock);
+	write_unlock_irq(&ep->lock);
 
 	atomic_long_inc(&ep->user->epoll_watches);
 
@@ -1532,10 +1563,10 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	 * list, since that is used/cleaned only inside a section bound by "mtx".
 	 * And ep_insert() is called with "mtx" held.
 	 */
-	spin_lock_irq(&ep->wq.lock);
+	write_lock_irq(&ep->lock);
 	if (ep_is_linked(epi))
 		list_del_init(&epi->rdllink);
-	spin_unlock_irq(&ep->wq.lock);
+	write_unlock_irq(&ep->lock);
 
 	wakeup_source_unregister(ep_wakeup_source(epi));
 
@@ -1579,9 +1610,9 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 * 1) Flush epi changes above to other CPUs.  This ensures
 	 *    we do not miss events from ep_poll_callback if an
 	 *    event occurs immediately after we call f_op->poll().
-	 *    We need this because we did not take ep->wq.lock while
+	 *    We need this because we did not take ep->lock while
 	 *    changing epi above (but ep_poll_callback does take
-	 *    ep->wq.lock).
+	 *    ep->lock).
 	 *
 	 * 2) We also need to ensure we do not miss _past_ events
 	 *    when calling f_op->poll().  This barrier also
@@ -1600,18 +1631,18 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
 	 * list, push it inside.
 	 */
 	if (ep_item_poll(epi, &pt, 1)) {
-		spin_lock_irq(&ep->wq.lock);
+		write_lock_irq(&ep->lock);
 		if (!ep_is_linked(epi)) {
 			list_add_tail(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
 
 			/* Notify waiting tasks that events are available */
 			if (waitqueue_active(&ep->wq))
-				wake_up_locked(&ep->wq);
+				wake_up(&ep->wq);
 			if (waitqueue_active(&ep->poll_wait))
 				pwake++;
 		}
-		spin_unlock_irq(&ep->wq.lock);
+		write_unlock_irq(&ep->lock);
 	}
 
 	/* We have to call this outside the lock */
@@ -1764,7 +1795,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * caller specified a non blocking operation.
 		 */
 		timed_out = 1;
-		spin_lock_irq(&ep->wq.lock);
+		write_lock_irq(&ep->lock);
 		goto check_events;
 	}
 
@@ -1773,7 +1804,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	if (!ep_events_available(ep))
 		ep_busy_loop(ep, timed_out);
 
-	spin_lock_irq(&ep->wq.lock);
+	write_lock_irq(&ep->lock);
 
 	if (!ep_events_available(ep)) {
 		/*
@@ -1789,7 +1820,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * ep_poll_callback() when events will become available.
 		 */
 		init_waitqueue_entry(&wait, current);
-		__add_wait_queue_exclusive(&ep->wq, &wait);
+		add_wait_queue_exclusive(&ep->wq, &wait);
 
 		for (;;) {
 			/*
@@ -1815,21 +1846,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 				break;
 			}
 
-			spin_unlock_irq(&ep->wq.lock);
+			write_unlock_irq(&ep->lock);
 			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
 				timed_out = 1;
 
-			spin_lock_irq(&ep->wq.lock);
+			write_lock_irq(&ep->lock);
 		}
 
-		__remove_wait_queue(&ep->wq, &wait);
+		remove_wait_queue(&ep->wq, &wait);
 		__set_current_state(TASK_RUNNING);
 	}
 check_events:
 	/* Is it worth to try to dig for events ? */
 	eavail = ep_events_available(ep);
 
-	spin_unlock_irq(&ep->wq.lock);
+	write_unlock_irq(&ep->lock);
 
 	/*
 	 * Try to transfer events to user space. In case we get 0 events and
-- 
2.19.1



* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-03 11:02 [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention Roman Penyaev
@ 2018-12-03 17:34 ` Linus Torvalds
  2018-12-04 11:50   ` Roman Penyaev
  2018-12-04 17:23 ` Jason Baron
  2018-12-05 23:46 ` Eric Wong
  2 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2018-12-03 17:34 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Al Viro, Paul McKenney, linux-fsdevel, Linux List Kernel Mailing

On Mon, Dec 3, 2018 at 3:03 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>
> Also I'm not quite sure where to put very special lockless variant
> of adding element to the list (list_add_tail_lockless() in this
> patch).  Seems keeping it locally is safer.

That function is scary, and can be mis-used so easily that I
definitely don't want to see it anywhere else.

Afaik, it's *really* important that only "add_tail" operations can be
done in parallel.

This also ends up making the memory ordering of "xchg()" very very
important. Yes, we've documented it as being an ordering op, but I'm
not sure we've relied on it this directly before.

I also note that now we do more/different locking in the waitqueue
handling, because the code now takes both that rwlock _and_ the
waitqueue spinlock for wakeup. That also makes me worried that the
"waitqueue_active()" games are no longer reliable. I think they're
fine (looks like they are only done under the write-lock, so it's
effectively the same serialization anyway), but the upshot of all of
this is that I *really* want others to look at this patch too. A lot
of small subtle things here.

                Linus


* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-03 17:34 ` Linus Torvalds
@ 2018-12-04 11:50   ` Roman Penyaev
  2018-12-04 23:59     ` Andrea Parri
  0 siblings, 1 reply; 19+ messages in thread
From: Roman Penyaev @ 2018-12-04 11:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Paul McKenney, linux-fsdevel, Linux List Kernel Mailing

On 2018-12-03 18:34, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 3:03 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>> 
>> Also I'm not quite sure where to put very special lockless variant
>> of adding element to the list (list_add_tail_lockless() in this
>> patch).  Seems keeping it locally is safer.
> 
> That function is scary, and can be mis-used so easily that I
> definitely don't want to see it anywhere else.
> 
> Afaik, it's *really* important that only "add_tail" operations can be
> done in parallel.

True: adding elements either only to the head or only to the tail can work
in parallel; any mix will corrupt the list.  I tried to reflect this in the
comment of list_add_tail_lockless(), although I'm not sure whether it has
become any clearer to a reader.
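
For illustration, a possible interleaving of two concurrent tail adds
(a sketch of the intended behaviour, not code from the patch):

	/*
	 * CPU0 adds A, CPU1 adds B, both under the read lock:
	 *
	 *   CPU0: A->next = head;
	 *   CPU1: B->next = head;
	 *   CPU0: prev = xchg(&head->prev, A);  // prev == old tail
	 *   CPU1: prev = xchg(&head->prev, B);  // prev == A (xchg serializes)
	 *   CPU1: A->next = B; B->prev = A;     // B is linked after A
	 *   CPU0: old_tail->next = A; A->prev = old_tail;
	 *
	 * Forward traversal from head past the old tail is possible only after
	 * both CPUs have finished, which is why any other list modification
	 * has to take the write lock first and wait for all readers to leave.
	 */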


> This also ends up making the memory ordering of "xchg()" very very
> important. Yes, we've documented it as being an ordering op, but I'm
> not sure we've relied on it this directly before.

It seems exit_mm() does exactly the same; see the following chunk:

		up_read(&mm->mmap_sem);

		self.task = current;
		self.next = xchg(&core_state->dumper.next, &self);


At least the code pattern looks similar.


> I also note that now we do more/different locking in the waitqueue
> handling, because the code now takes both that rwlock _and_ the
> waitqueue spinlock for wakeup. That also makes me worried that the
> "waitqueue_active()" games are no no longer reliable. I think they're
> fine (looks like they are only done under the write-lock, so it's
> effectively the same serialization anyway),


The only difference in waking up is that the same epollitem waitqueue can
be observed as active from different CPUs, while the real wake up happens
only once (wake_up() takes wq.lock, so it should be fine to call it
multiple times), but 1 is returned for all callers of ep_poll_callback()
who have seen the wq as active.

If the epollitem is created with the EPOLLEXCLUSIVE flag, then the 1
returned from ep_poll_callback() indicates "break the loop, an exclusive
wake up has happened" (the loop is in __wake_up_common).  But even in this
exclusive wake up case everything seems fine, because wake up events are
not lost and the woken epoll waiter will scan all ready fds and eventually
will observe all of the callers (who have returned 1 from
ep_poll_callback()) as ready.  I hope I did not miss anything.


> but the upshoot of all of
> this is that I *really* want others to look at this patch too. A lot
> of small subtle things here.

It would be great if someone could look through it; eventpoll.c looks a
bit abandoned.

--
Roman


* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-03 11:02 [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention Roman Penyaev
  2018-12-03 17:34 ` Linus Torvalds
@ 2018-12-04 17:23 ` Jason Baron
  2018-12-04 19:02   ` Paul E. McKenney
                     ` (4 more replies)
  2018-12-05 23:46 ` Eric Wong
  2 siblings, 5 replies; 19+ messages in thread
From: Jason Baron @ 2018-12-04 17:23 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Alexander Viro, Paul E. McKenney, Linus Torvalds, linux-fsdevel,
	linux-kernel



On 12/3/18 6:02 AM, Roman Penyaev wrote:
> Hi all,
> 
> The goal of this patch is to reduce contention of ep_poll_callback() which
> can be called concurrently from different CPUs in case of high events
> rates and many fds per epoll.  Problem can be very well reproduced by
> generating events (write to pipe or eventfd) from many threads, while
> consumer thread does polling.  In other words this patch increases the
> bandwidth of events which can be delivered from sources to the poller by
> adding poll items in a lockless way to the list.
> 
> The main change is in replacement of the spinlock with a rwlock, which is
> taken on read in ep_poll_callback(), and then by adding poll items to the
> tail of the list using xchg atomic instruction.  Write lock is taken
> everywhere else in order to stop list modifications and guarantee that list
> updates are fully completed (I assume that write side of a rwlock does not
> starve, it seems qrwlock implementation has these guarantees).
> 
> The following are some microbenchmark results based on the test [1] which
> starts threads which generate N events each.  The test ends when all
> events are successfully fetched by the poller thread:
> 
> spinlock
> ========
> 
> threads       run time    events per ms
> -------       ---------   -------------
>       8         13191ms         6064/ms
>      16         30758ms         5201/ms
>      32         44315ms         7220/ms
> 
> rwlock + xchg
> =============
> 
> threads       run time    events per ms
> -------       ---------   -------------
>       8          8581ms         9323/ms
>      16         13800ms        11594/ms
>      32         24167ms        13240/ms
> 
> According to the results bandwidth of delivered events is significantly
> increased, thus execution time is reduced.
> 
> This is RFC because I did not run any benchmarks comparing current
> qrwlock and spinlock implementations (4.19 kernel), although I did
> not notice any epoll performance degradations in other benchmarks.
> 
> Also I'm not quite sure where to put very special lockless variant
> of adding element to the list (list_add_tail_lockless() in this
> patch).  Seems keeping it locally is safer.
> 
> [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
> 
> Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  fs/eventpoll.c | 107 +++++++++++++++++++++++++++++++------------------
>  1 file changed, 69 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 42bbe6824b4b..89debda47aca 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -50,10 +50,10 @@
>   *
>   * 1) epmutex (mutex)
>   * 2) ep->mtx (mutex)
> - * 3) ep->wq.lock (spinlock)
> + * 3) ep->lock (rwlock)
>   *
>   * The acquire order is the one listed above, from 1 to 3.
> - * We need a spinlock (ep->wq.lock) because we manipulate objects
> + * We need a rwlock (ep->lock) because we manipulate objects
>   * from inside the poll callback, that might be triggered from
>   * a wake_up() that in turn might be called from IRQ context.
>   * So we can't sleep inside the poll callback and hence we need
> @@ -85,7 +85,7 @@
>   * of epoll file descriptors, we use the current recursion depth as
>   * the lockdep subkey.
>   * It is possible to drop the "ep->mtx" and to use the global
> - * mutex "epmutex" (together with "ep->wq.lock") to have it working,
> + * mutex "epmutex" (together with "ep->lock") to have it working,
>   * but having "ep->mtx" will make the interface more scalable.
>   * Events that require holding "epmutex" are very rare, while for
>   * normal operations the epoll private "ep->mtx" will guarantee
> @@ -182,8 +182,6 @@ struct epitem {
>   * This structure is stored inside the "private_data" member of the file
>   * structure and represents the main data structure for the eventpoll
>   * interface.
> - *
> - * Access to it is protected by the lock inside wq.
>   */
>  struct eventpoll {
>  	/*
> @@ -203,13 +201,16 @@ struct eventpoll {
>  	/* List of ready file descriptors */
>  	struct list_head rdllist;
>  
> +	/* Lock which protects rdllist and ovflist */
> +	rwlock_t lock;
> +
>  	/* RB tree root used to store monitored fd structs */
>  	struct rb_root_cached rbr;
>  
>  	/*
>  	 * This is a single linked list that chains all the "struct epitem" that
>  	 * happened while transferring ready events to userspace w/out
> -	 * holding ->wq.lock.
> +	 * holding ->lock.
>  	 */
>  	struct epitem *ovflist;
>  
> @@ -697,17 +698,17 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>  	 * because we want the "sproc" callback to be able to do it
>  	 * in a lockless way.
>  	 */
> -	spin_lock_irq(&ep->wq.lock);
> +	write_lock_irq(&ep->lock);
>  	list_splice_init(&ep->rdllist, &txlist);
>  	ep->ovflist = NULL;
> -	spin_unlock_irq(&ep->wq.lock);
> +	write_unlock_irq(&ep->lock);
>  
>  	/*
>  	 * Now call the callback function.
>  	 */
>  	res = (*sproc)(ep, &txlist, priv);
>  
> -	spin_lock_irq(&ep->wq.lock);
> +	write_lock_irq(&ep->lock);
>  	/*
>  	 * During the time we spent inside the "sproc" callback, some
>  	 * other events might have been queued by the poll callback.
> @@ -722,7 +723,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>  		 * contain them, and the list_splice() below takes care of them.
>  		 */
>  		if (!ep_is_linked(epi)) {
> -			list_add_tail(&epi->rdllink, &ep->rdllist);
> +			/* Reverse ->ovflist, events should be in FIFO */
> +			list_add(&epi->rdllink, &ep->rdllist);
>  			ep_pm_stay_awake(epi);
>  		}
>  	}
> @@ -745,11 +747,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>  		 * the ->poll() wait list (delayed after we release the lock).
>  		 */
>  		if (waitqueue_active(&ep->wq))
> -			wake_up_locked(&ep->wq);
> +			wake_up(&ep->wq);
>  		if (waitqueue_active(&ep->poll_wait))
>  			pwake++;
>  	}
> -	spin_unlock_irq(&ep->wq.lock);
> +	write_unlock_irq(&ep->lock);
>  
>  	if (!ep_locked)
>  		mutex_unlock(&ep->mtx);
> @@ -789,10 +791,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>  
>  	rb_erase_cached(&epi->rbn, &ep->rbr);
>  
> -	spin_lock_irq(&ep->wq.lock);
> +	write_lock_irq(&ep->lock);
>  	if (ep_is_linked(epi))
>  		list_del_init(&epi->rdllink);
> -	spin_unlock_irq(&ep->wq.lock);
> +	write_unlock_irq(&ep->lock);
>  
>  	wakeup_source_unregister(ep_wakeup_source(epi));
>  	/*
> @@ -842,7 +844,7 @@ static void ep_free(struct eventpoll *ep)
>  	 * Walks through the whole tree by freeing each "struct epitem". At this
>  	 * point we are sure no poll callbacks will be lingering around, and also by
>  	 * holding "epmutex" we can be sure that no file cleanup code will hit
> -	 * us during this operation. So we can avoid the lock on "ep->wq.lock".
> +	 * us during this operation. So we can avoid the lock on "ep->lock".
>  	 * We do not need to lock ep->mtx, either, we only do it to prevent
>  	 * a lockdep warning.
>  	 */
> @@ -1023,6 +1025,7 @@ static int ep_alloc(struct eventpoll **pep)
>  		goto free_uid;
>  
>  	mutex_init(&ep->mtx);
> +	rwlock_init(&ep->lock);
>  	init_waitqueue_head(&ep->wq);
>  	init_waitqueue_head(&ep->poll_wait);
>  	INIT_LIST_HEAD(&ep->rdllist);
> @@ -1112,10 +1115,38 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
>  }
>  #endif /* CONFIG_CHECKPOINT_RESTORE */
>  
> +/*
> + * Adds a new entry to the tail of the list in a lockless way, i.e.
> + * multiple CPUs are allowed to call this function concurrently.
> + *
> + * Beware: it is necessary to prevent any other modifications of the
> + *         existing list until all changes are completed, in other words
> + *         concurrent list_add_tail_lockless() calls should be protected
> + *         with a read lock, where write lock acts as a barrier which
> + *         makes sure all list_add_tail_lockless() calls are fully
> + *         completed.
> + */
> +static inline void list_add_tail_lockless(struct list_head *new,
> +					  struct list_head *head)
> +{
> +	struct list_head *prev;
> +
> +	new->next = head;
> +	prev = xchg(&head->prev, new);
> +	prev->next = new;
> +	new->prev = prev;
> +}
> +
>  /*
>   * This is the callback that is passed to the wait queue wakeup
>   * mechanism. It is called by the stored file descriptors when they
>   * have events to report.
> + *
> + * This callback takes a read lock in order not to content with concurrent
> + * events from another file descriptors, thus all modifications to ->rdllist
> + * or ->ovflist are lockless.  Read lock is paired with the write lock from
> + * ep_scan_ready_list(), which stops all list modifications and guarantees
> + * that lists won't be corrupted.
>   */
>  static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  {
> @@ -1126,7 +1157,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>  	__poll_t pollflags = key_to_poll(key);
>  	int ewake = 0;
>  
> -	spin_lock_irqsave(&ep->wq.lock, flags);
> +	read_lock_irqsave(&ep->lock, flags);
>  
>  	ep_set_busy_poll_napi_id(epi);
>  
> @@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>  	 */
>  	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
>  		if (epi->next == EP_UNACTIVE_PTR) {
> -			epi->next = ep->ovflist;
> -			ep->ovflist = epi;
> +			/* Atomically exchange tail */
> +			epi->next = xchg(&ep->ovflist, epi);

This also relies on the fact that the same epi can't be added to the
list in parallel, because the wait queue doing the wakeup will have the
wait_queue_head lock. That was a little confusing for me b/c we only had
the read_lock() above.

>  			if (epi->ws) {
>  				/*
>  				 * Activate ep->ws since epi->ws may get
> @@ -1172,7 +1203,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>  
>  	/* If this file is already in the ready list we exit soon */
>  	if (!ep_is_linked(epi)) {
> -		list_add_tail(&epi->rdllink, &ep->rdllist);
> +		list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
>  		ep_pm_stay_awake_rcu(epi);
>  	}

same for this.

>  
> @@ -1197,13 +1228,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>  				break;
>  			}
>  		}
> -		wake_up_locked(&ep->wq);
> +		wake_up(&ep->wq);

why the switch here away from the locked() variant? Shouldn't the 'reader'
side, which in this case takes the rwlock for write, see all updates in a
coherent state at this point?

>  	}
>  	if (waitqueue_active(&ep->poll_wait))
>  		pwake++;
>  
>  out_unlock:
> -	spin_unlock_irqrestore(&ep->wq.lock, flags);
> +	read_unlock_irqrestore(&ep->lock, flags);
>  
>  	/* We have to call this outside the lock */
>  	if (pwake)
> @@ -1489,7 +1520,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>  		goto error_remove_epi;
>  
>  	/* We have to drop the new item inside our item list to keep track of it */
> -	spin_lock_irq(&ep->wq.lock);
> +	write_lock_irq(&ep->lock);
>  
>  	/* record NAPI ID of new item if present */
>  	ep_set_busy_poll_napi_id(epi);
> @@ -1501,12 +1532,12 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>  
>  		/* Notify waiting tasks that events are available */
>  		if (waitqueue_active(&ep->wq))
> -			wake_up_locked(&ep->wq);
> +			wake_up(&ep->wq);

is this one necessary to switch as well? Is this to make lockdep happy?
Looks like there are a few more conversions below too...

Thanks,

-Jason



>  		if (waitqueue_active(&ep->poll_wait))
>  			pwake++;
>  	}
>  
> -	spin_unlock_irq(&ep->wq.lock);
> +	write_unlock_irq(&ep->lock);
>  
>  	atomic_long_inc(&ep->user->epoll_watches);
>  
> @@ -1532,10 +1563,10 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>  	 * list, since that is used/cleaned only inside a section bound by "mtx".
>  	 * And ep_insert() is called with "mtx" held.
>  	 */
> -	spin_lock_irq(&ep->wq.lock);
> +	write_lock_irq(&ep->lock);
>  	if (ep_is_linked(epi))
>  		list_del_init(&epi->rdllink);
> -	spin_unlock_irq(&ep->wq.lock);
> +	write_unlock_irq(&ep->lock);
>  
>  	wakeup_source_unregister(ep_wakeup_source(epi));
>  
> @@ -1579,9 +1610,9 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
>  	 * 1) Flush epi changes above to other CPUs.  This ensures
>  	 *    we do not miss events from ep_poll_callback if an
>  	 *    event occurs immediately after we call f_op->poll().
> -	 *    We need this because we did not take ep->wq.lock while
> +	 *    We need this because we did not take ep->lock while
>  	 *    changing epi above (but ep_poll_callback does take
> -	 *    ep->wq.lock).
> +	 *    ep->lock).
>  	 *
>  	 * 2) We also need to ensure we do not miss _past_ events
>  	 *    when calling f_op->poll().  This barrier also
> @@ -1600,18 +1631,18 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
>  	 * list, push it inside.
>  	 */
>  	if (ep_item_poll(epi, &pt, 1)) {
> -		spin_lock_irq(&ep->wq.lock);
> +		write_lock_irq(&ep->lock);
>  		if (!ep_is_linked(epi)) {
>  			list_add_tail(&epi->rdllink, &ep->rdllist);
>  			ep_pm_stay_awake(epi);
>  
>  			/* Notify waiting tasks that events are available */
>  			if (waitqueue_active(&ep->wq))
> -				wake_up_locked(&ep->wq);
> +				wake_up(&ep->wq);
>  			if (waitqueue_active(&ep->poll_wait))
>  				pwake++;
>  		}
> -		spin_unlock_irq(&ep->wq.lock);
> +		write_unlock_irq(&ep->lock);
>  	}
>  
>  	/* We have to call this outside the lock */
> @@ -1764,7 +1795,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  		 * caller specified a non blocking operation.
>  		 */
>  		timed_out = 1;
> -		spin_lock_irq(&ep->wq.lock);
> +		write_lock_irq(&ep->lock);
>  		goto check_events;
>  	}
>  
> @@ -1773,7 +1804,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  	if (!ep_events_available(ep))
>  		ep_busy_loop(ep, timed_out);
>  
> -	spin_lock_irq(&ep->wq.lock);
> +	write_lock_irq(&ep->lock);
>  
>  	if (!ep_events_available(ep)) {
>  		/*
> @@ -1789,7 +1820,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  		 * ep_poll_callback() when events will become available.
>  		 */
>  		init_waitqueue_entry(&wait, current);
> -		__add_wait_queue_exclusive(&ep->wq, &wait);
> +		add_wait_queue_exclusive(&ep->wq, &wait);
>  
>  		for (;;) {
>  			/*
> @@ -1815,21 +1846,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  				break;
>  			}
>  
> -			spin_unlock_irq(&ep->wq.lock);
> +			write_unlock_irq(&ep->lock);
>  			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
>  				timed_out = 1;
>  
> -			spin_lock_irq(&ep->wq.lock);
> +			write_lock_irq(&ep->lock);
>  		}
>  
> -		__remove_wait_queue(&ep->wq, &wait);
> +		remove_wait_queue(&ep->wq, &wait);
>  		__set_current_state(TASK_RUNNING);
>  	}
>  check_events:
>  	/* Is it worth to try to dig for events ? */
>  	eavail = ep_events_available(ep);
>  
> -	spin_unlock_irq(&ep->wq.lock);
> +	write_unlock_irq(&ep->lock);
>  
>  	/*
>  	 * Try to transfer events to user space. In case we get 0 events and
> 


* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-04 17:23 ` Jason Baron
@ 2018-12-04 19:02   ` Paul E. McKenney
  2018-12-05 11:22     ` Roman Penyaev
  2018-12-05 11:16   ` Roman Penyaev
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2018-12-04 19:02 UTC (permalink / raw)
  To: Jason Baron
  Cc: Roman Penyaev, Alexander Viro, Linus Torvalds, linux-fsdevel,
	linux-kernel

On Tue, Dec 04, 2018 at 12:23:08PM -0500, Jason Baron wrote:
> 
> 
> On 12/3/18 6:02 AM, Roman Penyaev wrote:
> > Hi all,
> > 
> > The goal of this patch is to reduce contention of ep_poll_callback() which
> > can be called concurrently from different CPUs in case of high events
> > rates and many fds per epoll.  Problem can be very well reproduced by
> > generating events (write to pipe or eventfd) from many threads, while
> > consumer thread does polling.  In other words this patch increases the
> > bandwidth of events which can be delivered from sources to the poller by
> > adding poll items in a lockless way to the list.
> > 
> > The main change is in replacement of the spinlock with a rwlock, which is
> > taken on read in ep_poll_callback(), and then by adding poll items to the
> > tail of the list using xchg atomic instruction.  Write lock is taken
> > everywhere else in order to stop list modifications and guarantee that list
> > updates are fully completed (I assume that write side of a rwlock does not
> > starve, it seems qrwlock implementation has these guarantees).
> > 
> > The following are some microbenchmark results based on the test [1] which
> > starts threads which generate N events each.  The test ends when all
> > events are successfully fetched by the poller thread:
> > 
> > spinlock
> > ========
> > 
> > threads       run time    events per ms
> > -------       ---------   -------------
> >       8         13191ms         6064/ms
> >      16         30758ms         5201/ms
> >      32         44315ms         7220/ms
> > 
> > rwlock + xchg
> > =============
> > 
> > threads       run time    events per ms
> > -------       ---------   -------------
> >       8          8581ms         9323/ms
> >      16         13800ms        11594/ms
> >      32         24167ms        13240/ms
> > 
> > According to the results bandwidth of delivered events is significantly
> > increased, thus execution time is reduced.
> > 
> > This is RFC because I did not run any benchmarks comparing current
> > qrwlock and spinlock implementations (4.19 kernel), although I did
> > not notice any epoll performance degradations in other benchmarks.
> > 
> > Also I'm not quite sure where to put very special lockless variant
> > of adding element to the list (list_add_tail_lockless() in this
> > patch).  Seems keeping it locally is safer.
> > 
> > [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
> > 
> > Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  fs/eventpoll.c | 107 +++++++++++++++++++++++++++++++------------------
> >  1 file changed, 69 insertions(+), 38 deletions(-)
> > 
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index 42bbe6824b4b..89debda47aca 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -50,10 +50,10 @@
> >   *
> >   * 1) epmutex (mutex)
> >   * 2) ep->mtx (mutex)
> > - * 3) ep->wq.lock (spinlock)
> > + * 3) ep->lock (rwlock)
> >   *
> >   * The acquire order is the one listed above, from 1 to 3.
> > - * We need a spinlock (ep->wq.lock) because we manipulate objects
> > + * We need a rwlock (ep->lock) because we manipulate objects
> >   * from inside the poll callback, that might be triggered from
> >   * a wake_up() that in turn might be called from IRQ context.
> >   * So we can't sleep inside the poll callback and hence we need
> > @@ -85,7 +85,7 @@
> >   * of epoll file descriptors, we use the current recursion depth as
> >   * the lockdep subkey.
> >   * It is possible to drop the "ep->mtx" and to use the global
> > - * mutex "epmutex" (together with "ep->wq.lock") to have it working,
> > + * mutex "epmutex" (together with "ep->lock") to have it working,
> >   * but having "ep->mtx" will make the interface more scalable.
> >   * Events that require holding "epmutex" are very rare, while for
> >   * normal operations the epoll private "ep->mtx" will guarantee
> > @@ -182,8 +182,6 @@ struct epitem {
> >   * This structure is stored inside the "private_data" member of the file
> >   * structure and represents the main data structure for the eventpoll
> >   * interface.
> > - *
> > - * Access to it is protected by the lock inside wq.
> >   */
> >  struct eventpoll {
> >  	/*
> > @@ -203,13 +201,16 @@ struct eventpoll {
> >  	/* List of ready file descriptors */
> >  	struct list_head rdllist;
> >  
> > +	/* Lock which protects rdllist and ovflist */
> > +	rwlock_t lock;
> > +
> >  	/* RB tree root used to store monitored fd structs */
> >  	struct rb_root_cached rbr;
> >  
> >  	/*
> >  	 * This is a single linked list that chains all the "struct epitem" that
> >  	 * happened while transferring ready events to userspace w/out
> > -	 * holding ->wq.lock.
> > +	 * holding ->lock.
> >  	 */
> >  	struct epitem *ovflist;
> >  
> > @@ -697,17 +698,17 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
> >  	 * because we want the "sproc" callback to be able to do it
> >  	 * in a lockless way.
> >  	 */
> > -	spin_lock_irq(&ep->wq.lock);
> > +	write_lock_irq(&ep->lock);
> >  	list_splice_init(&ep->rdllist, &txlist);
> >  	ep->ovflist = NULL;
> > -	spin_unlock_irq(&ep->wq.lock);
> > +	write_unlock_irq(&ep->lock);
> >  
> >  	/*
> >  	 * Now call the callback function.
> >  	 */
> >  	res = (*sproc)(ep, &txlist, priv);
> >  
> > -	spin_lock_irq(&ep->wq.lock);
> > +	write_lock_irq(&ep->lock);
> >  	/*
> >  	 * During the time we spent inside the "sproc" callback, some
> >  	 * other events might have been queued by the poll callback.
> > @@ -722,7 +723,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
> >  		 * contain them, and the list_splice() below takes care of them.
> >  		 */
> >  		if (!ep_is_linked(epi)) {
> > -			list_add_tail(&epi->rdllink, &ep->rdllist);
> > +			/* Reverse ->ovflist, events should be in FIFO */
> > +			list_add(&epi->rdllink, &ep->rdllist);
> >  			ep_pm_stay_awake(epi);
> >  		}
> >  	}
> > @@ -745,11 +747,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
> >  		 * the ->poll() wait list (delayed after we release the lock).
> >  		 */
> >  		if (waitqueue_active(&ep->wq))
> > -			wake_up_locked(&ep->wq);
> > +			wake_up(&ep->wq);
> >  		if (waitqueue_active(&ep->poll_wait))
> >  			pwake++;
> >  	}
> > -	spin_unlock_irq(&ep->wq.lock);
> > +	write_unlock_irq(&ep->lock);
> >  
> >  	if (!ep_locked)
> >  		mutex_unlock(&ep->mtx);
> > @@ -789,10 +791,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> >  
> >  	rb_erase_cached(&epi->rbn, &ep->rbr);
> >  
> > -	spin_lock_irq(&ep->wq.lock);
> > +	write_lock_irq(&ep->lock);
> >  	if (ep_is_linked(epi))
> >  		list_del_init(&epi->rdllink);
> > -	spin_unlock_irq(&ep->wq.lock);
> > +	write_unlock_irq(&ep->lock);
> >  
> >  	wakeup_source_unregister(ep_wakeup_source(epi));
> >  	/*
> > @@ -842,7 +844,7 @@ static void ep_free(struct eventpoll *ep)
> >  	 * Walks through the whole tree by freeing each "struct epitem". At this
> >  	 * point we are sure no poll callbacks will be lingering around, and also by
> >  	 * holding "epmutex" we can be sure that no file cleanup code will hit
> > -	 * us during this operation. So we can avoid the lock on "ep->wq.lock".
> > +	 * us during this operation. So we can avoid the lock on "ep->lock".
> >  	 * We do not need to lock ep->mtx, either, we only do it to prevent
> >  	 * a lockdep warning.
> >  	 */
> > @@ -1023,6 +1025,7 @@ static int ep_alloc(struct eventpoll **pep)
> >  		goto free_uid;
> >  
> >  	mutex_init(&ep->mtx);
> > +	rwlock_init(&ep->lock);
> >  	init_waitqueue_head(&ep->wq);
> >  	init_waitqueue_head(&ep->poll_wait);
> >  	INIT_LIST_HEAD(&ep->rdllist);
> > @@ -1112,10 +1115,38 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
> >  }
> >  #endif /* CONFIG_CHECKPOINT_RESTORE */
> >  
> > +/*
> > + * Adds a new entry to the tail of the list in a lockless way, i.e.
> > + * multiple CPUs are allowed to call this function concurrently.
> > + *
> > + * Beware: it is necessary to prevent any other modifications of the
> > + *         existing list until all changes are completed, in other words
> > + *         concurrent list_add_tail_lockless() calls should be protected
> > + *         with a read lock, where write lock acts as a barrier which
> > + *         makes sure all list_add_tail_lockless() calls are fully
> > + *         completed.
> > + */
> > +static inline void list_add_tail_lockless(struct list_head *new,
> > +					  struct list_head *head)
> > +{
> > +	struct list_head *prev;
> > +
> > +	new->next = head;
> > +	prev = xchg(&head->prev, new);
> > +	prev->next = new;
> > +	new->prev = prev;
> > +}
> > +
> >  /*
> >   * This is the callback that is passed to the wait queue wakeup
> >   * mechanism. It is called by the stored file descriptors when they
> >   * have events to report.
> > + *
> > + * This callback takes a read lock in order not to content with concurrent
> > + * events from another file descriptors, thus all modifications to ->rdllist
> > + * or ->ovflist are lockless.  Read lock is paired with the write lock from
> > + * ep_scan_ready_list(), which stops all list modifications and guarantees
> > + * that lists won't be corrupted.
> >   */
> >  static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
> >  {
> > @@ -1126,7 +1157,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
> >  	__poll_t pollflags = key_to_poll(key);
> >  	int ewake = 0;
> >  
> > -	spin_lock_irqsave(&ep->wq.lock, flags);
> > +	read_lock_irqsave(&ep->lock, flags);
> >  
> >  	ep_set_busy_poll_napi_id(epi);
> >  
> > @@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
> >  	 */
> >  	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
> >  		if (epi->next == EP_UNACTIVE_PTR) {
> > -			epi->next = ep->ovflist;
> > -			ep->ovflist = epi;
> > +			/* Atomically exchange tail */
> > +			epi->next = xchg(&ep->ovflist, epi);
> 
> This also relies on the fact that the same epi can't be added to the
> list in parallel, because the wait queue doing the wakeup will have the
> wait_queue_head lock. That was a little confusing for me b/c we only had
> the read_lock() above.

I missed this as well.

I was also concerned about "fly-by" wakeups where the to-be-awoken task
never really goes to sleep, but it looks like tasks are unconditionally
queued, which seems like it should avoid that problem.

Please do some testing with artificial delays in the lockless queuing
code, for example, via udelay() or similar.  If there are races, this
would help force them to happen.
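
For example, something along these lines (a debug-only sketch, not intended
for merging; it assumes a udelay() in this IRQs-off section is acceptable
on a test kernel, and needs <linux/delay.h>):

	static inline void list_add_tail_lockless(struct list_head *new,
						  struct list_head *head)
	{
		struct list_head *prev;

		new->next = head;
		prev = xchg(&head->prev, new);
		/* Debug only: widen the window between claiming the old tail
		 * and publishing the forward link, to provoke latent races. */
		udelay(100);
		prev->next = new;
		new->prev = prev;
	}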

							Thanx, Paul

> >  			if (epi->ws) {
> >  				/*
> >  				 * Activate ep->ws since epi->ws may get
> > @@ -1172,7 +1203,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
> >  
> >  	/* If this file is already in the ready list we exit soon */
> >  	if (!ep_is_linked(epi)) {
> > -		list_add_tail(&epi->rdllink, &ep->rdllist);
> > +		list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
> >  		ep_pm_stay_awake_rcu(epi);
> >  	}
> 
> same for this.
> 
> >  
> > @@ -1197,13 +1228,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
> >  				break;
> >  			}
> >  		}
> > -		wake_up_locked(&ep->wq);
> > +		wake_up(&ep->wq);
> 
> why the switch here to the locked() variant? Shouldn't the 'reader'
> side, in this case which takes the rwlock for write see all updates in a
> coherent state at this point?
> 
> >  	}
> >  	if (waitqueue_active(&ep->poll_wait))
> >  		pwake++;
> >  
> >  out_unlock:
> > -	spin_unlock_irqrestore(&ep->wq.lock, flags);
> > +	read_unlock_irqrestore(&ep->lock, flags);
> >  
> >  	/* We have to call this outside the lock */
> >  	if (pwake)
> > @@ -1489,7 +1520,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
> >  		goto error_remove_epi;
> >  
> >  	/* We have to drop the new item inside our item list to keep track of it */
> > -	spin_lock_irq(&ep->wq.lock);
> > +	write_lock_irq(&ep->lock);
> >  
> >  	/* record NAPI ID of new item if present */
> >  	ep_set_busy_poll_napi_id(epi);
> > @@ -1501,12 +1532,12 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
> >  
> >  		/* Notify waiting tasks that events are available */
> >  		if (waitqueue_active(&ep->wq))
> > -			wake_up_locked(&ep->wq);
> > +			wake_up(&ep->wq);
> 
> is this necessary to switch as well? Is this to make lockdep happy?
> Looks like there are few more conversions too below...
> 
> Thanks,
> 
> -Jason
> 
> 
> 
> >  		if (waitqueue_active(&ep->poll_wait))
> >  			pwake++;
> >  	}
> >  
> > -	spin_unlock_irq(&ep->wq.lock);
> > +	write_unlock_irq(&ep->lock);
> >  
> >  	atomic_long_inc(&ep->user->epoll_watches);
> >  
> > @@ -1532,10 +1563,10 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
> >  	 * list, since that is used/cleaned only inside a section bound by "mtx".
> >  	 * And ep_insert() is called with "mtx" held.
> >  	 */
> > -	spin_lock_irq(&ep->wq.lock);
> > +	write_lock_irq(&ep->lock);
> >  	if (ep_is_linked(epi))
> >  		list_del_init(&epi->rdllink);
> > -	spin_unlock_irq(&ep->wq.lock);
> > +	write_unlock_irq(&ep->lock);
> >  
> >  	wakeup_source_unregister(ep_wakeup_source(epi));
> >  
> > @@ -1579,9 +1610,9 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
> >  	 * 1) Flush epi changes above to other CPUs.  This ensures
> >  	 *    we do not miss events from ep_poll_callback if an
> >  	 *    event occurs immediately after we call f_op->poll().
> > -	 *    We need this because we did not take ep->wq.lock while
> > +	 *    We need this because we did not take ep->lock while
> >  	 *    changing epi above (but ep_poll_callback does take
> > -	 *    ep->wq.lock).
> > +	 *    ep->lock).
> >  	 *
> >  	 * 2) We also need to ensure we do not miss _past_ events
> >  	 *    when calling f_op->poll().  This barrier also
> > @@ -1600,18 +1631,18 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
> >  	 * list, push it inside.
> >  	 */
> >  	if (ep_item_poll(epi, &pt, 1)) {
> > -		spin_lock_irq(&ep->wq.lock);
> > +		write_lock_irq(&ep->lock);
> >  		if (!ep_is_linked(epi)) {
> >  			list_add_tail(&epi->rdllink, &ep->rdllist);
> >  			ep_pm_stay_awake(epi);
> >  
> >  			/* Notify waiting tasks that events are available */
> >  			if (waitqueue_active(&ep->wq))
> > -				wake_up_locked(&ep->wq);
> > +				wake_up(&ep->wq);
> >  			if (waitqueue_active(&ep->poll_wait))
> >  				pwake++;
> >  		}
> > -		spin_unlock_irq(&ep->wq.lock);
> > +		write_unlock_irq(&ep->lock);
> >  	}
> >  
> >  	/* We have to call this outside the lock */
> > @@ -1764,7 +1795,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >  		 * caller specified a non blocking operation.
> >  		 */
> >  		timed_out = 1;
> > -		spin_lock_irq(&ep->wq.lock);
> > +		write_lock_irq(&ep->lock);
> >  		goto check_events;
> >  	}
> >  
> > @@ -1773,7 +1804,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >  	if (!ep_events_available(ep))
> >  		ep_busy_loop(ep, timed_out);
> >  
> > -	spin_lock_irq(&ep->wq.lock);
> > +	write_lock_irq(&ep->lock);
> >  
> >  	if (!ep_events_available(ep)) {
> >  		/*
> > @@ -1789,7 +1820,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >  		 * ep_poll_callback() when events will become available.
> >  		 */
> >  		init_waitqueue_entry(&wait, current);
> > -		__add_wait_queue_exclusive(&ep->wq, &wait);
> > +		add_wait_queue_exclusive(&ep->wq, &wait);
> >  
> >  		for (;;) {
> >  			/*
> > @@ -1815,21 +1846,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >  				break;
> >  			}
> >  
> > -			spin_unlock_irq(&ep->wq.lock);
> > +			write_unlock_irq(&ep->lock);
> >  			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
> >  				timed_out = 1;
> >  
> > -			spin_lock_irq(&ep->wq.lock);
> > +			write_lock_irq(&ep->lock);
> >  		}
> >  
> > -		__remove_wait_queue(&ep->wq, &wait);
> > +		remove_wait_queue(&ep->wq, &wait);
> >  		__set_current_state(TASK_RUNNING);
> >  	}
> >  check_events:
> >  	/* Is it worth to try to dig for events ? */
> >  	eavail = ep_events_available(ep);
> >  
> > -	spin_unlock_irq(&ep->wq.lock);
> > +	write_unlock_irq(&ep->lock);
> >  
> >  	/*
> >  	 * Try to transfer events to user space. In case we get 0 events and
> > 
> 



* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-04 11:50   ` Roman Penyaev
@ 2018-12-04 23:59     ` Andrea Parri
  2018-12-05 11:25       ` Roman Penyaev
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Parri @ 2018-12-04 23:59 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Linus Torvalds, Al Viro, Paul McKenney, linux-fsdevel,
	Linux List Kernel Mailing

Hi Roman,

On Tue, Dec 04, 2018 at 12:50:58PM +0100, Roman Penyaev wrote:
> On 2018-12-03 18:34, Linus Torvalds wrote:

> > This also ends up making the memory ordering of "xchg()" very very
> > important. Yes, we've documented it as being an ordering op, but I'm
> > not sure we've relied on it this directly before.
> 
> Seems exit_mm() does exactly the same, the following chunk:
> 
> 		up_read(&mm->mmap_sem);
> 
> 		self.task = current;
> 		self.next = xchg(&core_state->dumper.next, &self);
> 
> 
> At least code pattern looks similar.

Maybe add a comment on top of (your) xchg() to note/justify these memory
ordering requirements?  As Paul said, "if there are races, this would
help force them to happen" (and it would simplify the review, both this
one and future ones).
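
For instance, something like this (illustrative wording only):

	new->next = head;
	/*
	 * xchg() implies a full memory barrier, so the store to new->next
	 * above is ordered before the new node is published via head->prev;
	 * it also serializes concurrent adders, each one seeing a unique
	 * 'prev'.
	 */
	prev = xchg(&head->prev, new);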

  Andrea


* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-04 17:23 ` Jason Baron
  2018-12-04 19:02   ` Paul E. McKenney
@ 2018-12-05 11:16   ` Roman Penyaev
  2018-12-05 16:38     ` Jason Baron
  2018-12-06  1:54   ` Davidlohr Bueso
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Roman Penyaev @ 2018-12-05 11:16 UTC (permalink / raw)
  To: Jason Baron
  Cc: Alexander Viro, Paul E. McKenney, Linus Torvalds, linux-fsdevel,
	linux-kernel

On 2018-12-04 18:23, Jason Baron wrote:
> On 12/3/18 6:02 AM, Roman Penyaev wrote:

[...]

>> 
>>  	ep_set_busy_poll_napi_id(epi);
>> 
>> @@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t 
>> *wait, unsigned mode, int sync, v
>>  	 */
>>  	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
>>  		if (epi->next == EP_UNACTIVE_PTR) {
>> -			epi->next = ep->ovflist;
>> -			ep->ovflist = epi;
>> +			/* Atomically exchange tail */
>> +			epi->next = xchg(&ep->ovflist, epi);
> 
> This also relies on the fact that the same epi can't be added to the
> list in parallel, because the wait queue doing the wakeup will have the
> wait_queue_head lock. That was a little confusing for me b/c we only 
> had
> the read_lock() above.

Yes, that is indeed a non-obvious path, but the wq is locked by the
wake_up_*_poll() call, or the caller of wake_up_locked_poll() has to make
sure wq.lock is taken.

I'll add an explicit comment here, thanks for pointing that out.
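
Roughly, the path in question is (simplified):

	/*
	 * wake_up_interruptible_poll(wq, ...)      <- event source side
	 *   __wake_up()
	 *     spin_lock_irqsave(&wq->lock, ...)
	 *       __wake_up_common()
	 *         ep_poll_callback()               <- called with wq->lock held
	 *     spin_unlock_irqrestore(&wq->lock, ...)
	 *
	 * so ep_poll_callback() for one and the same epi is serialized by the
	 * lock of the event source's wait queue, not by ep->lock.
	 */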

> 
>>  			if (epi->ws) {
>>  				/*
>>  				 * Activate ep->ws since epi->ws may get
>> @@ -1172,7 +1203,7 @@ static int ep_poll_callback(wait_queue_entry_t 
>> *wait, unsigned mode, int sync, v
>> 
>>  	/* If this file is already in the ready list we exit soon */
>>  	if (!ep_is_linked(epi)) {
>> -		list_add_tail(&epi->rdllink, &ep->rdllist);
>> +		list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
>>  		ep_pm_stay_awake_rcu(epi);
>>  	}
> 
> same for this.

... and an explicit comment here.

> 
>> 
>> @@ -1197,13 +1228,13 @@ static int ep_poll_callback(wait_queue_entry_t 
>> *wait, unsigned mode, int sync, v
>>  				break;
>>  			}
>>  		}
>> -		wake_up_locked(&ep->wq);
>> +		wake_up(&ep->wq);
> 
> why the switch here to the locked() variant? Shouldn't the 'reader'
> side, in this case which takes the rwlock for write see all updates in 
> a
> coherent state at this point?

lockdep inside __wake_up_common() expects wq_head->lock to be taken, and it
seems it is not a good idea to leave the wq naked on the wake up path when
several CPUs can enter the wake function.  default_wake_function() is
protected by a spinlock inside try_to_wake_up(), but, for example,
autoremove_wake_function() can't be called concurrently for the same wq
(it removes the wq entry from the list).  Also, in the case of bookmarks
__wake_up_common() adds an entry to the list, thus it can't be called
without any locks.

I understand your concern and you are right saying that the read side sees
the wq entries as stable, but that works only as long as __wake_up_common()
does not modify anything, which seems true now, but of course it is too
scary to rely on that in the future.
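
For reference, autoremove_wake_function() is basically this (paraphrased
from kernel/sched/wait.c), and the list_del_init() is exactly the part
that needs the wait queue lock to be held:

	int autoremove_wake_function(struct wait_queue_entry *wq_entry,
				     unsigned mode, int sync, void *key)
	{
		int ret = default_wake_function(wq_entry, mode, sync, key);

		/* modifies the wait queue itself -> caller must hold wq lock */
		if (ret)
			list_del_init(&wq_entry->entry);
		return ret;
	}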

> 
>>  	}
>>  	if (waitqueue_active(&ep->poll_wait))
>>  		pwake++;
>> 
>>  out_unlock:
>> -	spin_unlock_irqrestore(&ep->wq.lock, flags);
>> +	read_unlock_irqrestore(&ep->lock, flags);
>> 
>>  	/* We have to call this outside the lock */
>>  	if (pwake)
>> @@ -1489,7 +1520,7 @@ static int ep_insert(struct eventpoll *ep, const 
>> struct epoll_event *event,
>>  		goto error_remove_epi;
>> 
>>  	/* We have to drop the new item inside our item list to keep track 
>> of it */
>> -	spin_lock_irq(&ep->wq.lock);
>> +	write_lock_irq(&ep->lock);
>> 
>>  	/* record NAPI ID of new item if present */
>>  	ep_set_busy_poll_napi_id(epi);
>> @@ -1501,12 +1532,12 @@ static int ep_insert(struct eventpoll *ep, 
>> const struct epoll_event *event,
>> 
>>  		/* Notify waiting tasks that events are available */
>>  		if (waitqueue_active(&ep->wq))
>> -			wake_up_locked(&ep->wq);
>> +			wake_up(&ep->wq);
> 
> is this necessary to switch as well? Is this to make lockdep happy?
> Looks like there are few more conversions too below...

Yes, it's necessary: wq.lock must be taken.

--
Roman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-04 19:02   ` Paul E. McKenney
@ 2018-12-05 11:22     ` Roman Penyaev
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2018-12-05 11:22 UTC (permalink / raw)
  To: paulmck
  Cc: Jason Baron, Alexander Viro, Linus Torvalds, linux-fsdevel, linux-kernel

On 2018-12-04 20:02, Paul E. McKenney wrote:
> On Tue, Dec 04, 2018 at 12:23:08PM -0500, Jason Baron wrote:
>> 
>> 
>> On 12/3/18 6:02 AM, Roman Penyaev wrote:
>> > Hi all,
>> >
>> > The goal of this patch is to reduce contention of ep_poll_callback() which
>> > can be called concurrently from different CPUs in case of high events
>> > rates and many fds per epoll.  Problem can be very well reproduced by
>> > generating events (write to pipe or eventfd) from many threads, while
>> > consumer thread does polling.  In other words this patch increases the
>> > bandwidth of events which can be delivered from sources to the poller by
>> > adding poll items in a lockless way to the list.
>> >
>> > The main change is in replacement of the spinlock with a rwlock, which is
>> > taken on read in ep_poll_callback(), and then by adding poll items to the
>> > tail of the list using xchg atomic instruction.  Write lock is taken
>> > everywhere else in order to stop list modifications and guarantee that list
>> > updates are fully completed (I assume that write side of a rwlock does not
>> > starve, it seems qrwlock implementation has these guarantees).
>> >
>> > The following are some microbenchmark results based on the test [1] which
>> > starts threads which generate N events each.  The test ends when all
>> > events are successfully fetched by the poller thread:
>> >
>> > spinlock
>> > ========
>> >
>> > threads       run time    events per ms
>> > -------       ---------   -------------
>> >       8         13191ms         6064/ms
>> >      16         30758ms         5201/ms
>> >      32         44315ms         7220/ms
>> >
>> > rwlock + xchg
>> > =============
>> >
>> > threads       run time    events per ms
>> > -------       ---------   -------------
>> >       8          8581ms         9323/ms
>> >      16         13800ms        11594/ms
>> >      32         24167ms        13240/ms
>> >
>> > According to the results bandwidth of delivered events is significantly
>> > increased, thus execution time is reduced.
>> >
>> > This is RFC because I did not run any benchmarks comparing current
>> > qrwlock and spinlock implementations (4.19 kernel), although I did
>> > not notice any epoll performance degradations in other benchmarks.
>> >
>> > Also I'm not quite sure where to put very special lockless variant
>> > of adding element to the list (list_add_tail_lockless() in this
>> > patch).  Seems keeping it locally is safer.
>> >
>> > [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
>> >
>> > Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
>> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> > Cc: linux-fsdevel@vger.kernel.org
>> > Cc: linux-kernel@vger.kernel.org
>> > ---
>> >  fs/eventpoll.c | 107 +++++++++++++++++++++++++++++++------------------
>> >  1 file changed, 69 insertions(+), 38 deletions(-)
>> >
>> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> > index 42bbe6824b4b..89debda47aca 100644
>> > --- a/fs/eventpoll.c
>> > +++ b/fs/eventpoll.c
>> > @@ -50,10 +50,10 @@
>> >   *
>> >   * 1) epmutex (mutex)
>> >   * 2) ep->mtx (mutex)
>> > - * 3) ep->wq.lock (spinlock)
>> > + * 3) ep->lock (rwlock)
>> >   *
>> >   * The acquire order is the one listed above, from 1 to 3.
>> > - * We need a spinlock (ep->wq.lock) because we manipulate objects
>> > + * We need a rwlock (ep->lock) because we manipulate objects
>> >   * from inside the poll callback, that might be triggered from
>> >   * a wake_up() that in turn might be called from IRQ context.
>> >   * So we can't sleep inside the poll callback and hence we need
>> > @@ -85,7 +85,7 @@
>> >   * of epoll file descriptors, we use the current recursion depth as
>> >   * the lockdep subkey.
>> >   * It is possible to drop the "ep->mtx" and to use the global
>> > - * mutex "epmutex" (together with "ep->wq.lock") to have it working,
>> > + * mutex "epmutex" (together with "ep->lock") to have it working,
>> >   * but having "ep->mtx" will make the interface more scalable.
>> >   * Events that require holding "epmutex" are very rare, while for
>> >   * normal operations the epoll private "ep->mtx" will guarantee
>> > @@ -182,8 +182,6 @@ struct epitem {
>> >   * This structure is stored inside the "private_data" member of the file
>> >   * structure and represents the main data structure for the eventpoll
>> >   * interface.
>> > - *
>> > - * Access to it is protected by the lock inside wq.
>> >   */
>> >  struct eventpoll {
>> >  	/*
>> > @@ -203,13 +201,16 @@ struct eventpoll {
>> >  	/* List of ready file descriptors */
>> >  	struct list_head rdllist;
>> >
>> > +	/* Lock which protects rdllist and ovflist */
>> > +	rwlock_t lock;
>> > +
>> >  	/* RB tree root used to store monitored fd structs */
>> >  	struct rb_root_cached rbr;
>> >
>> >  	/*
>> >  	 * This is a single linked list that chains all the "struct epitem" that
>> >  	 * happened while transferring ready events to userspace w/out
>> > -	 * holding ->wq.lock.
>> > +	 * holding ->lock.
>> >  	 */
>> >  	struct epitem *ovflist;
>> >
>> > @@ -697,17 +698,17 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> >  	 * because we want the "sproc" callback to be able to do it
>> >  	 * in a lockless way.
>> >  	 */
>> > -	spin_lock_irq(&ep->wq.lock);
>> > +	write_lock_irq(&ep->lock);
>> >  	list_splice_init(&ep->rdllist, &txlist);
>> >  	ep->ovflist = NULL;
>> > -	spin_unlock_irq(&ep->wq.lock);
>> > +	write_unlock_irq(&ep->lock);
>> >
>> >  	/*
>> >  	 * Now call the callback function.
>> >  	 */
>> >  	res = (*sproc)(ep, &txlist, priv);
>> >
>> > -	spin_lock_irq(&ep->wq.lock);
>> > +	write_lock_irq(&ep->lock);
>> >  	/*
>> >  	 * During the time we spent inside the "sproc" callback, some
>> >  	 * other events might have been queued by the poll callback.
>> > @@ -722,7 +723,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> >  		 * contain them, and the list_splice() below takes care of them.
>> >  		 */
>> >  		if (!ep_is_linked(epi)) {
>> > -			list_add_tail(&epi->rdllink, &ep->rdllist);
>> > +			/* Reverse ->ovflist, events should be in FIFO */
>> > +			list_add(&epi->rdllink, &ep->rdllist);
>> >  			ep_pm_stay_awake(epi);
>> >  		}
>> >  	}
>> > @@ -745,11 +747,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>> >  		 * the ->poll() wait list (delayed after we release the lock).
>> >  		 */
>> >  		if (waitqueue_active(&ep->wq))
>> > -			wake_up_locked(&ep->wq);
>> > +			wake_up(&ep->wq);
>> >  		if (waitqueue_active(&ep->poll_wait))
>> >  			pwake++;
>> >  	}
>> > -	spin_unlock_irq(&ep->wq.lock);
>> > +	write_unlock_irq(&ep->lock);
>> >
>> >  	if (!ep_locked)
>> >  		mutex_unlock(&ep->mtx);
>> > @@ -789,10 +791,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>> >
>> >  	rb_erase_cached(&epi->rbn, &ep->rbr);
>> >
>> > -	spin_lock_irq(&ep->wq.lock);
>> > +	write_lock_irq(&ep->lock);
>> >  	if (ep_is_linked(epi))
>> >  		list_del_init(&epi->rdllink);
>> > -	spin_unlock_irq(&ep->wq.lock);
>> > +	write_unlock_irq(&ep->lock);
>> >
>> >  	wakeup_source_unregister(ep_wakeup_source(epi));
>> >  	/*
>> > @@ -842,7 +844,7 @@ static void ep_free(struct eventpoll *ep)
>> >  	 * Walks through the whole tree by freeing each "struct epitem". At this
>> >  	 * point we are sure no poll callbacks will be lingering around, and also by
>> >  	 * holding "epmutex" we can be sure that no file cleanup code will hit
>> > -	 * us during this operation. So we can avoid the lock on "ep->wq.lock".
>> > +	 * us during this operation. So we can avoid the lock on "ep->lock".
>> >  	 * We do not need to lock ep->mtx, either, we only do it to prevent
>> >  	 * a lockdep warning.
>> >  	 */
>> > @@ -1023,6 +1025,7 @@ static int ep_alloc(struct eventpoll **pep)
>> >  		goto free_uid;
>> >
>> >  	mutex_init(&ep->mtx);
>> > +	rwlock_init(&ep->lock);
>> >  	init_waitqueue_head(&ep->wq);
>> >  	init_waitqueue_head(&ep->poll_wait);
>> >  	INIT_LIST_HEAD(&ep->rdllist);
>> > @@ -1112,10 +1115,38 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
>> >  }
>> >  #endif /* CONFIG_CHECKPOINT_RESTORE */
>> >
>> > +/*
>> > + * Adds a new entry to the tail of the list in a lockless way, i.e.
>> > + * multiple CPUs are allowed to call this function concurrently.
>> > + *
>> > + * Beware: it is necessary to prevent any other modifications of the
>> > + *         existing list until all changes are completed, in other words
>> > + *         concurrent list_add_tail_lockless() calls should be protected
>> > + *         with a read lock, where write lock acts as a barrier which
>> > + *         makes sure all list_add_tail_lockless() calls are fully
>> > + *         completed.
>> > + */
>> > +static inline void list_add_tail_lockless(struct list_head *new,
>> > +					  struct list_head *head)
>> > +{
>> > +	struct list_head *prev;
>> > +
>> > +	new->next = head;
>> > +	prev = xchg(&head->prev, new);
>> > +	prev->next = new;
>> > +	new->prev = prev;
>> > +}
>> > +
>> >  /*
>> >   * This is the callback that is passed to the wait queue wakeup
>> >   * mechanism. It is called by the stored file descriptors when they
>> >   * have events to report.
>> > + *
>> > + * This callback takes a read lock in order not to content with concurrent
>> > + * events from another file descriptors, thus all modifications to ->rdllist
>> > + * or ->ovflist are lockless.  Read lock is paired with the write lock from
>> > + * ep_scan_ready_list(), which stops all list modifications and guarantees
>> > + * that lists won't be corrupted.
>> >   */
>> >  static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>> >  {
>> > @@ -1126,7 +1157,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>> >  	__poll_t pollflags = key_to_poll(key);
>> >  	int ewake = 0;
>> >
>> > -	spin_lock_irqsave(&ep->wq.lock, flags);
>> > +	read_lock_irqsave(&ep->lock, flags);
>> >
>> >  	ep_set_busy_poll_napi_id(epi);
>> >
>> > @@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>> >  	 */
>> >  	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
>> >  		if (epi->next == EP_UNACTIVE_PTR) {
>> > -			epi->next = ep->ovflist;
>> > -			ep->ovflist = epi;
>> > +			/* Atomically exchange tail */
>> > +			epi->next = xchg(&ep->ovflist, epi);
>> 
>> This also relies on the fact that the same epi can't be added to the
>> list in parallel, because the wait queue doing the wakeup will have 
>> the
>> wait_queue_head lock. That was a little confusing for me b/c we only 
>> had
>> the read_lock() above.
> 
> I missed this as well.
> 
> I was also concerned about "fly-by" wakeups where the to-be-awoken task
> never really goes to sleep, but it looks like tasks are unconditionally
> queued, which seems like it should avoid that problem.
> 
> Please do some testing with artificial delays in the lockless queuing
> code, for example, via udelay() or similar.  If there are races, this
> would help force them to happen.

Yep, that is what I am doing right now: experimenting with different
polling scenarios and stressing with random delays.  Thanks.
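
For the record, the delay injection itself is a trivial hack on top of
list_add_tail_lockless(), something like the sketch below; the knob and
the delay value are made up, just to stretch the window between
publishing the new tail and linking it (needs <linux/delay.h> and
<linux/random.h>):

static inline void list_add_tail_lockless(struct list_head *new,
					  struct list_head *head)
{
	struct list_head *prev;

	new->next = head;
	prev = xchg(&head->prev, new);
#ifdef EP_STRESS_LOCKLESS_ADD	/* made-up knob, testing only */
	/* tail is already published but not linked yet: stretch the window */
	udelay(prandom_u32() % 10);
#endif
	prev->next = new;
	new->prev = prev;
}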

--
Roman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-04 23:59     ` Andrea Parri
@ 2018-12-05 11:25       ` Roman Penyaev
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2018-12-05 11:25 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Linus Torvalds, Al Viro, Paul McKenney, linux-fsdevel,
	Linux List Kernel Mailing

On 2018-12-05 00:59, Andrea Parri wrote:
> Hi Roman,
> 
> On Tue, Dec 04, 2018 at 12:50:58PM +0100, Roman Penyaev wrote:
>> On 2018-12-03 18:34, Linus Torvalds wrote:
> 
>> > This also ends up making the memory ordering of "xchg()" very very
>> > important. Yes, we've documented it as being an ordering op, but I'm
>> > not sure we've relied on it this directly before.
>> 
>> Seems exit_mm() does exactly the same, the following chunk:
>> 
>> 		up_read(&mm->mmap_sem);
>> 
>> 		self.task = current;
>> 		self.next = xchg(&core_state->dumper.next, &self);
>> 
>> 
>> At least code pattern looks similar.
> 
> Maybe add a comment on top of (your) xchg() to note/justify these 
> memory
> ordering requirements?  As Paul said: "if there are races, this would
> help force them to happen" (and simplify the review, this/future).

Hi Andrea,

Sure, this path is tricky, so I will cook something up.

--
Roman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-05 11:16   ` Roman Penyaev
@ 2018-12-05 16:38     ` Jason Baron
  2018-12-05 20:11       ` Roman Penyaev
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Baron @ 2018-12-05 16:38 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Alexander Viro, Paul E. McKenney, Linus Torvalds, linux-fsdevel,
	linux-kernel



On 12/5/18 6:16 AM, Roman Penyaev wrote:
> On 2018-12-04 18:23, Jason Baron wrote:
>> On 12/3/18 6:02 AM, Roman Penyaev wrote:
> 
> [...]
> 
>>>
>>>      ep_set_busy_poll_napi_id(epi);
>>>
>>> @@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>>       */
>>>      if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
>>>          if (epi->next == EP_UNACTIVE_PTR) {
>>> -            epi->next = ep->ovflist;
>>> -            ep->ovflist = epi;
>>> +            /* Atomically exchange tail */
>>> +            epi->next = xchg(&ep->ovflist, epi);
>>
>> This also relies on the fact that the same epi can't be added to the
>> list in parallel, because the wait queue doing the wakeup will have the
>> wait_queue_head lock. That was a little confusing for me b/c we only had
>> the read_lock() above.
> 
> Yes, that is indeed not obvious path, but wq is locked by wake_up_*_poll()
> call or caller of wake_up_locked_poll() has to be sure wq.lock is taken.
> 
> I'll add an explicit comment here, thanks for pointing out.
> 
>>
>>>              if (epi->ws) {
>>>                  /*
>>>                   * Activate ep->ws since epi->ws may get
>>> @@ -1172,7 +1203,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>>
>>>      /* If this file is already in the ready list we exit soon */
>>>      if (!ep_is_linked(epi)) {
>>> -        list_add_tail(&epi->rdllink, &ep->rdllist);
>>> +        list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
>>>          ep_pm_stay_awake_rcu(epi);
>>>      }
>>
>> same for this.
> 
> ... and an explicit comment here.
> 
>>
>>>
>>> @@ -1197,13 +1228,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>>                  break;
>>>              }
>>>          }
>>> -        wake_up_locked(&ep->wq);
>>> +        wake_up(&ep->wq);
>>
>> why the switch here to the locked() variant? Shouldn't the 'reader'
>> side, in this case which takes the rwlock for write see all updates in a
>> coherent state at this point?
> 
> lockdep inside __wake_up_common expects wq_head->lock is taken, and
> seems this is not a good idea to leave wq naked on wake up path,
> when several CPUs can enter wake function.  Although default_wake_function
> is protected by spinlock inside try_to_wake_up(), but for example
> autoremove_wake_function() can't be called concurrently for the same wq
> (it removes wq entry from the list).  Also in case of bookmarks
> __wake_up_common adds an entry to the list, thus can't be called without
> any locks.
> 
> I understand you concern and you are right saying that read side sees
> wq entries as stable, but that will work only if __wake_up_common does
> not modify anything, that is seems true now, but of course it is
> too scary to rely on that in the future.

I think it might be interesting, at least for testing, to see if not
grabbing wq.lock improves your benchmarks any further.  FWIW, epoll only
recently started grabbing wq.lock because lockdep required it.

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-05 16:38     ` Jason Baron
@ 2018-12-05 20:11       ` Roman Penyaev
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2018-12-05 20:11 UTC (permalink / raw)
  To: Jason Baron
  Cc: Alexander Viro, Paul E. McKenney, Linus Torvalds, linux-fsdevel,
	linux-kernel

On 2018-12-05 17:38, Jason Baron wrote:
> 
> I think it might be interesting for, at least testing, to see if not 
> grabbing
> wq.lock improves your benchmarks any further? fwiw, epoll only recently 
> started
> grabbing wq.lock bc lockdep required it.

That's easy!  I've just tested with the following hunk applied on top of
my patch:

+++ b/fs/eventpoll.c
@@ -1228,7 +1228,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
                                 break;
                         }
                 }
-               wake_up(&ep->wq);
+               wake_up_locked(&ep->wq);
         }

Run time:

threads   w/ wq.lock   w/o wq.lock
-------   ----------   -----------
      8        8581ms        8602ms
     16       13800ms       13715ms
     32       24167ms       23817ms

No big difference.  According to perf, the contention is on the read
lock and on try_to_wake_up()'s p->pi_lock, which serializes access
exactly like the removed wq.lock did.


-   24.41%     5.39%  a.out    [kernel.kallsyms]   [k] ep_poll_callback
    - 19.02% ep_poll_callback
       + 11.88% _raw_read_lock_irqsave
       + 5.74% _raw_read_unlock_irqrestore
       - 1.39% __wake_up_common
          - 1.22% try_to_wake_up
             + 0.98% _raw_spin_lock_irqsave


--
Roman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-03 11:02 [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention Roman Penyaev
  2018-12-03 17:34 ` Linus Torvalds
  2018-12-04 17:23 ` Jason Baron
@ 2018-12-05 23:46 ` Eric Wong
  2018-12-06 10:52   ` Roman Penyaev
  2 siblings, 1 reply; 19+ messages in thread
From: Eric Wong @ 2018-12-05 23:46 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Alexander Viro, Paul E. McKenney, Linus Torvalds, linux-fsdevel,
	linux-kernel, Mathieu Desnoyers

Roman Penyaev <rpenyaev@suse.de> wrote:
> Hi all,
> 
> The goal of this patch is to reduce contention of ep_poll_callback() which
> can be called concurrently from different CPUs in case of high events
> rates and many fds per epoll.  Problem can be very well reproduced by
> generating events (write to pipe or eventfd) from many threads, while
> consumer thread does polling.  In other words this patch increases the
> bandwidth of events which can be delivered from sources to the poller by
> adding poll items in a lockless way to the list.

Hi Roman,

I also tried to solve this problem many years ago with the help of
the well-tested-in-userspace wfcqueue from Mathieu's URCU.

I was also looking to solve contention with parallel epoll_wait
callers with this.  AFAIK it worked well, but it needed the userspace
tests from wfcqueue ported over to the kernel and more review.
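
The enqueue at the heart of wfcqueue is conceptually very close to the
xchg() trick in your patch.  A stripped-down sketch of the idea (not the
actual liburcu interface; it assumes the queue always contains a dummy
node, so the tail pointer is never NULL):

struct wfcq_node {
	struct wfcq_node *next;
};

/* publish the node as the new tail, then link the old tail to it */
static inline void wfcq_enqueue_sketch(struct wfcq_node **tail,
				       struct wfcq_node *node)
{
	struct wfcq_node *old_tail;

	node->next = NULL;
	old_tail = xchg(tail, node);	/* fully ordered publication */
	old_tail->next = node;		/* only this CPU owns old_tail now */
}

Between the xchg() and the ->next store the chain is momentarily split,
which is why wfcqueue dequeuers must be prepared to wait for ->next to
become non-NULL.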

I didn't have enough computing power to show the real-world
benefits or funding to continue:

	https://lore.kernel.org/lkml/?q=wfcqueue+d:..20130501

It might not be too much trouble for you to brush up the wait-free
patches and test them against the rwlock implementation.

(I only noticed this thread since I was catching up on some
 public-inbox work :>)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-04 17:23 ` Jason Baron
  2018-12-04 19:02   ` Paul E. McKenney
  2018-12-05 11:16   ` Roman Penyaev
@ 2018-12-06  1:54   ` Davidlohr Bueso
  2018-12-06  3:08   ` Davidlohr Bueso
  2018-12-06  4:04   ` Davidlohr Bueso
  4 siblings, 0 replies; 19+ messages in thread
From: Davidlohr Bueso @ 2018-12-06  1:54 UTC (permalink / raw)
  To: Jason Baron
  Cc: Roman Penyaev, Alexander Viro, Paul E. McKenney, Linus Torvalds,
	linux-fsdevel, linux-kernel, akpm

+ akpm.

Also, there are some epoll patches queued for -next, and as such
this patch does not apply against linux-next.

Thanks,
Davidlohr

On Tue, 04 Dec 2018, Jason Baron wrote:
>On 12/3/18 6:02 AM, Roman Penyaev wrote:
>> Hi all,
>>
>> The goal of this patch is to reduce contention of ep_poll_callback() which
>> can be called concurrently from different CPUs in case of high events
>> rates and many fds per epoll.  Problem can be very well reproduced by
>> generating events (write to pipe or eventfd) from many threads, while
>> consumer thread does polling.  In other words this patch increases the
>> bandwidth of events which can be delivered from sources to the poller by
>> adding poll items in a lockless way to the list.
>>
>> The main change is in replacement of the spinlock with a rwlock, which is
>> taken on read in ep_poll_callback(), and then by adding poll items to the
>> tail of the list using xchg atomic instruction.  Write lock is taken
>> everywhere else in order to stop list modifications and guarantee that list
>> updates are fully completed (I assume that write side of a rwlock does not
>> starve, it seems qrwlock implementation has these guarantees).
>>
>> The following are some microbenchmark results based on the test [1] which
>> starts threads which generate N events each.  The test ends when all
>> events are successfully fetched by the poller thread:
>>
>> spinlock
>> ========
>>
>> threads       run time    events per ms
>> -------       ---------   -------------
>>       8         13191ms         6064/ms
>>      16         30758ms         5201/ms
>>      32         44315ms         7220/ms
>>
>> rwlock + xchg
>> =============
>>
>> threads       run time    events per ms
>> -------       ---------   -------------
>>       8          8581ms         9323/ms
>>      16         13800ms        11594/ms
>>      32         24167ms        13240/ms
>>
>> According to the results bandwidth of delivered events is significantly
>> increased, thus execution time is reduced.
>>
>> This is RFC because I did not run any benchmarks comparing current
>> qrwlock and spinlock implementations (4.19 kernel), although I did
>> not notice any epoll performance degradations in other benchmarks.
>>
>> Also I'm not quite sure where to put very special lockless variant
>> of adding element to the list (list_add_tail_lockless() in this
>> patch).  Seems keeping it locally is safer.
>>
>> [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
>>
>> Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  fs/eventpoll.c | 107 +++++++++++++++++++++++++++++++------------------
>>  1 file changed, 69 insertions(+), 38 deletions(-)
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index 42bbe6824b4b..89debda47aca 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -50,10 +50,10 @@
>>   *
>>   * 1) epmutex (mutex)
>>   * 2) ep->mtx (mutex)
>> - * 3) ep->wq.lock (spinlock)
>> + * 3) ep->lock (rwlock)
>>   *
>>   * The acquire order is the one listed above, from 1 to 3.
>> - * We need a spinlock (ep->wq.lock) because we manipulate objects
>> + * We need a rwlock (ep->lock) because we manipulate objects
>>   * from inside the poll callback, that might be triggered from
>>   * a wake_up() that in turn might be called from IRQ context.
>>   * So we can't sleep inside the poll callback and hence we need
>> @@ -85,7 +85,7 @@
>>   * of epoll file descriptors, we use the current recursion depth as
>>   * the lockdep subkey.
>>   * It is possible to drop the "ep->mtx" and to use the global
>> - * mutex "epmutex" (together with "ep->wq.lock") to have it working,
>> + * mutex "epmutex" (together with "ep->lock") to have it working,
>>   * but having "ep->mtx" will make the interface more scalable.
>>   * Events that require holding "epmutex" are very rare, while for
>>   * normal operations the epoll private "ep->mtx" will guarantee
>> @@ -182,8 +182,6 @@ struct epitem {
>>   * This structure is stored inside the "private_data" member of the file
>>   * structure and represents the main data structure for the eventpoll
>>   * interface.
>> - *
>> - * Access to it is protected by the lock inside wq.
>>   */
>>  struct eventpoll {
>>  	/*
>> @@ -203,13 +201,16 @@ struct eventpoll {
>>  	/* List of ready file descriptors */
>>  	struct list_head rdllist;
>>
>> +	/* Lock which protects rdllist and ovflist */
>> +	rwlock_t lock;
>> +
>>  	/* RB tree root used to store monitored fd structs */
>>  	struct rb_root_cached rbr;
>>
>>  	/*
>>  	 * This is a single linked list that chains all the "struct epitem" that
>>  	 * happened while transferring ready events to userspace w/out
>> -	 * holding ->wq.lock.
>> +	 * holding ->lock.
>>  	 */
>>  	struct epitem *ovflist;
>>
>> @@ -697,17 +698,17 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>  	 * because we want the "sproc" callback to be able to do it
>>  	 * in a lockless way.
>>  	 */
>> -	spin_lock_irq(&ep->wq.lock);
>> +	write_lock_irq(&ep->lock);
>>  	list_splice_init(&ep->rdllist, &txlist);
>>  	ep->ovflist = NULL;
>> -	spin_unlock_irq(&ep->wq.lock);
>> +	write_unlock_irq(&ep->lock);
>>
>>  	/*
>>  	 * Now call the callback function.
>>  	 */
>>  	res = (*sproc)(ep, &txlist, priv);
>>
>> -	spin_lock_irq(&ep->wq.lock);
>> +	write_lock_irq(&ep->lock);
>>  	/*
>>  	 * During the time we spent inside the "sproc" callback, some
>>  	 * other events might have been queued by the poll callback.
>> @@ -722,7 +723,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>  		 * contain them, and the list_splice() below takes care of them.
>>  		 */
>>  		if (!ep_is_linked(epi)) {
>> -			list_add_tail(&epi->rdllink, &ep->rdllist);
>> +			/* Reverse ->ovflist, events should be in FIFO */
>> +			list_add(&epi->rdllink, &ep->rdllist);
>>  			ep_pm_stay_awake(epi);
>>  		}
>>  	}
>> @@ -745,11 +747,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>  		 * the ->poll() wait list (delayed after we release the lock).
>>  		 */
>>  		if (waitqueue_active(&ep->wq))
>> -			wake_up_locked(&ep->wq);
>> +			wake_up(&ep->wq);
>>  		if (waitqueue_active(&ep->poll_wait))
>>  			pwake++;
>>  	}
>> -	spin_unlock_irq(&ep->wq.lock);
>> +	write_unlock_irq(&ep->lock);
>>
>>  	if (!ep_locked)
>>  		mutex_unlock(&ep->mtx);
>> @@ -789,10 +791,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>>
>>  	rb_erase_cached(&epi->rbn, &ep->rbr);
>>
>> -	spin_lock_irq(&ep->wq.lock);
>> +	write_lock_irq(&ep->lock);
>>  	if (ep_is_linked(epi))
>>  		list_del_init(&epi->rdllink);
>> -	spin_unlock_irq(&ep->wq.lock);
>> +	write_unlock_irq(&ep->lock);
>>
>>  	wakeup_source_unregister(ep_wakeup_source(epi));
>>  	/*
>> @@ -842,7 +844,7 @@ static void ep_free(struct eventpoll *ep)
>>  	 * Walks through the whole tree by freeing each "struct epitem". At this
>>  	 * point we are sure no poll callbacks will be lingering around, and also by
>>  	 * holding "epmutex" we can be sure that no file cleanup code will hit
>> -	 * us during this operation. So we can avoid the lock on "ep->wq.lock".
>> +	 * us during this operation. So we can avoid the lock on "ep->lock".
>>  	 * We do not need to lock ep->mtx, either, we only do it to prevent
>>  	 * a lockdep warning.
>>  	 */
>> @@ -1023,6 +1025,7 @@ static int ep_alloc(struct eventpoll **pep)
>>  		goto free_uid;
>>
>>  	mutex_init(&ep->mtx);
>> +	rwlock_init(&ep->lock);
>>  	init_waitqueue_head(&ep->wq);
>>  	init_waitqueue_head(&ep->poll_wait);
>>  	INIT_LIST_HEAD(&ep->rdllist);
>> @@ -1112,10 +1115,38 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
>>  }
>>  #endif /* CONFIG_CHECKPOINT_RESTORE */
>>
>> +/*
>> + * Adds a new entry to the tail of the list in a lockless way, i.e.
>> + * multiple CPUs are allowed to call this function concurrently.
>> + *
>> + * Beware: it is necessary to prevent any other modifications of the
>> + *         existing list until all changes are completed, in other words
>> + *         concurrent list_add_tail_lockless() calls should be protected
>> + *         with a read lock, where write lock acts as a barrier which
>> + *         makes sure all list_add_tail_lockless() calls are fully
>> + *         completed.
>> + */
>> +static inline void list_add_tail_lockless(struct list_head *new,
>> +					  struct list_head *head)
>> +{
>> +	struct list_head *prev;
>> +
>> +	new->next = head;
>> +	prev = xchg(&head->prev, new);
>> +	prev->next = new;
>> +	new->prev = prev;
>> +}
>> +
>>  /*
>>   * This is the callback that is passed to the wait queue wakeup
>>   * mechanism. It is called by the stored file descriptors when they
>>   * have events to report.
>> + *
>> + * This callback takes a read lock in order not to content with concurrent
>> + * events from another file descriptors, thus all modifications to ->rdllist
>> + * or ->ovflist are lockless.  Read lock is paired with the write lock from
>> + * ep_scan_ready_list(), which stops all list modifications and guarantees
>> + * that lists won't be corrupted.
>>   */
>>  static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>>  {
>> @@ -1126,7 +1157,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>  	__poll_t pollflags = key_to_poll(key);
>>  	int ewake = 0;
>>
>> -	spin_lock_irqsave(&ep->wq.lock, flags);
>> +	read_lock_irqsave(&ep->lock, flags);
>>
>>  	ep_set_busy_poll_napi_id(epi);
>>
>> @@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>  	 */
>>  	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
>>  		if (epi->next == EP_UNACTIVE_PTR) {
>> -			epi->next = ep->ovflist;
>> -			ep->ovflist = epi;
>> +			/* Atomically exchange tail */
>> +			epi->next = xchg(&ep->ovflist, epi);
>
>This also relies on the fact that the same epi can't be added to the
>list in parallel, because the wait queue doing the wakeup will have the
>wait_queue_head lock. That was a little confusing for me b/c we only had
>the read_lock() above.
>
>>  			if (epi->ws) {
>>  				/*
>>  				 * Activate ep->ws since epi->ws may get
>> @@ -1172,7 +1203,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>
>>  	/* If this file is already in the ready list we exit soon */
>>  	if (!ep_is_linked(epi)) {
>> -		list_add_tail(&epi->rdllink, &ep->rdllist);
>> +		list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
>>  		ep_pm_stay_awake_rcu(epi);
>>  	}
>
>same for this.
>
>>
>> @@ -1197,13 +1228,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>  				break;
>>  			}
>>  		}
>> -		wake_up_locked(&ep->wq);
>> +		wake_up(&ep->wq);
>
>why the switch here to the locked() variant? Shouldn't the 'reader'
>side, in this case which takes the rwlock for write see all updates in a
>coherent state at this point?
>
>>  	}
>>  	if (waitqueue_active(&ep->poll_wait))
>>  		pwake++;
>>
>>  out_unlock:
>> -	spin_unlock_irqrestore(&ep->wq.lock, flags);
>> +	read_unlock_irqrestore(&ep->lock, flags);
>>
>>  	/* We have to call this outside the lock */
>>  	if (pwake)
>> @@ -1489,7 +1520,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>>  		goto error_remove_epi;
>>
>>  	/* We have to drop the new item inside our item list to keep track of it */
>> -	spin_lock_irq(&ep->wq.lock);
>> +	write_lock_irq(&ep->lock);
>>
>>  	/* record NAPI ID of new item if present */
>>  	ep_set_busy_poll_napi_id(epi);
>> @@ -1501,12 +1532,12 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>>
>>  		/* Notify waiting tasks that events are available */
>>  		if (waitqueue_active(&ep->wq))
>> -			wake_up_locked(&ep->wq);
>> +			wake_up(&ep->wq);
>
>is this necessary to switch as well? Is this to make lockdep happy?
>Looks like there are few more conversions too below...
>
>Thanks,
>
>-Jason
>
>
>
>>  		if (waitqueue_active(&ep->poll_wait))
>>  			pwake++;
>>  	}
>>
>> -	spin_unlock_irq(&ep->wq.lock);
>> +	write_unlock_irq(&ep->lock);
>>
>>  	atomic_long_inc(&ep->user->epoll_watches);
>>
>> @@ -1532,10 +1563,10 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>>  	 * list, since that is used/cleaned only inside a section bound by "mtx".
>>  	 * And ep_insert() is called with "mtx" held.
>>  	 */
>> -	spin_lock_irq(&ep->wq.lock);
>> +	write_lock_irq(&ep->lock);
>>  	if (ep_is_linked(epi))
>>  		list_del_init(&epi->rdllink);
>> -	spin_unlock_irq(&ep->wq.lock);
>> +	write_unlock_irq(&ep->lock);
>>
>>  	wakeup_source_unregister(ep_wakeup_source(epi));
>>
>> @@ -1579,9 +1610,9 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
>>  	 * 1) Flush epi changes above to other CPUs.  This ensures
>>  	 *    we do not miss events from ep_poll_callback if an
>>  	 *    event occurs immediately after we call f_op->poll().
>> -	 *    We need this because we did not take ep->wq.lock while
>> +	 *    We need this because we did not take ep->lock while
>>  	 *    changing epi above (but ep_poll_callback does take
>> -	 *    ep->wq.lock).
>> +	 *    ep->lock).
>>  	 *
>>  	 * 2) We also need to ensure we do not miss _past_ events
>>  	 *    when calling f_op->poll().  This barrier also
>> @@ -1600,18 +1631,18 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
>>  	 * list, push it inside.
>>  	 */
>>  	if (ep_item_poll(epi, &pt, 1)) {
>> -		spin_lock_irq(&ep->wq.lock);
>> +		write_lock_irq(&ep->lock);
>>  		if (!ep_is_linked(epi)) {
>>  			list_add_tail(&epi->rdllink, &ep->rdllist);
>>  			ep_pm_stay_awake(epi);
>>
>>  			/* Notify waiting tasks that events are available */
>>  			if (waitqueue_active(&ep->wq))
>> -				wake_up_locked(&ep->wq);
>> +				wake_up(&ep->wq);
>>  			if (waitqueue_active(&ep->poll_wait))
>>  				pwake++;
>>  		}
>> -		spin_unlock_irq(&ep->wq.lock);
>> +		write_unlock_irq(&ep->lock);
>>  	}
>>
>>  	/* We have to call this outside the lock */
>> @@ -1764,7 +1795,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>>  		 * caller specified a non blocking operation.
>>  		 */
>>  		timed_out = 1;
>> -		spin_lock_irq(&ep->wq.lock);
>> +		write_lock_irq(&ep->lock);
>>  		goto check_events;
>>  	}
>>
>> @@ -1773,7 +1804,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>>  	if (!ep_events_available(ep))
>>  		ep_busy_loop(ep, timed_out);
>>
>> -	spin_lock_irq(&ep->wq.lock);
>> +	write_lock_irq(&ep->lock);
>>
>>  	if (!ep_events_available(ep)) {
>>  		/*
>> @@ -1789,7 +1820,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>>  		 * ep_poll_callback() when events will become available.
>>  		 */
>>  		init_waitqueue_entry(&wait, current);
>> -		__add_wait_queue_exclusive(&ep->wq, &wait);
>> +		add_wait_queue_exclusive(&ep->wq, &wait);
>>
>>  		for (;;) {
>>  			/*
>> @@ -1815,21 +1846,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>>  				break;
>>  			}
>>
>> -			spin_unlock_irq(&ep->wq.lock);
>> +			write_unlock_irq(&ep->lock);
>>  			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
>>  				timed_out = 1;
>>
>> -			spin_lock_irq(&ep->wq.lock);
>> +			write_lock_irq(&ep->lock);
>>  		}
>>
>> -		__remove_wait_queue(&ep->wq, &wait);
>> +		remove_wait_queue(&ep->wq, &wait);
>>  		__set_current_state(TASK_RUNNING);
>>  	}
>>  check_events:
>>  	/* Is it worth to try to dig for events ? */
>>  	eavail = ep_events_available(ep);
>>
>> -	spin_unlock_irq(&ep->wq.lock);
>> +	write_unlock_irq(&ep->lock);
>>
>>  	/*
>>  	 * Try to transfer events to user space. In case we get 0 events and
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-04 17:23 ` Jason Baron
                     ` (2 preceding siblings ...)
  2018-12-06  1:54   ` Davidlohr Bueso
@ 2018-12-06  3:08   ` Davidlohr Bueso
  2018-12-06 10:27     ` Roman Penyaev
  2018-12-06  4:04   ` Davidlohr Bueso
  4 siblings, 1 reply; 19+ messages in thread
From: Davidlohr Bueso @ 2018-12-06  3:08 UTC (permalink / raw)
  To: Jason Baron
  Cc: Roman Penyaev, Alexander Viro, Paul E. McKenney, Linus Torvalds,
	linux-fsdevel, linux-kernel, akpm

>On 12/3/18 6:02 AM, Roman Penyaev wrote:

>>  		if (!ep_is_linked(epi)) {
>> -			list_add_tail(&epi->rdllink, &ep->rdllist);
>> +			/* Reverse ->ovflist, events should be in FIFO */
>> +			list_add(&epi->rdllink, &ep->rdllist);
>>  			ep_pm_stay_awake(epi);
>>  		}

This should probably be a separate patch as it fixes the ordering,
regardless of the rwlock+xchg optimization.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-04 17:23 ` Jason Baron
                     ` (3 preceding siblings ...)
  2018-12-06  3:08   ` Davidlohr Bueso
@ 2018-12-06  4:04   ` Davidlohr Bueso
  2018-12-06 10:25     ` Roman Penyaev
  4 siblings, 1 reply; 19+ messages in thread
From: Davidlohr Bueso @ 2018-12-06  4:04 UTC (permalink / raw)
  To: Jason Baron
  Cc: Roman Penyaev, Alexander Viro, Paul E. McKenney, Linus Torvalds,
	linux-fsdevel, linux-kernel, akpm

On 12/3/18 6:02 AM, Roman Penyaev wrote:

> The main change is in replacement of the spinlock with a rwlock, which is
> taken on read in ep_poll_callback(), and then by adding poll items to the
> tail of the list using xchg atomic instruction.  Write lock is taken
> everywhere else in order to stop list modifications and guarantee that list
> updates are fully completed (I assume that write side of a rwlock does not
> starve, it seems qrwlock implementation has these guarantees).

It's good then that Will recently ported qrwlocks to arm64, which IIRC
had a bad case of writer starvation.  In general, qrwlock will maintain
reader-to-writer ratios of acquisitions fairly well, but will favor
readers over writers in scenarios with too many tasks (more than ncpus).

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-06  4:04   ` Davidlohr Bueso
@ 2018-12-06 10:25     ` Roman Penyaev
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2018-12-06 10:25 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Jason Baron, Alexander Viro, Paul E. McKenney, Linus Torvalds,
	linux-fsdevel, linux-kernel, akpm

On 2018-12-06 05:04, Davidlohr Bueso wrote:
> On 12/3/18 6:02 AM, Roman Penyaev wrote:
> 
>> The main change is in replacement of the spinlock with a rwlock, which 
>> is
>> taken on read in ep_poll_callback(), and then by adding poll items to 
>> the
>> tail of the list using xchg atomic instruction.  Write lock is taken
>> everywhere else in order to stop list modifications and guarantee that 
>> list
>> updates are fully completed (I assume that write side of a rwlock does 
>> not
>> starve, it seems qrwlock implementation has these guarantees).
> 
> Its good then that Will recently ported qrwlocks to arm64, which iirc 
> had
> a bad case of writer starvation. In general, qrwlock will maintain 
> reader
> to writer ratios of acquisitions fairly well, but will favor readers 
> over
> writers in scenarios where when too many tasks (more than ncpus).

Thanks for noting that.  Then that should not be a problem, since the
number of parallel ep_poll_callback() calls can't be greater than the
number of CPUs because of the wq.lock taken by the caller of
ep_poll_callback().

BTW, has anyone estimated how much the latency on the write side
increases if the number of readers is greater than the number of CPUs?

--
Roman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-06  3:08   ` Davidlohr Bueso
@ 2018-12-06 10:27     ` Roman Penyaev
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2018-12-06 10:27 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Jason Baron, Alexander Viro, Paul E. McKenney, Linus Torvalds,
	linux-fsdevel, linux-kernel, akpm

On 2018-12-06 04:08, Davidlohr Bueso wrote:
>> On 12/3/18 6:02 AM, Roman Penyaev wrote:
> 
>>>  		if (!ep_is_linked(epi)) {
>>> -			list_add_tail(&epi->rdllink, &ep->rdllist);
>>> +			/* Reverse ->ovflist, events should be in FIFO */
>>> +			list_add(&epi->rdllink, &ep->rdllist);
>>>  			ep_pm_stay_awake(epi);
>>>  		}
> 
> This should probably a separate patch as it fixes the ordering,
> regardless of the rwlock+xchg optimization.

Yes, it should not be part of this RFC.  Thanks.

--
Roman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-05 23:46 ` Eric Wong
@ 2018-12-06 10:52   ` Roman Penyaev
  2018-12-06 20:35     ` Eric Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Roman Penyaev @ 2018-12-06 10:52 UTC (permalink / raw)
  To: Eric Wong
  Cc: Alexander Viro, Paul E. McKenney, Linus Torvalds, linux-fsdevel,
	linux-kernel, Mathieu Desnoyers

On 2018-12-06 00:46, Eric Wong wrote:
> Roman Penyaev <rpenyaev@suse.de> wrote:
>> Hi all,
>> 
>> The goal of this patch is to reduce contention of ep_poll_callback() 
>> which
>> can be called concurrently from different CPUs in case of high events
>> rates and many fds per epoll.  Problem can be very well reproduced by
>> generating events (write to pipe or eventfd) from many threads, while
>> consumer thread does polling.  In other words this patch increases the
>> bandwidth of events which can be delivered from sources to the poller 
>> by
>> adding poll items in a lockless way to the list.
> 
> Hi Roman,
> 
> I also tried to solve this problem many years ago with help of
> the well-tested-in-userspace wfcqueue from Mathieu's URCU.
> 
> I was also looking to solve contention with parallel epoll_wait
> callers with this.  AFAIK, it worked well; but needed the
> userspace tests from wfcqueue ported over to the kernel and more
> review.
> 
> I didn't have enough computing power to show the real-world
> benefits or funding to continue:
> 
> 	https://lore.kernel.org/lkml/?q=wfcqueue+d:..20130501

Hi Eric,

Nice work.  That was a huge change by itself, plus the dependency on
wfcqueue.  I could not find any substantial discussion of it; what was
the reaction of the community?


> It might not be too much trouble for you to brush up the wait-free
> patches and test them against the rwlock implementation.

Ha :)  I may try to cherry-pick these patches; let's see how many
conflicts I have to resolve, since eventpoll.c has changed a lot since
then (6 years have passed, right?).

But reading your work description I assume that epoll_wait() calls
should be faster, because they do not contend with ep_poll_callback().
I did not try to solve that, only the contention between producers,
which makes my change tiny.

I also found your https://yhbt.net/eponeshotmt.c , where you count the
number of bare epoll_wait() calls, which IMO is not the right metric:
we need to count how many events are delivered, not how fast you have
returned from epoll_wait().  But as I said, there is no doubt that
getting rid of contention between the consumer and the producers will
show even better results.
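
In other words, the metric should be something like the consumer loop
below (just a sketch, names are mine, not code from either test
program):

#include <sys/epoll.h>
#include <err.h>
#include <errno.h>

/* Count delivered events, not the number of epoll_wait() wakeups. */
static unsigned long long drain_events(int epfd, unsigned long long expected)
{
	struct epoll_event evs[128];
	unsigned long long delivered = 0;

	while (delivered < expected) {
		int n = epoll_wait(epfd, evs, 128, -1);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			err(1, "epoll_wait");
		}
		delivered += n;		/* the bandwidth metric */
	}
	return delivered;
}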

--
Roman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
  2018-12-06 10:52   ` Roman Penyaev
@ 2018-12-06 20:35     ` Eric Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Wong @ 2018-12-06 20:35 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: Alexander Viro, Paul E. McKenney, Linus Torvalds, linux-fsdevel,
	linux-kernel, Mathieu Desnoyers

Roman Penyaev <rpenyaev@suse.de> wrote:
> On 2018-12-06 00:46, Eric Wong wrote:
> > Roman Penyaev <rpenyaev@suse.de> wrote:
> > > Hi all,
> > > 
> > > The goal of this patch is to reduce contention of ep_poll_callback()
> > > which
> > > can be called concurrently from different CPUs in case of high events
> > > rates and many fds per epoll.  Problem can be very well reproduced by
> > > generating events (write to pipe or eventfd) from many threads, while
> > > consumer thread does polling.  In other words this patch increases the
> > > bandwidth of events which can be delivered from sources to the
> > > poller by
> > > adding poll items in a lockless way to the list.
> > 
> > Hi Roman,
> > 
> > I also tried to solve this problem many years ago with help of
> > the well-tested-in-userspace wfcqueue from Mathieu's URCU.
> > 
> > I was also looking to solve contention with parallel epoll_wait
> > callers with this.  AFAIK, it worked well; but needed the
> > userspace tests from wfcqueue ported over to the kernel and more
> > review.
> > 
> > I didn't have enough computing power to show the real-world
> > benefits or funding to continue:
> > 
> > 	https://lore.kernel.org/lkml/?q=wfcqueue+d:..20130501
> 
> Hi Eric,
> 
> Nice work.  That was a huge change by itself and by dependency
> on wfcqueue.  I could not find any valuable discussion on this,
> what was the reaction of the community?

Hi Roman, AFAIK there wasn't much reaction.  Mathieu was VERY
helpful with wfcqueue but there wasn't much else.  Honestly, I'm
surprised wfcqueue hasn't made it into more places; I love it :)

(More recently, I started an effort to get glibc malloc to use wfcqueue:
https://public-inbox.org/libc-alpha/20180731084936.g4yw6wnvt677miti@dcvr/ )

> > It might not be too much trouble for you to brush up the wait-free
> > patches and test them against the rwlock implementation.
> 
> Ha :)  I may try to cherry-pick these patches, let's see how many
> conflicts I have to resolve, eventpoll.c has been changed a lot
> since that (6 years passed, right?)

AFAIK it has not changed that much; epoll remains a queue with a
key-value mapping.  I'm not a regular/experienced kernel hacker, and I
had no trouble understanding eventpoll.c years ago.

> But reading your work description I can assume that epoll_wait() calls
> should be faster, because they do not content with ep_poll_callback(),
> and I did not try to solve this, only contention between producers,
> which make my change tiny.

Yes, I recall that was it.  My real-world programs[1], even without
slow HDD access, didn't show it, though.

> I also found your https://yhbt.net/eponeshotmt.c , where you count
> number of bare epoll_wait() calls, which IMO is not correct, because
> we need to count how many events are delivered, but not how fast
> you've returned from epoll_wait().  But as I said no doubts that
> getting rid of contention between consumer and producers will show
> even better results.

"epoll_wait calls" == "events delivered" in my case since I (ab)use
epoll_wait with max_events=1 as a work-distribution mechanism
between threads.  Not a common use-case, I admit.
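
Roughly this shape, i.e. every worker thread asks for exactly one event
and the exclusive wakeup in the kernel spreads the ready fds across
threads (an illustrative sketch, not the code from eponeshotmt.c):

#include <sys/epoll.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

/* one of N worker threads sharing the same epoll fd */
static void *worker(void *arg)
{
	int epfd = *(int *)arg;
	struct epoll_event ev;
	char buf[512];

	for (;;) {
		int n = epoll_wait(epfd, &ev, 1, -1);	/* max_events = 1 */

		if (n < 0) {
			if (errno == EINTR)
				continue;
			perror("epoll_wait");
			break;
		}
		/* "handle" the single ready fd: here we just drain it */
		(void)read(ev.data.fd, buf, sizeof(buf));
	}
	return NULL;
}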

My design was terrible from a syscall overhead POV, but my
bottleneck for real-world use for cmogstored[1] was dozens of
rotational HDDs in JBOD configuration; so I favored elimination
of head-of-line blocking over throughput of epoll itself.

My motivation for hacking on epoll back then was only to look
better on synthetic benchmarks that didn't hit slow HDDs :)



[1] git clone https://bogomips.org/cmogstored.git/
    the Ragel-generated HTTP parser was also a bottleneck in
    synthetic benchmarks, as we

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-12-06 20:35 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-03 11:02 [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention Roman Penyaev
2018-12-03 17:34 ` Linus Torvalds
2018-12-04 11:50   ` Roman Penyaev
2018-12-04 23:59     ` Andrea Parri
2018-12-05 11:25       ` Roman Penyaev
2018-12-04 17:23 ` Jason Baron
2018-12-04 19:02   ` Paul E. McKenney
2018-12-05 11:22     ` Roman Penyaev
2018-12-05 11:16   ` Roman Penyaev
2018-12-05 16:38     ` Jason Baron
2018-12-05 20:11       ` Roman Penyaev
2018-12-06  1:54   ` Davidlohr Bueso
2018-12-06  3:08   ` Davidlohr Bueso
2018-12-06 10:27     ` Roman Penyaev
2018-12-06  4:04   ` Davidlohr Bueso
2018-12-06 10:25     ` Roman Penyaev
2018-12-05 23:46 ` Eric Wong
2018-12-06 10:52   ` Roman Penyaev
2018-12-06 20:35     ` Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).