linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/3] epoll: introduce round robin wakeup mode
@ 2015-02-24 21:25 Jason Baron
  2015-02-24 21:25 ` [PATCH v3 1/3] sched/wait: add __wake_up_rotate() Jason Baron
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Jason Baron @ 2015-02-24 21:25 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, luto, linux-kernel,
	linux-fsdevel, linux-api

Hi,

When we are sharing a wakeup source among multiple epoll fds, we end up with
thundering herd wakeups, since there is currently no way to add to the
wakeup source exclusively. This series introduces a new EPOLL_ROTATE flag
to allow for round robin exclusive wakeups.

I believe this patch series addresses the two main concerns that were raised in
prior postings. Namely, that it affected code (and potentially performance)
of the core kernel wakeup functions, even in cases where it was not strictly
needed, and that it could lead to wakeup starvation (since we are no
longer waking up all waiters). It does so by adding an extra layer of
indirection, whereby waiters are attached to a 'pseudo' epoll fd, which in turn
is attached directly to the wakeup source.

Patch 1 introduces the required wakeup hooks. These could be restricted to just
the epoll code, but I added them to the generic wait code in case other people
might find them useful.

Patch 2 adds an optimization to the epoll wakeup code that allows EPOLL_ROTATE
to work optimally; it could also stand on its own as a separate patch.

Finally, patch 3 adds the EPOLL_ROTATE flag and documents the API usage.

I'm also inlining test code making use of this interface, which shows roughly
a 50% speedup, similar to my previous results: http://lwn.net/Articles/632590/.

Sample epoll_create1 manpage text:

EPOLL_ROTATE
	Set the 'exclusive rotation' flag on the new file descriptor.
	This new file descriptor can be added via epoll_ctl() to at most 1
	non-epoll file descriptor. Any epoll fds added directly to the
	new file descriptor via epoll_ctl() will be woken up in a round
	robin, exclusive manner.
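
For illustration, a minimal user-space sketch of this topology (it assumes the
EPOLL_ROTATE value from patch 3/3, only succeeds on a patched kernel, and omits
the worker threads themselves):

#include <sys/epoll.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef EPOLL_ROTATE
#define EPOLL_ROTATE 1	/* from this series; not in released uapi headers */
#endif

int main(void)
{
	int pfd[2], rot, ep_a, ep_b;
	struct epoll_event ev = { .events = EPOLLIN };

	if (pipe(pfd) < 0 || (rot = epoll_create1(EPOLL_ROTATE)) < 0) {
		perror("setup");
		exit(1);
	}

	/* the rotating fd is attached to exactly one non-epoll fd */
	if (epoll_ctl(rot, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
		perror("EPOLL_CTL_ADD source");

	/* worker epoll fds are added to the rotating fd, not to the source;
	 * their epoll_wait() calls are then woken round robin, exclusively */
	ep_a = epoll_create1(0);
	ep_b = epoll_create1(0);
	epoll_ctl(ep_a, EPOLL_CTL_ADD, rot, &ev);
	epoll_ctl(ep_b, EPOLL_CTL_ADD, rot, &ev);

	/* ... each worker thread would now block in epoll_wait(ep_a/ep_b) ... */
	return 0;
}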

Thanks,

-Jason

v3:
-restrict epoll exclusive rotate wakeups to within the epoll code
-Add epoll optimization for overflow list

Jason Baron (3):
  sched/wait: add __wake_up_rotate()
  epoll: limit wakeups to the overflow list
  epoll: Add EPOLL_ROTATE mode

 fs/eventpoll.c                 | 52 +++++++++++++++++++++++++++++++++++-------
 include/linux/wait.h           |  1 +
 include/uapi/linux/eventpoll.h |  4 ++++
 kernel/sched/wait.c            | 27 ++++++++++++++++++++++
 4 files changed, 76 insertions(+), 8 deletions(-)

-- 
1.8.2.rc2



/*
 * Test program for the EPOLL_ROTATE round robin wakeup mode.
 *
 * Build (assuming this file is saved as epoll-rotate-test.c):
 *   gcc -o epoll-rotate-test epoll-rotate-test.c -lpthread
 *
 * Run with '-o' to create the shared source epoll fd with EPOLL_ROTATE,
 * or with no arguments for the default (wake all) behavior.
 */
#include <unistd.h>
#include <sys/epoll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

#define NUM_THREADS 100
#define NUM_EVENTS 20000

/* flags from earlier revisions of this series; unused here */
#define EPOLLEXCLUSIVE (1 << 28)
#define EPOLLBALANCED (1 << 27)

/* epoll_create1() flag introduced by this series */
#define EPOLL_ROTATE 1

int optimize, exclusive;
int p[2];			/* shared pipe: the common wakeup source */
int ep_src_fd;			/* epoll fd wrapping the pipe read end */
pthread_t threads[NUM_THREADS];
int event_count[NUM_THREADS];

struct epoll_event evt = {
	.events = EPOLLIN
};

void die(const char *msg)
{
	perror(msg);
	exit(-1);
}

void *run_func(void *ptr)
{
	int ret;
	int epfd;
	char buf[4];
	int id = *(int *)ptr;
	struct epoll_event revent;

	/* each thread has its own epoll fd, which watches the shared ep_src_fd */
	if ((epfd = epoll_create(1)) < 0)
		die("create");

	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, ep_src_fd, &evt);
	if (ret)
		perror("epoll_ctl add error!\n");

	while (1) {
		ret = epoll_wait(epfd, &revent, 1, -1);
		ret = read(p[0], buf, sizeof(buf));
		if (ret == sizeof(buf))
			event_count[id]++;
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	int ret, i, j;
	int id[NUM_THREADS];
	int total = 0;
	int nohit = 0;

	if (argc == 2) {
		if (strcmp(argv[1], "-o") == 0)
			optimize = 1;
		if (strcmp(argv[1], "-e") == 0)	/* accepted but unused here */
			exclusive = 1;
	}

	if (pipe(p) < 0)
		die("pipe");

	/* the 'special' epoll fd attached directly to the wakeup source */
	if ((ep_src_fd = epoll_create1(optimize ? EPOLL_ROTATE : 0)) < 0)
		die("create");

	ret = epoll_ctl(ep_src_fd, EPOLL_CTL_ADD, p[0], &evt);
	if (ret)
		perror("epoll_ctl add core error!\n");

	for (i = 0; i < NUM_THREADS; i++) {
		id[i] = i;
		pthread_create(&threads[i], NULL, run_func, &id[i]);
	}

	/* generate the events: each write makes 4 bytes readable on the pipe */
	for (j = 0; j < NUM_EVENTS; j++) {
		write(p[1], p, sizeof(int));
		usleep(100);
	}

	for (i = 0; i < NUM_THREADS; i++) {
		pthread_cancel(threads[i]);
		pthread_join(threads[i], NULL);
		printf("joined: %d\n", i);
		printf("event count: %d\n", event_count[i]);
		total += event_count[i];
		if (!event_count[i])
			nohit++;
	}

	printf("total events is: %d\n", total);
	printf("nohit is: %d\n", nohit);
	return 0;
}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/3] sched/wait: add __wake_up_rotate()
  2015-02-24 21:25 [PATCH v3 0/3] epoll: introduce round robin wakeup mode Jason Baron
@ 2015-02-24 21:25 ` Jason Baron
  2015-02-24 21:25 ` [PATCH v3 2/3] epoll: restrict wakeups to the overflow list Jason Baron
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Jason Baron @ 2015-02-24 21:25 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, luto, linux-kernel,
	linux-fsdevel, linux-api

Create a special queue where waiters are 'rotated' to the end of the queue
after they are woken up. Waiters are expected to be added 'exclusively'
to this queue, and the wakeup must occur with __wake_up_rotate().

The current issue with just adding a waiter as exclusive is that it often
results in the same thread being woken up again and again. The first intended
user of this functionality is epoll.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 include/linux/wait.h |  1 +
 kernel/sched/wait.c  | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2232ed1..86f06f4 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -152,6 +152,7 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_bit(wait_queue_head_t *, void *, int);
+void __wake_up_rotate(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int wake_flags, void *key);
 int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, wait_bit_action_f *, unsigned);
 int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, wait_bit_action_f *, unsigned);
 void wake_up_bit(void *, int);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 852143a..2ceed03 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -157,6 +157,33 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
 EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
 
 /*
+ * Special wait queue where anything added as exclusive will be rotated to the
+ * back of the queue in order to balance the wakeups.
+ */
+void __wake_up_rotate(wait_queue_head_t *q, unsigned int mode,
+		      int nr_exclusive, int wake_flags, void *key)
+{
+	unsigned long flags;
+	wait_queue_t *curr, *next;
+	LIST_HEAD(rotate_list);
+
+	spin_lock_irqsave(&q->lock, flags);
+	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
+		unsigned wq_flags = curr->flags;
+
+		if (curr->func(curr, mode, wake_flags, key) &&
+					(wq_flags & WQ_FLAG_EXCLUSIVE)) {
+			if (nr_exclusive > 0)
+				list_move_tail(&curr->task_list, &rotate_list);
+			if (!--nr_exclusive)
+				break;
+		}
+	}
+	list_splice_tail(&rotate_list, &q->task_list);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+
+/*
  * Note: we use "set_current_state()" _after_ the wait-queue add,
  * because we need a memory barrier there on SMP, so that any
  * wake-function that tests for the wait-queue being active
-- 
1.8.2.rc2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/3] epoll: restrict wakeups to the overflow list
  2015-02-24 21:25 [PATCH v3 0/3] epoll: introduce round robin wakeup mode Jason Baron
  2015-02-24 21:25 ` [PATCH v3 1/3] sched/wait: add __wake_up_rotate() Jason Baron
@ 2015-02-24 21:25 ` Jason Baron
  2015-02-24 21:25 ` [PATCH v3 3/3] epoll: Add EPOLL_ROTATE mode Jason Baron
  2015-02-25  7:38 ` [PATCH v3 0/3] epoll: introduce round robin wakeup mode Ingo Molnar
  3 siblings, 0 replies; 16+ messages in thread
From: Jason Baron @ 2015-02-24 21:25 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, luto, linux-kernel,
	linux-fsdevel, linux-api

During ep_scan_ready_list(), while ep->lock is dropped, we queue new
events to the ep->ovflist. However, instead of issuing wakeups only for these
newly encountered events, we currently proceed to issue wakeups even if
nothing new is being propagated.

Normally, this simply results in unnecessary calls to wakeup. However,
now that we want to add wakeup queues that have 'state', this results in
unnecessary state transitions. That is, with the current default behavior
of always waking up all threads, the extra calls to wakeup do not affect
things adversely (besides the extra call overheads). However, we wish to
add policies that are stateful (for example rotating wakeups among epoll
sets), and these unnecessary wakeups cause unwanted transitions.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..da84712 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -594,7 +594,7 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 					   struct list_head *, void *),
 			      void *priv, int depth, bool ep_locked)
 {
-	int error, pwake = 0;
+	int error, pwake = 0, newly_ready = 0;
 	unsigned long flags;
 	struct epitem *epi, *nepi;
 	LIST_HEAD(txlist);
@@ -634,6 +634,13 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 	for (nepi = ep->ovflist; (epi = nepi) != NULL;
 	     nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
 		/*
+		 * We only need to perform wakeups if new events have arrived
+		 * while the ep->lock was dropped. We should have already
+		 * issued the wakeups for any existing events.
+		 */
+		if (!newly_ready)
+			newly_ready = 1;
+		/*
 		 * We need to check if the item is already in the list.
 		 * During the "sproc" callback execution time, items are
 		 * queued into ->ovflist but the "txlist" might already
@@ -657,7 +664,7 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 	list_splice(&txlist, &ep->rdllist);
 	__pm_relax(ep->ws);
 
-	if (!list_empty(&ep->rdllist)) {
+	if (newly_ready) {
 		/*
 		 * Wake up (if active) both the eventpoll wait list and
 		 * the ->poll() wait list (delayed after we release the lock).
-- 
1.8.2.rc2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 3/3] epoll: Add EPOLL_ROTATE mode
  2015-02-24 21:25 [PATCH v3 0/3] epoll: introduce round robin wakeup mode Jason Baron
  2015-02-24 21:25 ` [PATCH v3 1/3] sched/wait: add __wake_up_rotate() Jason Baron
  2015-02-24 21:25 ` [PATCH v3 2/3] epoll: restrict wakeups to the overflow list Jason Baron
@ 2015-02-24 21:25 ` Jason Baron
  2015-02-25  7:38 ` [PATCH v3 0/3] epoll: introduce round robin wakeup mode Ingo Molnar
  3 siblings, 0 replies; 16+ messages in thread
From: Jason Baron @ 2015-02-24 21:25 UTC (permalink / raw)
  To: peterz, mingo, viro
  Cc: akpm, normalperson, davidel, mtk.manpages, luto, linux-kernel,
	linux-fsdevel, linux-api

Epoll file descriptors that are added to a shared wakeup source are always
added in a non-exclusive manner. That means that when we have multiple epoll
fds attached to a shared wakeup source they are all woken up. This can
lead to excessive cpu usage and uneven load distribution.

This patch introduces a new 'EPOLL_ROTATE' flag that can be passed as
an argument to epoll_create1(). The 'EPOLL_ROTATE' flag creates a 'special'
epfd, such that it will round robin its wakeups among any epfds attached
to it. This special epfd can then be attached to at most 1 wakeup source.

For example:

1. ep1 = epoll_create1(EPOLL_ROTATE);
2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
   epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
   epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
     .
     .
     .

The usage of the indirect 'special' epfd is designed such that we don't have to
attach the epfds from step #3 directly to the wakeup source. This addresses two
issues. First, it ensures that competing processes which are attached directly
to the wakeup source do not miss wakeups. Because of the round robin nature
of the wakeups, we want to ensure that competing threads can't starve out or
insert themselves into our wakeup queue. Since this 'special' fd is local to
our process, we can control its exposure. And second, we didn't want the
implementation to affect the core kernel wakeup paths, even when unused. By
creating a separate 'special' epfd we have isolated the wakeup paths.

An implementation note is that in the epoll wakeup routine,
'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
wakeup, only when there are current waiters. The idea is to use this additional
heuristic in order to minimize wakeup latencies.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 fs/eventpoll.c                 | 41 +++++++++++++++++++++++++++++++++++------
 include/uapi/linux/eventpoll.h |  4 ++++
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index da84712..a8b06a1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -219,6 +219,9 @@ struct eventpoll {
 	/* used to optimize loop detection check */
 	int visited;
 	struct list_head visited_list_link;
+
+	/* wakeup policy type */
+	int wakeup_policy;
 };
 
 /* Wait structure used by the poll hooks */
@@ -1009,6 +1012,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1073,8 +1077,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
 		wake_up_locked(&ep->wq);
+		ewake = 1;
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1082,9 +1088,15 @@ out_unlock:
 	spin_unlock_irqrestore(&ep->lock, flags);
 
 	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
-
+	if (pwake) {
+		if (ep->wakeup_policy == EPOLL_ROTATE)
+			__wake_up_rotate(&ep->poll_wait, TASK_NORMAL, 1, 0,
+					 (void *)POLLIN);
+		else
+			ep_poll_safewake(&ep->poll_wait);
+	}
+	if (epi->event.events & EPOLLROTATE)
+		return ewake;
 	return 1;
 }
 
@@ -1097,12 +1109,19 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 {
 	struct epitem *epi = ep_item_from_epqueue(pt);
 	struct eppoll_entry *pwq;
+	struct eventpoll *ep = NULL;
 
 	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (is_file_epoll(epi->ffd.file))
+			ep = epi->ffd.file->private_data;
+		if (ep && (ep->wakeup_policy == EPOLL_ROTATE)) {
+			__add_wait_queue_exclusive(whead, &pwq->wait);
+			epi->event.events |= EPOLLROTATE;
+		} else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1777,7 +1796,7 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
 	/* Check the EPOLL_* constant for consistency.  */
 	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
 
-	if (flags & ~EPOLL_CLOEXEC)
+	if (flags & ~(EPOLL_CLOEXEC | EPOLL_ROTATE))
 		return -EINVAL;
 	/*
 	 * Create the internal data structure ("struct eventpoll").
@@ -1802,6 +1821,8 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
 	}
 	ep->file = file;
 	fd_install(fd, file);
+	if (flags & EPOLL_ROTATE)
+		ep->wakeup_policy = EPOLL_ROTATE;
 	return fd;
 
 out_free_fd:
@@ -1874,6 +1895,9 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	 */
 	ep = f.file->private_data;
 
+	if ((ep->wakeup_policy == EPOLL_ROTATE) && is_file_epoll(tf.file))
+		goto error_tgt_fput;
+
 	/*
 	 * When we insert an epoll file descriptor, inside another epoll file
 	 * descriptor, there is the change of creating closed loops, which are
@@ -1911,6 +1935,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 				mutex_lock_nested(&tep->mtx, 1);
 			}
 		}
+		error = -EINVAL;
+		if ((ep->wakeup_policy == EPOLL_ROTATE) &&
+					(!RB_EMPTY_ROOT(&ep->rbr)))
+			goto error_unlock;
 	}
 
 	/*
@@ -1945,6 +1973,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 			error = -ENOENT;
 		break;
 	}
+error_unlock:
 	if (tep != NULL)
 		mutex_unlock(&tep->mtx);
 	mutex_unlock(&ep->mtx);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..b0a3b41 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -20,12 +20,16 @@
 
 /* Flags for epoll_create1.  */
 #define EPOLL_CLOEXEC O_CLOEXEC
+#define EPOLL_ROTATE 0x1
 
 /* Valid opcodes to issue to sys_epoll_ctl() */
 #define EPOLL_CTL_ADD 1
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Marks struct epitems attached to wakeup policy type EPOLL_ROTATE */
+#define EPOLLROTATE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
-- 
1.8.2.rc2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-02-24 21:25 [PATCH v3 0/3] epoll: introduce round robin wakeup mode Jason Baron
                   ` (2 preceding siblings ...)
  2015-02-24 21:25 ` [PATCH v3 3/3] epoll: Add EPOLL_ROTATE mode Jason Baron
@ 2015-02-25  7:38 ` Ingo Molnar
  2015-02-25 16:27   ` Jason Baron
  3 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2015-02-25  7:38 UTC (permalink / raw)
  To: Jason Baron
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	luto, linux-kernel, linux-fsdevel, linux-api, Linus Torvalds,
	Alexander Viro


* Jason Baron <jbaron@akamai.com> wrote:

> Hi,
> 
> When we are sharing a wakeup source among multiple epoll 
> fds, we end up with thundering herd wakeups, since there 
> is currently no way to add to the wakeup source 
> exclusively. This series introduces a new EPOLL_ROTATE 
> flag to allow for round robin exclusive wakeups.
> 
> I believe this patch series addresses the two main 
> concerns that were raised in prior postings. Namely, that 
> it affected code (and potentially performance) of the 
> core kernel wakeup functions, even in cases where it was 
> not strictly needed, and that it could lead to wakeup 
> starvation (since we are no longer waking up all 
> waiters). It does so by adding an extra layer of 
> indirection, whereby waiters are attached to a 'pseudo' 
> epoll fd, which in turn is attached directly to the 
> wakeup source.

>   sched/wait: add __wake_up_rotate()

>  include/linux/wait.h           |  1 +
>  kernel/sched/wait.c            | 27 ++++++++++++++++++++++

So the scheduler bits are looking good to me in principle, 
because they just add a new round-robin-rotating wakeup 
variant and don't disturb the others.

Is there consensus on the epoll ABI changes? With Davide 
Libenzi inactive eventpoll appears to be without a 
dedicated maintainer since 2011 or so. Is there anyone who 
knows the code and its usages in detail and does final ABI 
decisions on eventpoll - Andrew, Al or Linus?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-02-25  7:38 ` [PATCH v3 0/3] epoll: introduce round robin wakeup mode Ingo Molnar
@ 2015-02-25 16:27   ` Jason Baron
  2015-02-27 21:10     ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Baron @ 2015-02-25 16:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, mingo, viro, akpm, normalperson, davidel, mtk.manpages,
	luto, linux-kernel, linux-fsdevel, linux-api, Linus Torvalds,
	Alexander Viro

On 02/25/2015 02:38 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>> Hi,
>>
>> When we are sharing a wakeup source among multiple epoll 
>> fds, we end up with thundering herd wakeups, since there 
>> is currently no way to add to the wakeup source 
>> exclusively. This series introduces a new EPOLL_ROTATE 
>> flag to allow for round robin exclusive wakeups.
>>
>> I believe this patch series addresses the two main 
>> concerns that were raised in prior postings. Namely, that 
>> it affected code (and potentially performance) of the 
>> core kernel wakeup functions, even in cases where it was 
>> not strictly needed, and that it could lead to wakeup 
>> starvation (since we are no longer waking up all 
>> waiters). It does so by adding an extra layer of 
>> indirection, whereby waiters are attached to a 'pseudo' 
>> epoll fd, which in turn is attached directly to the 
>> wakeup source.
>>   sched/wait: add __wake_up_rotate()
>>  include/linux/wait.h           |  1 +
>>  kernel/sched/wait.c            | 27 ++++++++++++++++++++++
> So the scheduler bits are looking good to me in principle, 
> because they just add a new round-robin-rotating wakeup 
> variant and don't disturb the others.
>
> Is there consensus on the epoll ABI changes? With Davide 

I'm not sure there is a clear consensus on this change,
but I'm hoping that I've addressed the outstanding
concerns in this latest version.

I also think the addition of a way to do a 'wakeup policy'
here will open up other 'policies', such as taking into
account cpu affinity as you suggested. So, I think it's
potentially an interesting direction for this code.

> Libenzi inactive eventpoll appears to be without a 
> dedicated maintainer since 2011 or so. Is there anyone who 
> knows the code and its usages in detail and does final ABI 
> decisions on eventpoll - Andrew, Al or Linus?
>
Generally, Andrew and Al do more 'final' reviews here,
and a lot of others on lkml are always very helpful in
looking at this code. However, it's not always clear, at
least to me, who I should pester.

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-02-25 16:27   ` Jason Baron
@ 2015-02-27 21:10     ` Andrew Morton
  2015-02-27 21:31       ` Jonathan Corbet
  2015-02-27 22:01       ` Jason Baron
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2015-02-27 21:10 UTC (permalink / raw)
  To: Jason Baron
  Cc: Ingo Molnar, peterz, mingo, viro, normalperson, davidel,
	mtk.manpages, luto, linux-kernel, linux-fsdevel, linux-api,
	Linus Torvalds, Alexander Viro

On Wed, 25 Feb 2015 11:27:04 -0500 Jason Baron <jbaron@akamai.com> wrote:

> > Libenzi inactive eventpoll appears to be without a 
> > dedicated maintainer since 2011 or so. Is there anyone who 
> > knows the code and its usages in detail and does final ABI 
> > decisions on eventpoll - Andrew, Al or Linus?
> >
> Generally, Andrew and Al do more 'final' reviews here,
> and a lot of others on lkml are always very helpful in
> looking at this code. However, its not always clear, at
> least to me, who I should pester.

Yes, it's a difficult situation.

The 3/3 changelog refers to "EPOLLROUNDROBIN" which I assume is
a leftover from some earlier revision?

I don't really understand the need for rotation/round-robin.  We can
solve the thundering herd via exclusive wakeups, but what is the point
in choosing to wake the task which has been sleeping for the longest
time?  Why is that better than waking the task which has been sleeping
for the *least* time?  That's probably faster as that task's data is
more likely to still be in cache.

The changelogs talk about "starvation" but they don't really say what
this term means in this context, nor why it is a bad thing.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-02-27 21:10     ` Andrew Morton
@ 2015-02-27 21:31       ` Jonathan Corbet
  2015-03-02  5:04         ` Jason Baron
  2015-02-27 22:01       ` Jason Baron
  1 sibling, 1 reply; 16+ messages in thread
From: Jonathan Corbet @ 2015-02-27 21:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Baron, Ingo Molnar, peterz, mingo, viro, normalperson,
	davidel, mtk.manpages, luto, linux-kernel, linux-fsdevel,
	linux-api, Linus Torvalds, Alexander Viro

On Fri, 27 Feb 2015 13:10:34 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> I don't really understand the need for rotation/round-robin.  We can
> solve the thundering herd via exclusive wakeups, but what is the point
> in choosing to wake the task which has been sleeping for the longest
> time?  Why is that better than waking the task which has been sleeping
> for the *least* time?  That's probably faster as that task's data is
> more likely to still be in cache.

So here's my chance to show the world what a fool I am (again)...  If I
understand this at all, a task woken from epoll_wait() remains on the wait
queues while it is off doing other stuff.  If you're doing exclusive
wakeups, the task at the head of the queue will get all of them, since it
never gets removed from the queue.  So you don't spread your load around,
and, indeed, you may "wake" a process that is busy doing something else and
can't deal with the event now anyway.  You need some way to shuffle up the
wait queue, and round-robin is probably as good as any.

(The alternative would be to take the process off the queue until it calls
epoll_wait() again, but that runs counter to what epoll is all about).

At least, that was my impression when I took a look at this stuff.

jon

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-02-27 21:10     ` Andrew Morton
  2015-02-27 21:31       ` Jonathan Corbet
@ 2015-02-27 22:01       ` Jason Baron
  2015-02-27 22:31         ` Andrew Morton
  1 sibling, 1 reply; 16+ messages in thread
From: Jason Baron @ 2015-02-27 22:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, peterz, mingo, viro, normalperson, davidel,
	mtk.manpages, luto, linux-kernel, linux-fsdevel, linux-api,
	Linus Torvalds, Alexander Viro


On 02/27/2015 04:10 PM, Andrew Morton wrote:
> On Wed, 25 Feb 2015 11:27:04 -0500 Jason Baron <jbaron@akamai.com> wrote:
>
>>> Libenzi inactive eventpoll appears to be without a 
>>> dedicated maintainer since 2011 or so. Is there anyone who 
>>> knows the code and its usages in detail and does final ABI 
>>> decisions on eventpoll - Andrew, Al or Linus?
>>>
>> Generally, Andrew and Al do more 'final' reviews here,
>> and a lot of others on lkml are always very helpful in
>> looking at this code. However, its not always clear, at
>> least to me, who I should pester.
> Yes, it's a difficult situation.
>
> The 3/3 changelog refers to "EPOLLROUNDROBIN" which I assume is
> a leftover from some earlier revision?

Yes, that's a typo there. It should read 'EPOLL_ROTATE'.

>
> I don't really understand the need for rotation/round-robin.  We can
> solve the thundering herd via exclusive wakeups, but what is the point
> in choosing to wake the task which has been sleeping for the longest
> time?  Why is that better than waking the task which has been sleeping
> for the *least* time?  That's probably faster as that task's data is
> more likely to still be in cache.
>
> The changelogs talks about "starvation" but they don't really say what
> this term means in this context, nor why it is a bad thing.
>

So the idea with the 'rotation' is to try and distribute the
workload more evenly across the worker threads. We currently
tend to wake up the 'head' of the queue over and over and
thus the workload for us is not evenly distributed. In fact, we
have a workload where we have to remove all the epoll sets
and then re-add them in a different order to improve the situation.
We are trying to avoid this workaround and in addition avoid
thundering wakeups when possible (using exclusive as you
mention).

I agree that waking up the task that may have been sleeping longer
may not be the best for all workloads. So what I am proposing
here is an optional flag to meet a certain workload. It might not be
right for all workloads, but we have found it quite useful.

The 'starvation' mention was in regard to the fact that with this
new behavior of not waking up all threads (and rotating them),
an adversarial thread might insert itself into our wakeup queue
and 'starve' us out. This concern was raised by Andy Lutomirski,
and the current series is not subject to this issue, because it works
by creating a new epoll fd and then adding that epoll fd to the
wakeup queue. Thus, this 'new' epoll fd is local to the thread
and the wakeup queue continues to wake all threads. Only the
'new' epoll fd, which we then attach ourselves to, implements the
exclusive/rotate behavior.

Thanks,

-Jason



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-02-27 22:01       ` Jason Baron
@ 2015-02-27 22:31         ` Andrew Morton
  2015-03-05  0:02           ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2015-02-27 22:31 UTC (permalink / raw)
  To: Jason Baron
  Cc: Ingo Molnar, peterz, mingo, viro, normalperson, davidel,
	mtk.manpages, luto, linux-kernel, linux-fsdevel, linux-api,
	Linus Torvalds, Alexander Viro

On Fri, 27 Feb 2015 17:01:32 -0500 Jason Baron <jbaron@akamai.com> wrote:

> 
> >
> > I don't really understand the need for rotation/round-robin.  We can
> > solve the thundering herd via exclusive wakeups, but what is the point
> > in choosing to wake the task which has been sleeping for the longest
> > time?  Why is that better than waking the task which has been sleeping
> > for the *least* time?  That's probably faster as that task's data is
> > more likely to still be in cache.
> >
> > The changelogs talks about "starvation" but they don't really say what
> > this term means in this context, nor why it is a bad thing.
> >

I'm still not getting it.

> So the idea with the 'rotation' is to try and distribute the
> workload more evenly across the worker threads.

Why?

> We currently
> tend to wake up the 'head' of the queue over and over and
> thus the workload for us is not evenly distributed.

What's wrong with that?

> In fact, we
> have a workload where we have to remove all the epoll sets
> and then re-add them in a different order to improve the situation.

Why?

> We are trying to avoid this workaround and in addition avoid
> thundering wakeups when possible (using exclusive as you
> mention).


A better way to describe a patchset like this is to start out
describing what the user sees: what is wrong with the current kernel
behaviour and how will it be improved?  Once that is understood then
start getting into kernel internal details.


Jon's explanation was much more useful.  Was it accurate?



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-02-27 21:31       ` Jonathan Corbet
@ 2015-03-02  5:04         ` Jason Baron
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Baron @ 2015-03-02  5:04 UTC (permalink / raw)
  To: Jonathan Corbet, Andrew Morton
  Cc: Ingo Molnar, peterz, mingo, viro, normalperson, davidel,
	mtk.manpages, luto, linux-kernel, linux-fsdevel, linux-api,
	Linus Torvalds, Alexander Viro

On 02/27/2015 04:31 PM, Jonathan Corbet wrote:
> On Fri, 27 Feb 2015 13:10:34 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> I don't really understand the need for rotation/round-robin.  We can
>> solve the thundering herd via exclusive wakeups, but what is the point
>> in choosing to wake the task which has been sleeping for the longest
>> time?  Why is that better than waking the task which has been sleeping
>> for the *least* time?  That's probably faster as that task's data is
>> more likely to still be in cache.
> So here's my chance to show the world what a fool I am (again)...  If I
> understand this at all, a task woken from epoll_wait() remains on the wait
> queues while it is off doing other stuff.  If you're doing exclusive
> wakeups, the task at the head of the queue will get all of them, since it
> never gets removed from the queue.  So you don't spread your load around,
> and, indeed, you may "wake" a process that is busy doing something else and
> can't deal with the event now anyway.  You need some way to shuffle up the
> wait queue, and round-robin is probably as good as any.
>
> (The alternative would be to take the process off the queue until it calls
> epoll_wait() again, but that runs counter to what epoll is all about).
>
> At least, that was my impression when I took a look at this stuff.
>
> jon

So tasks do not remain on wait queues when they are not in
epoll_wait(). That is, tasks are added to the associated epoll
fd wait queue at the beginning of epoll_wait(), and removed
from the associated epoll fd wait queue when epoll_wait()
returns.

One can think about the problem, perhaps, as assigning fd
events - POLLIN, POLLOUT, etc. - to a set of tasks. And this
discussion is about how to do the assignment in certain
cases. Namely, one could start by partitioning the set of fds
into unique sets and then assigning them
(via EPOLL_CTL_ADD) to different epoll fds. Then, if there
is, say, a single task blocking on each epoll fd (via epoll_wait()),
each task can work nicely on its own set of events
without needing to co-ordinate with the other
tasks.

Now, this all works fine until we have an 'fd' or event source
that we wish to share among all the tasks. We might
want to share it because it generates events or work that
would potentially overwhelm a single task.

So in this shared event source case, where we have added
the fd to all of the epoll fds, we currently do a wake all.
This series attempts to change that behavior (with an
optional flag to epoll_create1()) into a round robin wakeup,
both to avoid excessive wakeups and to more evenly distribute
the wakeups. Note also that it will continue to wake up tasks
as long as it doesn't find any blocked in epoll_wait(). Thus, it
can still potentially wake up all of them if nobody is in epoll_wait().

Now, we could try to distribute the fd events among tasks
all waiting on a single epoll fd (meaning we have a single
event queue). But we have already partitioned most of the
events, so why combine them back into a single queue?
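
As a rough sketch of that partitioning (illustrative only, not code from this
series; private_fds[] and shared_fd are assumed to be set up by the caller):

#include <sys/epoll.h>

#define NUM_TASKS 4

/*
 * Partition 'nfds' private fds across per-task epoll fds; each task then
 * blocks in epoll_wait() on its own epfd and works on its own events.
 * The single shared source is the exception: today it has to be added to
 * every set, so an event on it wakes every waiting task.
 */
void partition_fds(int epfds[NUM_TASKS], const int *private_fds, int nfds,
		   int shared_fd)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int i;

	for (i = 0; i < NUM_TASKS; i++)
		epfds[i] = epoll_create1(0);

	/* disjoint sets: each private fd is watched by exactly one task */
	for (i = 0; i < nfds; i++) {
		ev.data.fd = private_fds[i];
		epoll_ctl(epfds[i % NUM_TASKS], EPOLL_CTL_ADD,
			  private_fds[i], &ev);
	}

	ev.data.fd = shared_fd;
	for (i = 0; i < NUM_TASKS; i++)
		epoll_ctl(epfds[i], EPOLL_CTL_ADD, shared_fd, &ev);
}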

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-02-27 22:31         ` Andrew Morton
@ 2015-03-05  0:02           ` Ingo Molnar
  2015-03-05  3:53             ` Jason Baron
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2015-03-05  0:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Baron, peterz, mingo, viro, normalperson, davidel,
	mtk.manpages, luto, linux-kernel, linux-fsdevel, linux-api,
	Linus Torvalds, Alexander Viro


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 27 Feb 2015 17:01:32 -0500 Jason Baron <jbaron@akamai.com> wrote:
> 
> > 
> > >
> > > I don't really understand the need for rotation/round-robin.  We can
> > > solve the thundering herd via exclusive wakeups, but what is the point
> > > in choosing to wake the task which has been sleeping for the longest
> > > time?  Why is that better than waking the task which has been sleeping
> > > for the *least* time?  That's probably faster as that task's data is
> > > more likely to still be in cache.
> > >
> > > The changelogs talks about "starvation" but they don't really say what
> > > this term means in this context, nor why it is a bad thing.
> > >
> 
> I'm still not getting it.
> 
> > So the idea with the 'rotation' is to try and distribute the
> > workload more evenly across the worker threads.
> 
> Why?
> 
> > We currently
> > tend to wake up the 'head' of the queue over and over and
> > thus the workload for us is not evenly distributed.
> 
> What's wrong with that?
> 
> > In fact, we
> > have a workload where we have to remove all the epoll sets
> > and then re-add them in a different order to improve the situation.
> 
> Why?

So my guess would be (but Jason would know this more precisely) that 
by spreading the workload to more tasks in a FIFO manner, the individual 
tasks can move between CPUs better and fill in available CPU 
bandwidth better, increasing concurrency.

With the current LIFO distribution of wakeups, the 'busiest' threads 
will get many wakeups (potentially from different CPUs), making them 
cache-hot, which may interfere with them easily migrating across CPUs.

So while technically both approaches have similar concurrency, the 
more 'spread out' task hierarchy schedules in a more consistent 
manner.

But ... this is just a wild guess and even if my description is 
accurate then it should still be backed by robust measurements and 
observations, before we extend the ABI.

This hypothesis could be tested by the patch below: with the patch 
applied if the performance difference between FIFO and LIFO epoll 
wakeups disappears, then the root cause is the cache-hotness code in 
the scheduler.

Thanks,

	Ingo

---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee595ef30470..89af04e946d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5354,7 +5354,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
-	if (p->sched_class != &fair_sched_class)
+	if (1 || p->sched_class != &fair_sched_class)
 		return 0;
 
 	if (unlikely(p->policy == SCHED_IDLE))

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-03-05  0:02           ` Ingo Molnar
@ 2015-03-05  3:53             ` Jason Baron
  2015-03-05  9:15               ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Baron @ 2015-03-05  3:53 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: peterz, mingo, viro, normalperson, davidel, mtk.manpages, luto,
	linux-kernel, linux-fsdevel, linux-api, Linus Torvalds,
	Alexander Viro

On 03/04/2015 07:02 PM, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> On Fri, 27 Feb 2015 17:01:32 -0500 Jason Baron <jbaron@akamai.com> wrote:
>>
>>>> I don't really understand the need for rotation/round-robin.  We can
>>>> solve the thundering herd via exclusive wakeups, but what is the point
>>>> in choosing to wake the task which has been sleeping for the longest
>>>> time?  Why is that better than waking the task which has been sleeping
>>>> for the *least* time?  That's probably faster as that task's data is
>>>> more likely to still be in cache.
>>>>
>>>> The changelogs talks about "starvation" but they don't really say what
>>>> this term means in this context, nor why it is a bad thing.
>>>>
>> I'm still not getting it.
>>
>>> So the idea with the 'rotation' is to try and distribute the
>>> workload more evenly across the worker threads.
>> Why?
>>
>>> We currently
>>> tend to wake up the 'head' of the queue over and over and
>>> thus the workload for us is not evenly distributed.
>> What's wrong with that?
>>
>>> In fact, we
>>> have a workload where we have to remove all the epoll sets
>>> and then re-add them in a different order to improve the situation.
>> Why?
> So my guess would be (but Jason would know this more precisely) that 
> spreading the workload to more tasks in a FIFO manner, the individual 
> tasks can move between CPUs better, and fill in available CPU 
> bandwidth better, increasing concurrency.
>
> With the current LIFO distribution of wakeups, the 'busiest' threads 
> will get many wakeups (potentially from different CPUs), making them 
> cache-hot, which may interfere with them easily migrating across CPUs.
>
> So while technically both approaches have similar concurrency, the 
> more 'spread out' task hierarchy schedules in a more consistent 
> manner.
>
> But ... this is just a wild guess and even if my description is 
> accurate then it should still be backed by robust measurements and 
> observations, before we extend the ABI.
>
> This hypothesis could be tested by the patch below: with the patch 
> applied if the performance difference between FIFO and LIFO epoll 
> wakeups disappears, then the root cause is the cache-hotness code in 
> the scheduler.
>
>

So what I think you are describing here fits the model
where you have a single epoll fd (returned by epoll_create()),
which is then attached to wakeup fds. So that can be thought
of as having a single 'event' queue (the single epoll fd), where
multiple threads are competing to grab events via epoll_wait()
and things are currently LIFO there as you describe.

However, the use-case I was trying to get at is where you have
multiple epoll fds (or event queues), and really just one thread
doing epoll_wait() against each epoll fd. So instead of having
all threads competing for all events, we have divided up the
events into separate queues.

Now, the 'problematic' case is where there may be an event
source that is shared among all these epoll fds - such as a listen
socket or a pipe. Now there are two distinct issues in this case
that this series is trying to address.

1) All epoll fds will receive a wakeup (and hence the threads
that are potentially blocking there, although they may not
return to user-space if the event has already been consumed).
I think the test case I posted shows this pretty clearly -
http://lwn.net/Articles/632590/. The number of context switches
without exclusive additions to the wait queue is 50x the case where
they are added exclusively. That's a lot of extra cpu usage.

2) We are using the wakeup in this case to 'assign' work more
permanently to the thread. That is, in the case of a listen socket
we then add the connected socket to the woken up thread's
local set of epoll events. So the load persists past the wakeup.
And in this case, doing the round robin wakeups simply allows
us to access more cpu bandwidth. (I'm also looking into potentially
using cpu affinity to do the wakeups as well as you suggested.)

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-03-05  3:53             ` Jason Baron
@ 2015-03-05  9:15               ` Ingo Molnar
  2015-03-05 20:24                 ` Jason Baron
  2015-03-07 12:35                 ` Jason Baron
  0 siblings, 2 replies; 16+ messages in thread
From: Ingo Molnar @ 2015-03-05  9:15 UTC (permalink / raw)
  To: Jason Baron
  Cc: Andrew Morton, peterz, mingo, viro, normalperson, davidel,
	mtk.manpages, luto, linux-kernel, linux-fsdevel, linux-api,
	Linus Torvalds, Alexander Viro


* Jason Baron <jbaron@akamai.com> wrote:

> 2) We are using the wakeup in this case to 'assign' work more 
> permanently to the thread. That is, in the case of a listen socket 
> we then add the connected socket to the woken up threads local set 
> of epoll events. So the load persists past the wake up. And in this 
> case, doing the round robin wakeups, simply allows us to access more 
> cpu bandwidth. (I'm also looking into potentially using cpu affinity 
> to do the wakeups as well as you suggested.)

So this is the part that I still don't understand.

What difference do LIFO versus FIFO wakeups make to CPU utilization: 
a thread waiting for work is idle, no matter whether it ran most 
recently or least recently.

Once an idle worker thread is woken it will compute its own work, for 
whatever time it needs to, and won't be bothered by epoll again until 
it finished its work and starts waiting again.

So regardless of the wakeup order it's in principle the same bandwidth 
utilization, modulo caching artifacts [*] and modulo scheduling 
artifacts [**]:

[*]  Caching artifacts: in that sense Andrew's point stands: given 
     multiple equivalent choices it's more beneficial to pick a thread 
     that was most recently used (and is thus most cache-hot - i.e. 
     the current wakeup behavior), versus a thread that was least 
     recently used (and is thus the most cache-cold - i.e. the 
     round-robin wakeup you introduce).

[**] The hack patch I posted in my previous reply.

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-03-05  9:15               ` Ingo Molnar
@ 2015-03-05 20:24                 ` Jason Baron
  2015-03-07 12:35                 ` Jason Baron
  1 sibling, 0 replies; 16+ messages in thread
From: Jason Baron @ 2015-03-05 20:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, peterz, mingo, viro, normalperson, davidel,
	mtk.manpages, luto, linux-kernel, linux-fsdevel, linux-api,
	Linus Torvalds, Alexander Viro


On 03/05/2015 04:15 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>> 2) We are using the wakeup in this case to 'assign' work more 
>> permanently to the thread. That is, in the case of a listen socket 
>> we then add the connected socket to the woken up threads local set 
>> of epoll events. So the load persists past the wake up. And in this 
>> case, doing the round robin wakeups, simply allows us to access more 
>> cpu bandwidth. (I'm also looking into potentially using cpu affinity 
>> to do the wakeups as well as you suggested.)
> So this is the part that I still don't understand.
>
> What difference does LIFO versus FIFO wakeups make to CPU utilization: 
> a thread waiting for work is idle, no matter whether it ran most 
> recently or least recently.
>
> Once an idle worker thread is woken it will compute its own work, for 
> whatever time it needs to, and won't be bothered by epoll again until 
> it finished its work and starts waiting again.
>
> So regardless the wakeup order it's the same principal bandwidth 
> utilization, modulo caching artifacts [*] and modulo scheduling 
> artifacts [**]:

So just adding to the wakeup source as 'exclusive' would, I think,
give much of the desired behavior, as you point out. In the first
patch posting I separated 'exclusive' from 'rotate' (where rotate
depended on exclusive), since the idle threads will tend to get
assigned the new work rather than the busy threads,
and the workload naturally spreads out (modulo the artifacts
you mentioned).

However, I added the 'rotate' because I'm assigning work via the
wakeup that persists past the wakeup point. So without the rotate
I might end up always assigning a lot of work to, say, the first
thread if it's always idle. And then I might get a large burst of
work queued to it at some later point. The rotate is intended
to address this case.

To use some pseudo-code in hopes of clarifying things, each
thread is roughly doing:

epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd...);

while(1) {

    epoll_wait(epfd...);
    fd = accept(listen_fd...);
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd...);
    ...do any additional desired fd processing...
}

So since the work persists past the wakeup point (after
the 'fd' has been assigned to the epfd set of the local
thread), I am trying to balance out future load.
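
To flesh that loop out a bit (illustrative only: handle_client() is a
placeholder, error handling is minimal, and listen_fd is assumed to be a
listening socket set up by the caller):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <stddef.h>

#define MAX_EVENTS 64

void *worker(void *arg)
{
	int listen_fd = *(int *)arg;
	int epfd = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN };
	struct epoll_event events[MAX_EVENTS];
	int i, n, fd;

	ev.data.fd = listen_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

	while (1) {
		n = epoll_wait(epfd, events, MAX_EVENTS, -1);
		for (i = 0; i < n; i++) {
			if (events[i].data.fd == listen_fd) {
				/* work assignment: the accepted socket is
				 * owned by this thread's epoll set from now
				 * on, so the load persists past the wakeup */
				fd = accept(listen_fd, NULL, NULL);
				if (fd < 0)
					continue;
				ev.data.fd = fd;
				epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
			} else {
				/* handle_client(events[i].data.fd); */
			}
		}
	}
	return NULL;
}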

This is an issue that current userspace has to address in
various ways. In our case, we periodically remove all
epfds from the listen socket and then re-add them in a
different order. Another alternative, suggested by Eric, was
to have one or more dedicated threads do the assignment.
So these approaches can work to an extent, but they seem,
at least to me, to complicate userspace somewhat. And at
least in our case, it's not providing as good balancing as
this approach.

So I am trying to use epoll in a special way to do work
assignment. I think the model is different from the
standard waker/wakee model. So to that end, in this
v3 version, I've attempted to isolate all the changes to
be contained within epoll to reflect that fact.

Thanks,

-Jason


>
> [*]  Caching artifacts: in that sense Andrew's point stands: given 
>      multiple equivalent choices it's more beneficial to pick a thread 
>      that was most recently used (and is thus most cache-hot - i.e. 
>      the current wakeup behavior), versus a thread that was least 
>      recently used (and is thus the most cache-cold - i.e. the 
>      round-robin wakeup you introduce).
>
> [**] The hack patch I posted in my previous reply.
>
> Thanks,
>
> 	Ingo
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode
  2015-03-05  9:15               ` Ingo Molnar
  2015-03-05 20:24                 ` Jason Baron
@ 2015-03-07 12:35                 ` Jason Baron
  1 sibling, 0 replies; 16+ messages in thread
From: Jason Baron @ 2015-03-07 12:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, peterz, mingo, viro, normalperson, davidel,
	mtk.manpages, luto, linux-kernel, linux-fsdevel, linux-api,
	Linus Torvalds, Alexander Viro

On 03/05/2015 04:15 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@akamai.com> wrote:
>
>> 2) We are using the wakeup in this case to 'assign' work more 
>> permanently to the thread. That is, in the case of a listen socket 
>> we then add the connected socket to the woken up threads local set 
>> of epoll events. So the load persists past the wake up. And in this 
>> case, doing the round robin wakeups, simply allows us to access more 
>> cpu bandwidth. (I'm also looking into potentially using cpu affinity 
>> to do the wakeups as well as you suggested.)
> So this is the part that I still don't understand.

Here's maybe another way to frame this. Epoll sets add
a waiter to the wait queue in a fixed order when they
are added (via EPOLL_CTL_ADD). This order does not change,
modulo adds/dels, which are usually not common. So if
we don't want to wake all threads when, say, an interrupt
occurs at some random point, we can either:

1) Walk the list, wake up the first epoll set that has idle
threads (queued via epoll_wait()) and return.

or:

2) Walk the list and wake up the first epoll set that has idle
threads, but then 'rotate' or move this epoll set to the tail
of the queue before returning.

So because the epoll sets are in a fixed order, there is
an extreme bias to pick the same epoll sets over and over,
regardless of the order in which threads return to wait
via epoll_wait(). So I think the rotate makes sense for
the case where I am trying to assign work to threads that
may persist past the wakeup point, and for cases where
the threads can finish all their work before returning
to epoll_wait().
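
A toy user-space model of the two options (purely illustrative; this is not
the kernel code from patch 1/3) shows the bias: if every waiter has returned
to epoll_wait() before the next event arrives, option 1) always picks the head
of the fixed-order queue, while option 2) cycles through all of them:

#include <stdio.h>
#include <string.h>

#define NWAITERS 4

/* option 1: wake the first idle waiter, leave the order alone */
static int wake_first(const int *queue, const int *idle, int n)
{
	for (int i = 0; i < n; i++)
		if (idle[queue[i]])
			return queue[i];
	return -1;
}

/* option 2: wake the first idle waiter, then rotate it to the tail */
static int wake_rotate(int *queue, const int *idle, int n)
{
	for (int i = 0; i < n; i++) {
		if (idle[queue[i]]) {
			int woken = queue[i];

			memmove(&queue[i], &queue[i + 1],
				(n - i - 1) * sizeof(int));
			queue[n - 1] = woken;
			return woken;
		}
	}
	return -1;
}

int main(void)
{
	int queue[NWAITERS] = { 0, 1, 2, 3 };	/* fixed EPOLL_CTL_ADD order */
	int idle[NWAITERS] = { 1, 1, 1, 1 };	/* everyone back in epoll_wait() */

	/* option 1 picks waiter 0 every time ... */
	for (int i = 0; i < NWAITERS; i++)
		printf("first:  %d\n", wake_first(queue, idle, NWAITERS));

	/* ... while option 2 cycles 0, 1, 2, 3 */
	for (int i = 0; i < NWAITERS; i++)
		printf("rotate: %d\n", wake_rotate(queue, idle, NWAITERS));
	return 0;
}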

Thanks,

-Jason







^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread

Thread overview: 16+ messages
2015-02-24 21:25 [PATCH v3 0/3] epoll: introduce round robin wakeup mode Jason Baron
2015-02-24 21:25 ` [PATCH v3 1/3] sched/wait: add __wake_up_rotate() Jason Baron
2015-02-24 21:25 ` [PATCH v3 2/3] epoll: restrict wakeups to the overflow list Jason Baron
2015-02-24 21:25 ` [PATCH v3 3/3] epoll: Add EPOLL_ROTATE mode Jason Baron
2015-02-25  7:38 ` [PATCH v3 0/3] epoll: introduce round robin wakeup mode Ingo Molnar
2015-02-25 16:27   ` Jason Baron
2015-02-27 21:10     ` Andrew Morton
2015-02-27 21:31       ` Jonathan Corbet
2015-03-02  5:04         ` Jason Baron
2015-02-27 22:01       ` Jason Baron
2015-02-27 22:31         ` Andrew Morton
2015-03-05  0:02           ` Ingo Molnar
2015-03-05  3:53             ` Jason Baron
2015-03-05  9:15               ` Ingo Molnar
2015-03-05 20:24                 ` Jason Baron
2015-03-07 12:35                 ` Jason Baron
