* [PATCH 0/8] simplify ep_poll
@ 2020-11-06 23:16 Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 1/8] epoll: check for events when removing a timed out thread from the wait queue Soheil Hassas Yeganeh
                   ` (9 more replies)
  0 siblings, 10 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-06 23:16 UTC (permalink / raw)
  To: torvalds, viro, linux-fsdevel
  Cc: linux-kernel, akpm, dave, edumazet, willemb, khazhy, guantaol,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

This patch series is a follow up based on the suggestions and feedback by Linus:
https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com

The first patch in the series is a fix for the epoll race in the
presence of timeouts, so that it can be cleanly backported to all
affected stable kernels.

The rest of the patch series simplifies the ep_poll() implementation.
Some of these simplifications also yield minor performance improvements.
We have run these changes through selftests and internal benchmarks for
a few days, and they show minor (1-2%) performance improvements.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Soheil Hassas Yeganeh (8):
  epoll: check for events when removing a timed out thread from the wait
    queue
  epoll: simplify signal handling
  epoll: pull fatal signal checks into ep_send_events()
  epoll: move eavail next to the list_empty_careful check
  epoll: simplify and optimize busy loop logic
  epoll: pull all code between fetch_events and send_event into the loop
  epoll: replace gotos with a proper loop
  epoll: eliminate unnecessary lock for zero timeout

 fs/eventpoll.c | 159 +++++++++++++++++++++++++------------------------
 1 file changed, 80 insertions(+), 79 deletions(-)

-- 
2.29.1.341.ge80a0c044ae-goog


* [PATCH 1/8] epoll: check for events when removing a timed out thread from the wait queue
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
@ 2020-11-06 23:16 ` Soheil Hassas Yeganeh
  2020-11-10  6:05   ` Davidlohr Bueso
  2020-11-06 23:16 ` [PATCH 2/8] epoll: simplify signal handling Soheil Hassas Yeganeh
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-06 23:16 UTC (permalink / raw)
  To: torvalds, viro, linux-fsdevel
  Cc: linux-kernel, akpm, dave, edumazet, willemb, khazhy, guantaol,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

After abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2)
timeout"), we break out of the ep_poll loop upon timeout, without checking
whether there are any new events available.  Prior to that patch series, we
always called ep_events_available() after exiting the loop.

This can cause races and missed wakeups.  For example, consider the
following scenario reported by Guantao Liu:

Suppose we have an eventfd added using EPOLLET to an epollfd.

Thread 1: Sleeps for just below 5ms and then writes to an eventfd.
Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an
          event of the eventfd, it will write back on that fd.
Thread 3: Calls epoll_wait with a negative timeout.

Prior to abc610e01c66, it was guaranteed that Thread 3 would be woken up
by either Thread 1 or Thread 2.  After abc610e01c66, Thread 3 can be blocked
indefinitely if Thread 2 sees a timeout right before the write to the
eventfd by Thread 1.  Thread 2 will be woken up from
schedule_hrtimeout_range and, with eavail 0, it will not call
ep_send_events().
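
To make the scenario concrete, below is a minimal userspace sketch of it
(illustrative only: this is not the epoll61 selftest, and the timings,
names, and structure are assumptions; in practice it has to run
repeatedly, under load, to hit the window):

/*
 * Hedged sketch of the reported race.  Build with: gcc -pthread.
 */
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

static int epfd, efd;

static void *t1_write(void *arg)	/* Thread 1: sleep ~5ms, then write */
{
	uint64_t v = 1;

	(void)arg;
	usleep(4900);
	write(efd, &v, sizeof(v));
	return NULL;
}

static void *t2_timeout(void *arg)	/* Thread 2: 5ms timeout; echo back */
{
	struct epoll_event ev;
	uint64_t v = 1;

	(void)arg;
	if (epoll_wait(epfd, &ev, 1, 5) > 0)
		write(efd, &v, sizeof(v));
	return NULL;
}

static void *t3_block(void *arg)	/* Thread 3: negative timeout */
{
	struct epoll_event ev;

	(void)arg;
	epoll_wait(epfd, &ev, 1, -1);
	return NULL;
}

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
	pthread_t t1, t2, t3;

	epfd = epoll_create1(0);
	efd = eventfd(0, 0);
	ev.data.fd = efd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

	pthread_create(&t3, NULL, t3_block, NULL);
	pthread_create(&t2, NULL, t2_timeout, NULL);
	pthread_create(&t1, NULL, t1_write, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	pthread_join(t3, NULL);		/* hangs here when the race is hit */
	return 0;
}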

To fix this issue:
1) Simplify the timed_out case as suggested by Linus.
2) While holding the lock, recheck whether the thread was woken up
   after its timeout expired.

Note that (2) is different from Linus' original suggestion: it does not
set "eavail = ep_events_available(ep)" to avoid unnecessary contention
(when there are too many timed-out threads and a small number of events),
as well as races mentioned in the discussion thread.

This is the first patch in the series so that the backport to stable
releases is straightforward.

Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
Fixes: abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout")
Tested-by: Guantao Liu <guantaol@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reported-by: Guantao Liu <guantaol@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/eventpoll.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4df61129566d..117b1c395ae4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1902,23 +1902,30 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		}
 		write_unlock_irq(&ep->lock);
 
-		if (eavail || res)
-			break;
+		if (!eavail && !res)
+			timed_out = !schedule_hrtimeout_range(to, slack,
+							      HRTIMER_MODE_ABS);
 
-		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
-			timed_out = 1;
-			break;
-		}
-
-		/* We were woken up, thus go and try to harvest some events */
+		/*
+		 * We were woken up, thus go and try to harvest some events.
+		 * If timed out and still on the wait queue, recheck eavail
+		 * carefully under lock, below.
+		 */
 		eavail = 1;
-
 	} while (0);
 
 	__set_current_state(TASK_RUNNING);
 
 	if (!list_empty_careful(&wait.entry)) {
 		write_lock_irq(&ep->lock);
+		/*
+		 * If the thread timed out and is not on the wait queue, it
+		 * means that the thread was woken up after its timeout expired
+		 * before it could reacquire the lock. Thus, when wait.entry is
+		 * empty, it needs to harvest events.
+		 */
+		if (timed_out)
+			eavail = list_empty(&wait.entry);
 		__remove_wait_queue(&ep->wq, &wait);
 		write_unlock_irq(&ep->lock);
 	}
-- 
2.29.1.341.ge80a0c044ae-goog


* [PATCH 2/8] epoll: simplify signal handling
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 1/8] epoll: check for events when removing a timed out thread from the wait queue Soheil Hassas Yeganeh
@ 2020-11-06 23:16 ` Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 3/8] epoll: pull fatal signal checks into ep_send_events() Soheil Hassas Yeganeh
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-06 23:16 UTC (permalink / raw)
  To: torvalds, viro, linux-fsdevel
  Cc: linux-kernel, akpm, dave, edumazet, willemb, khazhy, guantaol,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

Check signals before locking ep->lock, and immediately return
-EINTR if there is any signal pending.

This saves a few loads, stores, and branches from the hot path
and simplifies the loop structure for follow-up patches.
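
As a reminder of the user-visible contract this keeps intact, here is a
small hedged sketch (illustrative only, not from the patch; the eventfd
and SIGALRM are arbitrary choices): epoll_wait() fails with EINTR when a
non-fatal signal is pending or arrives during the wait.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

static void on_alrm(int sig)		/* empty handler: just interrupt */
{
	(void)sig;
}

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN }, out;
	struct sigaction sa;
	int epfd, efd, n;

	memset(&sa, 0, sizeof(sa));	/* no SA_RESTART: the wait is interrupted */
	sa.sa_handler = on_alrm;
	sigaction(SIGALRM, &sa, NULL);

	epfd = epoll_create1(0);
	efd = eventfd(0, 0);
	ev.data.fd = efd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

	alarm(1);			/* the signal arrives mid-wait */
	n = epoll_wait(epfd, &out, 1, 5000);
	if (n < 0 && errno == EINTR)
		printf("epoll_wait() returned EINTR, as expected\n");
	return 0;
}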

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/eventpoll.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 117b1c395ae4..80c560dad6a3 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1818,7 +1818,7 @@ static inline struct timespec64 ep_set_mstimeout(long ms)
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		   int maxevents, long timeout)
 {
-	int res = 0, eavail, timed_out = 0;
+	int res, eavail, timed_out = 0;
 	u64 slack = 0;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
@@ -1865,6 +1865,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	ep_reset_busy_poll_napi_id(ep);
 
 	do {
+		if (signal_pending(current))
+			return -EINTR;
+
 		/*
 		 * Internally init_wait() uses autoremove_wake_function(),
 		 * thus wait entry is removed from the wait queue on each
@@ -1894,15 +1897,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * important.
 		 */
 		eavail = ep_events_available(ep);
-		if (!eavail) {
-			if (signal_pending(current))
-				res = -EINTR;
-			else
-				__add_wait_queue_exclusive(&ep->wq, &wait);
-		}
+		if (!eavail)
+			__add_wait_queue_exclusive(&ep->wq, &wait);
+
 		write_unlock_irq(&ep->lock);
 
-		if (!eavail && !res)
+		if (!eavail)
 			timed_out = !schedule_hrtimeout_range(to, slack,
 							      HRTIMER_MODE_ABS);
 
@@ -1938,14 +1938,14 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * finding more events available and fetching
 		 * repeatedly.
 		 */
-		res = -EINTR;
+		return -EINTR;
 	}
 	/*
 	 * Try to transfer events to user space. In case we get 0 events and
 	 * there's still timeout left over, we go trying again in search of
 	 * more luck.
 	 */
-	if (!res && eavail &&
+	if (eavail &&
 	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
 		goto fetch_events;
 
-- 
2.29.1.341.ge80a0c044ae-goog


* [PATCH 3/8] epoll: pull fatal signal checks into ep_send_events()
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 1/8] epoll: check for events when removing a timed out thread from the wait queue Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 2/8] epoll: simplify signal handling Soheil Hassas Yeganeh
@ 2020-11-06 23:16 ` Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 4/8] epoll: move eavail next to the list_empty_careful check Soheil Hassas Yeganeh
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-06 23:16 UTC (permalink / raw)
  To: torvalds, viro, linux-fsdevel
  Cc: linux-kernel, akpm, dave, edumazet, willemb, khazhy, guantaol,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

To simplify the code, pull the fatal signal check into
ep_send_events().  ep_send_events() is called only from ep_poll().

Note that, previously, we always checked for fatal signals; now the
check happens only if eavail is true.  This should be fine because
the goal of that check is to quickly return from epoll_wait() when
there is a pending fatal signal.

Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/eventpoll.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 80c560dad6a3..ed9deeab2488 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1780,6 +1780,14 @@ static int ep_send_events(struct eventpoll *ep,
 {
 	struct ep_send_events_data esed;
 
+	/*
+	 * Always short-circuit for fatal signals to allow threads to make a
+	 * timely exit without the chance of finding more events available and
+	 * fetching repeatedly.
+	 */
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
 	esed.maxevents = maxevents;
 	esed.events = events;
 
@@ -1931,15 +1939,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	}
 
 send_events:
-	if (fatal_signal_pending(current)) {
-		/*
-		 * Always short-circuit for fatal signals to allow
-		 * threads to make a timely exit without the chance of
-		 * finding more events available and fetching
-		 * repeatedly.
-		 */
-		return -EINTR;
-	}
 	/*
 	 * Try to transfer events to user space. In case we get 0 events and
 	 * there's still timeout left over, we go trying again in search of
-- 
2.29.1.341.ge80a0c044ae-goog


* [PATCH 4/8] epoll: move eavail next to the list_empty_careful check
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
                   ` (2 preceding siblings ...)
  2020-11-06 23:16 ` [PATCH 3/8] epoll: pull fatal signal checks into ep_send_events() Soheil Hassas Yeganeh
@ 2020-11-06 23:16 ` Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 5/8] epoll: simplify and optimize busy loop logic Soheil Hassas Yeganeh
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-06 23:16 UTC (permalink / raw)
  To: torvalds, viro, linux-fsdevel
  Cc: linux-kernel, akpm, dave, edumazet, willemb, khazhy, guantaol,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

This is a no-op change and simply to make the code more coherent.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/eventpoll.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ed9deeab2488..5226b0cb1098 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1913,6 +1913,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		if (!eavail)
 			timed_out = !schedule_hrtimeout_range(to, slack,
 							      HRTIMER_MODE_ABS);
+		__set_current_state(TASK_RUNNING);
 
 		/*
 		 * We were woken up, thus go and try to harvest some events.
@@ -1922,8 +1923,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		eavail = 1;
 	} while (0);
 
-	__set_current_state(TASK_RUNNING);
-
 	if (!list_empty_careful(&wait.entry)) {
 		write_lock_irq(&ep->lock);
 		/*
-- 
2.29.1.341.ge80a0c044ae-goog


* [PATCH 5/8] epoll: simplify and optimize busy loop logic
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
                   ` (3 preceding siblings ...)
  2020-11-06 23:16 ` [PATCH 4/8] epoll: move eavail next to the list_empty_careful check Soheil Hassas Yeganeh
@ 2020-11-06 23:16 ` Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 6/8] epoll: pull all code between fetch_events and send_event into the loop Soheil Hassas Yeganeh
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-06 23:16 UTC (permalink / raw)
  To: torvalds, viro, linux-fsdevel
  Cc: linux-kernel, akpm, dave, edumazet, willemb, khazhy, guantaol,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

ep_events_available() is called multiple times around the busy-loop
logic, even though that logic is generally not used.
ep_reset_busy_poll_napi_id() is similarly always called, even when
the busy loop is not used.

Eliminate ep_reset_busy_poll_napi_id() and inline it inside
ep_busy_loop(). Make ep_busy_loop() return whether there are any
events available after the busy loop.  This eliminates unnecessary
loads and branches, and simplifies the loop.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/eventpoll.c | 42 +++++++++++++++++++-----------------------
 1 file changed, 19 insertions(+), 23 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 5226b0cb1098..28c1d341d2e6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -391,19 +391,26 @@ static bool ep_busy_loop_end(void *p, unsigned long start_time)
  * busy loop will return if need_resched or ep_events_available.
  *
  * we must do our busy polling with irqs enabled
+ *
+ * Returns whether new events are available after a successful busy loop.
  */
-static void ep_busy_loop(struct eventpoll *ep, int nonblock)
+static bool ep_busy_loop(struct eventpoll *ep, int nonblock)
 {
 	unsigned int napi_id = READ_ONCE(ep->napi_id);
 
-	if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on())
+	if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
 		napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end, ep);
-}
-
-static inline void ep_reset_busy_poll_napi_id(struct eventpoll *ep)
-{
-	if (ep->napi_id)
+		if (ep_events_available(ep))
+			return true;
+		/*
+		 * Busy poll timed out.  Drop NAPI ID for now, we can add
+		 * it back in when we have moved a socket with a valid NAPI
+		 * ID onto the ready list.
+		 */
 		ep->napi_id = 0;
+		return false;
+	}
+	return false;
 }
 
 /*
@@ -444,12 +451,9 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
 
 #else
 
-static inline void ep_busy_loop(struct eventpoll *ep, int nonblock)
-{
-}
-
-static inline void ep_reset_busy_poll_napi_id(struct eventpoll *ep)
+static inline bool ep_busy_loop(struct eventpoll *ep, int nonblock)
 {
+	return false;
 }
 
 static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
@@ -1857,21 +1861,13 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	}
 
 fetch_events:
-
-	if (!ep_events_available(ep))
-		ep_busy_loop(ep, timed_out);
-
 	eavail = ep_events_available(ep);
+	if (!eavail)
+		eavail = ep_busy_loop(ep, timed_out);
+
 	if (eavail)
 		goto send_events;
 
-	/*
-	 * Busy poll timed out.  Drop NAPI ID for now, we can add
-	 * it back in when we have moved a socket with a valid NAPI
-	 * ID onto the ready list.
-	 */
-	ep_reset_busy_poll_napi_id(ep);
-
 	do {
 		if (signal_pending(current))
 			return -EINTR;
-- 
2.29.1.341.ge80a0c044ae-goog


* [PATCH 6/8] epoll: pull all code between fetch_events and send_event into the loop
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
                   ` (4 preceding siblings ...)
  2020-11-06 23:16 ` [PATCH 5/8] epoll: simplify and optimize busy loop logic Soheil Hassas Yeganeh
@ 2020-11-06 23:16 ` Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 7/8] epoll: replace gotos with a proper loop Soheil Hassas Yeganeh
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-06 23:16 UTC (permalink / raw)
  To: torvalds, viro, linux-fsdevel
  Cc: linux-kernel, akpm, dave, edumazet, willemb, khazhy, guantaol,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

This is a no-op change which simplifies the follow up patches.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/eventpoll.c | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 28c1d341d2e6..b29bbebe8ca4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1861,14 +1861,14 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	}
 
 fetch_events:
-	eavail = ep_events_available(ep);
-	if (!eavail)
-		eavail = ep_busy_loop(ep, timed_out);
+	do {
+		eavail = ep_events_available(ep);
+		if (!eavail)
+			eavail = ep_busy_loop(ep, timed_out);
 
-	if (eavail)
-		goto send_events;
+		if (eavail)
+			goto send_events;
 
-	do {
 		if (signal_pending(current))
 			return -EINTR;
 
@@ -1917,21 +1917,22 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * carefully under lock, below.
 		 */
 		eavail = 1;
-	} while (0);
 
-	if (!list_empty_careful(&wait.entry)) {
-		write_lock_irq(&ep->lock);
-		/*
-		 * If the thread timed out and is not on the wait queue, it
-		 * means that the thread was woken up after its timeout expired
-		 * before it could reacquire the lock. Thus, when wait.entry is
-		 * empty, it needs to harvest events.
-		 */
-		if (timed_out)
-			eavail = list_empty(&wait.entry);
-		__remove_wait_queue(&ep->wq, &wait);
-		write_unlock_irq(&ep->lock);
-	}
+		if (!list_empty_careful(&wait.entry)) {
+			write_lock_irq(&ep->lock);
+			/*
+			 * If the thread timed out and is not on the wait queue,
+			 * it means that the thread was woken up after its
+			 * timeout expired before it could reacquire the lock.
+			 * Thus, when wait.entry is empty, it needs to harvest
+			 * events.
+			 */
+			if (timed_out)
+				eavail = list_empty(&wait.entry);
+			__remove_wait_queue(&ep->wq, &wait);
+			write_unlock_irq(&ep->lock);
+		}
+	} while (0);
 
 send_events:
 	/*
-- 
2.29.1.341.ge80a0c044ae-goog


* [PATCH 7/8] epoll: replace gotos with a proper loop
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
                   ` (5 preceding siblings ...)
  2020-11-06 23:16 ` [PATCH 6/8] epoll: pull all code between fetch_events and send_event into the loop Soheil Hassas Yeganeh
@ 2020-11-06 23:16 ` Soheil Hassas Yeganeh
  2020-11-06 23:16 ` [PATCH 8/8] epoll: eliminate unnecessary lock for zero timeout Soheil Hassas Yeganeh
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-06 23:16 UTC (permalink / raw)
  To: torvalds, viro, linux-fsdevel
  Cc: linux-kernel, akpm, dave, edumazet, willemb, khazhy, guantaol,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

The existing loop is pointless, and the labels make it really
hard to follow the structure.

Replace that control structure with a simple loop that returns
when there are new events, there is a signal, or the thread has
timed out.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/eventpoll.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index b29bbebe8ca4..f4e1be7ada26 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1830,7 +1830,7 @@ static inline struct timespec64 ep_set_mstimeout(long ms)
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		   int maxevents, long timeout)
 {
-	int res, eavail, timed_out = 0;
+	int res, eavail = 0, timed_out = 0;
 	u64 slack = 0;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
@@ -1856,18 +1856,30 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		write_lock_irq(&ep->lock);
 		eavail = ep_events_available(ep);
 		write_unlock_irq(&ep->lock);
-
-		goto send_events;
 	}
 
-fetch_events:
-	do {
+	while (1) {
+		if (eavail) {
+			/*
+			 * Try to transfer events to user space. In case we get
+			 * 0 events and there's still timeout left over, we go
+			 * trying again in search of more luck.
+			 */
+			res = ep_send_events(ep, events, maxevents);
+			if (res)
+				return res;
+		}
+
+		if (timed_out)
+			return 0;
+
 		eavail = ep_events_available(ep);
-		if (!eavail)
-			eavail = ep_busy_loop(ep, timed_out);
+		if (eavail)
+			continue;
 
+		eavail = ep_busy_loop(ep, timed_out);
 		if (eavail)
-			goto send_events;
+			continue;
 
 		if (signal_pending(current))
 			return -EINTR;
@@ -1932,19 +1944,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 			__remove_wait_queue(&ep->wq, &wait);
 			write_unlock_irq(&ep->lock);
 		}
-	} while (0);
-
-send_events:
-	/*
-	 * Try to transfer events to user space. In case we get 0 events and
-	 * there's still timeout left over, we go trying again in search of
-	 * more luck.
-	 */
-	if (eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
-		goto fetch_events;
-
-	return res;
+	}
 }
 
 /**
-- 
2.29.1.341.ge80a0c044ae-goog


* [PATCH 8/8] epoll: eliminate unnecessary lock for zero timeout
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
                   ` (6 preceding siblings ...)
  2020-11-06 23:16 ` [PATCH 7/8] epoll: replace gotos with a proper loop Soheil Hassas Yeganeh
@ 2020-11-06 23:16 ` Soheil Hassas Yeganeh
  2020-11-06 23:35 ` [PATCH 0/8] simplify ep_poll Linus Torvalds
  2020-11-08  1:43 ` Andrew Morton
  9 siblings, 0 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-06 23:16 UTC (permalink / raw)
  To: torvalds, viro, linux-fsdevel
  Cc: linux-kernel, akpm, dave, edumazet, willemb, khazhy, guantaol,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

We call ep_events_available() under lock when timeout is 0,
and then call it without locks in the loop for the other cases.

Instead, call ep_events_available() without lock for all cases.
For non-zero timeouts, we will recheck after adding the thread to
the wait queue. For zero-timeout cases, by definition, the user is
opportunistically polling and will have to call epoll_wait() again
in the future.

Note that this lock was kept in c5a282e9635e9 because the whole
loop was historically under lock.

This patch results in a 1% CPU/RPC reduction in RPC benchmarks.
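
A small hedged sketch of the zero-timeout semantics relied on here
(illustrative only, not from the patch): a timeout of 0 is an
instantaneous check, and a caller that sees no events is expected to
simply poll again later.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN }, out;
	uint64_t v = 1;
	int epfd, efd, n;

	epfd = epoll_create1(0);
	efd = eventfd(0, 0);
	ev.data.fd = efd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

	n = epoll_wait(epfd, &out, 1, 0);	/* non-blocking check */
	printf("first check:  %d event(s)\n", n);	/* 0: nothing written yet */

	write(efd, &v, sizeof(v));
	n = epoll_wait(epfd, &out, 1, 0);	/* opportunistic re-poll */
	printf("second check: %d event(s)\n", n);	/* 1: eventfd is readable */
	return 0;
}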

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/eventpoll.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index f4e1be7ada26..1aa23b0be72b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1830,7 +1830,7 @@ static inline struct timespec64 ep_set_mstimeout(long ms)
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		   int maxevents, long timeout)
 {
-	int res, eavail = 0, timed_out = 0;
+	int res, eavail, timed_out = 0;
 	u64 slack = 0;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;
@@ -1846,18 +1846,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	} else if (timeout == 0) {
 		/*
 		 * Avoid the unnecessary trip to the wait queue loop, if the
-		 * caller specified a non blocking operation. We still need
-		 * lock because we could race and not see an epi being added
-		 * to the ready list while in irq callback. Thus incorrectly
-		 * returning 0 back to userspace.
+		 * caller specified a non blocking operation.
 		 */
 		timed_out = 1;
-
-		write_lock_irq(&ep->lock);
-		eavail = ep_events_available(ep);
-		write_unlock_irq(&ep->lock);
 	}
 
+	/*
+	 * This call is racy: We may or may not see events that are being added
+	 * to the ready list under the lock (e.g., in IRQ callbacks). For cases
+	 * with a non-zero timeout, this thread will check the ready list under
+	 * lock and will be added to the wait queue.  For cases with a zero
+	 * timeout, the user by definition should not care and will have to
+	 * recheck again.
+	 */
+	eavail = ep_events_available(ep);
+
 	while (1) {
 		if (eavail) {
 			/*
@@ -1873,10 +1876,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		if (timed_out)
 			return 0;
 
-		eavail = ep_events_available(ep);
-		if (eavail)
-			continue;
-
 		eavail = ep_busy_loop(ep, timed_out);
 		if (eavail)
 			continue;
-- 
2.29.1.341.ge80a0c044ae-goog


* Re: [PATCH 0/8] simplify ep_poll
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
                   ` (7 preceding siblings ...)
  2020-11-06 23:16 ` [PATCH 8/8] epoll: eliminate unnecessary lock for zero timeout Soheil Hassas Yeganeh
@ 2020-11-06 23:35 ` Linus Torvalds
  2020-11-08  1:43 ` Andrew Morton
  9 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2020-11-06 23:35 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List, Andrew Morton,
	Davidlohr Bueso, Eric Dumazet, Willem de Bruijn,
	Khazhismel Kumykov, Guantao Liu, Soheil Hassas Yeganeh

On Fri, Nov 6, 2020 at 3:17 PM Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
>
> The first patch in the series is a fix for the epoll race in the
> presence of timeouts, so that it can be cleanly backported to all
> affected stable kernels.
>
> The rest of the patch series simplifies the ep_poll() implementation.
> Some of these simplifications also yield minor performance improvements.
> We have run these changes through selftests and internal benchmarks for
> a few days, and they show minor (1-2%) performance improvements.

From just looking at the patches (not the end result - I didn't
actually apply them), it looks sane to me.

             Linus

* Re: [PATCH 0/8] simplify ep_poll
  2020-11-06 23:16 [PATCH 0/8] simplify ep_poll Soheil Hassas Yeganeh
                   ` (8 preceding siblings ...)
  2020-11-06 23:35 ` [PATCH 0/8] simplify ep_poll Linus Torvalds
@ 2020-11-08  1:43 ` Andrew Morton
  2020-11-08  4:45   ` Soheil Hassas Yeganeh
  9 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2020-11-08  1:43 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: torvalds, viro, linux-fsdevel, linux-kernel, dave, edumazet,
	willemb, khazhy, guantaol, Soheil Hassas Yeganeh

On Fri,  6 Nov 2020 18:16:27 -0500 Soheil Hassas Yeganeh <soheil.kdev@gmail.com> wrote:

> From: Soheil Hassas Yeganeh <soheil@google.com>
> 
> This patch series is a follow up based on the suggestions and feedback by Linus:
> https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com

Al Viro has been playing in here as well - see the below, from
linux-next.

I think I'll leave it to you folks to sort this out, please.


commit 57804b1cc4616780c72a2d0930d1bd0d5bd3ed4c
Author:     Al Viro <viro@zeniv.linux.org.uk>
AuthorDate: Mon Aug 31 13:41:30 2020 -0400
Commit:     Al Viro <viro@zeniv.linux.org.uk>
CommitDate: Sun Oct 25 20:02:01 2020 -0400

    lift locking/unlocking ep->mtx out of ep_{start,done}_scan()
    
    get rid of depth/ep_locked arguments there and document
    the kludge in ep_item_poll() that has lead to ep_locked existence in
    the first place
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ac996b959e94..f9c567af1f5f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -554,20 +554,13 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
 	rcu_read_unlock();
 }
 
-static void ep_start_scan(struct eventpoll *ep,
-			  int depth, bool ep_locked,
-			  struct list_head *txlist)
-{
-	lockdep_assert_irqs_enabled();
-
-	/*
-	 * We need to lock this because we could be hit by
-	 * eventpoll_release_file() and epoll_ctl().
-	 */
-
-	if (!ep_locked)
-		mutex_lock_nested(&ep->mtx, depth);
 
+/*
+ * ep->mutex needs to be held because we could be hit by
+ * eventpoll_release_file() and epoll_ctl().
+ */
+static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
+{
 	/*
 	 * Steal the ready list, and re-init the original one to the
 	 * empty list. Also, set ep->ovflist to NULL so that events
@@ -576,6 +569,7 @@ static void ep_start_scan(struct eventpoll *ep,
 	 * because we want the "sproc" callback to be able to do it
 	 * in a lockless way.
 	 */
+	lockdep_assert_irqs_enabled();
 	write_lock_irq(&ep->lock);
 	list_splice_init(&ep->rdllist, txlist);
 	WRITE_ONCE(ep->ovflist, NULL);
@@ -583,7 +577,6 @@ static void ep_start_scan(struct eventpoll *ep,
 }
 
 static void ep_done_scan(struct eventpoll *ep,
-			 int depth, bool ep_locked,
 			 struct list_head *txlist)
 {
 	struct epitem *epi, *nepi;
@@ -624,9 +617,6 @@ static void ep_done_scan(struct eventpoll *ep,
 	list_splice(txlist, &ep->rdllist);
 	__pm_relax(ep->ws);
 	write_unlock_irq(&ep->lock);
-
-	if (!ep_locked)
-		mutex_unlock(&ep->mtx);
 }
 
 static void epi_rcu_free(struct rcu_head *head)
@@ -763,11 +753,16 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 
 	ep = epi->ffd.file->private_data;
 	poll_wait(epi->ffd.file, &ep->poll_wait, pt);
-	locked = pt && (pt->_qproc == ep_ptable_queue_proc);
 
-	ep_start_scan(ep, depth, locked, &txlist);
+	// kludge: ep_insert() calls us with ep->mtx already locked
+	locked = pt && (pt->_qproc == ep_ptable_queue_proc);
+	if (!locked)
+		mutex_lock_nested(&ep->mtx, depth);
+	ep_start_scan(ep, &txlist);
 	res = ep_read_events_proc(ep, &txlist, depth + 1);
-	ep_done_scan(ep, depth, locked, &txlist);
+	ep_done_scan(ep, &txlist);
+	if (!locked)
+		mutex_unlock(&ep->mtx);
 	return res & epi->event.events;
 }
 
@@ -809,9 +804,11 @@ static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait)
 	 * Proceed to find out if wanted events are really available inside
 	 * the ready list.
 	 */
-	ep_start_scan(ep, 0, false, &txlist);
+	mutex_lock(&ep->mtx);
+	ep_start_scan(ep, &txlist);
 	res = ep_read_events_proc(ep, &txlist, 1);
-	ep_done_scan(ep, 0, false, &txlist);
+	ep_done_scan(ep, &txlist);
+	mutex_unlock(&ep->mtx);
 	return res;
 }
 
@@ -1573,15 +1570,13 @@ static int ep_send_events(struct eventpoll *ep,
 
 	init_poll_funcptr(&pt, NULL);
 
-	ep_start_scan(ep, 0, false, &txlist);
+	mutex_lock(&ep->mtx);
+	ep_start_scan(ep, &txlist);
 
 	/*
 	 * We can loop without lock because we are passed a task private list.
-	 * Items cannot vanish during the loop because ep_scan_ready_list() is
-	 * holding "mtx" during this call.
+	 * Items cannot vanish during the loop we are holding ep->mtx.
 	 */
-	lockdep_assert_held(&ep->mtx);
-
 	list_for_each_entry_safe(epi, tmp, &txlist, rdllink) {
 		struct wakeup_source *ws;
 		__poll_t revents;
@@ -1609,9 +1604,8 @@ static int ep_send_events(struct eventpoll *ep,
 
 		/*
 		 * If the event mask intersect the caller-requested one,
-		 * deliver the event to userspace. Again, ep_scan_ready_list()
-		 * is holding ep->mtx, so no operations coming from userspace
-		 * can change the item.
+		 * deliver the event to userspace. Again, we are holding ep->mtx,
+		 * so no operations coming from userspace can change the item.
 		 */
 		revents = ep_item_poll(epi, &pt, 1);
 		if (!revents)
@@ -1645,7 +1639,8 @@ static int ep_send_events(struct eventpoll *ep,
 			ep_pm_stay_awake(epi);
 		}
 	}
-	ep_done_scan(ep, 0, false, &txlist);
+	ep_done_scan(ep, &txlist);
+	mutex_unlock(&ep->mtx);
 
 	return res;
 }


* Re: [PATCH 0/8] simplify ep_poll
  2020-11-08  1:43 ` Andrew Morton
@ 2020-11-08  4:45   ` Soheil Hassas Yeganeh
  2020-11-09 18:59     ` Davidlohr Bueso
  2020-11-10 22:05     ` Andrew Morton
  0 siblings, 2 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-08  4:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Al Viro, linux-fsdevel, linux-kernel,
	Davidlohr Bueso, Eric Dumazet, Willem de Bruijn,
	Khazhismel Kumykov, Guantao Liu

On Sat, Nov 7, 2020 at 8:43 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri,  6 Nov 2020 18:16:27 -0500 Soheil Hassas Yeganeh <soheil.kdev@gmail.com> wrote:
>
> > From: Soheil Hassas Yeganeh <soheil@google.com>
> >
> > This patch series is a follow up based on the suggestions and feedback by Linus:
> > https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
>
> Al Viro has been playing in here as well - see the below, from
> linux-next.
>
> I think I'll leave it to you folks to sort this out, please.

Thank you Andrew for pointing that out!  Sorry that I didn't notice Al
Viro's nice clean ups.

The changes are all orthogonal and apply cleanly except "epoll: pull
fatal signal checks into ep_send_events()".   The conflict is trivial
and the following patch should cleanly apply to linux-next/master (I
didn't move the initialization of `res = 0` after the new branch to
keep it simple).

FWIW, I also stress-tested the patch series applied on top of
linux-next/master for a couple of hours.

Could you please let me know whether I should send a V2 of the patch
series for linux-next? Thanks!

Subject: [PATCH linux-next] epoll: pull fatal signal checks into ep_send_events()

To simplify the code, pull the fatal signal check into
ep_send_events().  ep_send_events() is called only from ep_poll().

Note that, previously, we always checked for fatal signals; now the
check happens only if eavail is true.  This should be fine because
the goal of that check is to quickly return from epoll_wait() when
there is a pending fatal signal.

Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Change-Id: I68bbaf02db564e64815bcd8ed3c1cd272cfe832f
---
 fs/eventpoll.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 06fb0de4bcc7..42f6bfc5f24e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1625,6 +1625,14 @@ static int ep_send_events(struct eventpoll *ep,
        poll_table pt;
        int res = 0;

+       /*
+        * Always short-circuit for fatal signals to allow threads to make a
+        * timely exit without the chance of finding more events available and
+        * fetching repeatedly.
+        */
+       if (fatal_signal_pending(current))
+               return -EINTR;
+
        init_poll_funcptr(&pt, NULL);

        mutex_lock(&ep->mtx);
@@ -1846,15 +1854,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
        }

 send_events:
-       if (fatal_signal_pending(current)) {
-               /*
-                * Always short-circuit for fatal signals to allow
-                * threads to make a timely exit without the chance of
-                * finding more events available and fetching
-                * repeatedly.
-                */
-               return -EINTR;
-       }
        /*
         * Try to transfer events to user space. In case we get 0 events and
         * there's still timeout left over, we go trying again in search of

--
2.29.2.222.g5d2a92d10f8-goog




> commit 57804b1cc4616780c72a2d0930d1bd0d5bd3ed4c
> Author:     Al Viro <viro@zeniv.linux.org.uk>
> AuthorDate: Mon Aug 31 13:41:30 2020 -0400
> Commit:     Al Viro <viro@zeniv.linux.org.uk>
> CommitDate: Sun Oct 25 20:02:01 2020 -0400
>
>     lift locking/unlocking ep->mtx out of ep_{start,done}_scan()
>
>     get rid of depth/ep_locked arguments there and document
>     the kludge in ep_item_poll() that has lead to ep_locked existence in
>     the first place
>
>     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index ac996b959e94..f9c567af1f5f 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -554,20 +554,13 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
>         rcu_read_unlock();
>  }
>
> -static void ep_start_scan(struct eventpoll *ep,
> -                         int depth, bool ep_locked,
> -                         struct list_head *txlist)
> -{
> -       lockdep_assert_irqs_enabled();
> -
> -       /*
> -        * We need to lock this because we could be hit by
> -        * eventpoll_release_file() and epoll_ctl().
> -        */
> -
> -       if (!ep_locked)
> -               mutex_lock_nested(&ep->mtx, depth);
>
> +/*
> + * ep->mutex needs to be held because we could be hit by
> + * eventpoll_release_file() and epoll_ctl().
> + */
> +static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
> +{
>         /*
>          * Steal the ready list, and re-init the original one to the
>          * empty list. Also, set ep->ovflist to NULL so that events
> @@ -576,6 +569,7 @@ static void ep_start_scan(struct eventpoll *ep,
>          * because we want the "sproc" callback to be able to do it
>          * in a lockless way.
>          */
> +       lockdep_assert_irqs_enabled();
>         write_lock_irq(&ep->lock);
>         list_splice_init(&ep->rdllist, txlist);
>         WRITE_ONCE(ep->ovflist, NULL);
> @@ -583,7 +577,6 @@ static void ep_start_scan(struct eventpoll *ep,
>  }
>
>  static void ep_done_scan(struct eventpoll *ep,
> -                        int depth, bool ep_locked,
>                          struct list_head *txlist)
>  {
>         struct epitem *epi, *nepi;
> @@ -624,9 +617,6 @@ static void ep_done_scan(struct eventpoll *ep,
>         list_splice(txlist, &ep->rdllist);
>         __pm_relax(ep->ws);
>         write_unlock_irq(&ep->lock);
> -
> -       if (!ep_locked)
> -               mutex_unlock(&ep->mtx);
>  }
>
>  static void epi_rcu_free(struct rcu_head *head)
> @@ -763,11 +753,16 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
>
>         ep = epi->ffd.file->private_data;
>         poll_wait(epi->ffd.file, &ep->poll_wait, pt);
> -       locked = pt && (pt->_qproc == ep_ptable_queue_proc);
>
> -       ep_start_scan(ep, depth, locked, &txlist);
> +       // kludge: ep_insert() calls us with ep->mtx already locked
> +       locked = pt && (pt->_qproc == ep_ptable_queue_proc);
> +       if (!locked)
> +               mutex_lock_nested(&ep->mtx, depth);
> +       ep_start_scan(ep, &txlist);
>         res = ep_read_events_proc(ep, &txlist, depth + 1);
> -       ep_done_scan(ep, depth, locked, &txlist);
> +       ep_done_scan(ep, &txlist);
> +       if (!locked)
> +               mutex_unlock(&ep->mtx);
>         return res & epi->event.events;
>  }
>
> @@ -809,9 +804,11 @@ static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait)
>          * Proceed to find out if wanted events are really available inside
>          * the ready list.
>          */
> -       ep_start_scan(ep, 0, false, &txlist);
> +       mutex_lock(&ep->mtx);
> +       ep_start_scan(ep, &txlist);
>         res = ep_read_events_proc(ep, &txlist, 1);
> -       ep_done_scan(ep, 0, false, &txlist);
> +       ep_done_scan(ep, &txlist);
> +       mutex_unlock(&ep->mtx);
>         return res;
>  }
>
> @@ -1573,15 +1570,13 @@ static int ep_send_events(struct eventpoll *ep,
>
>         init_poll_funcptr(&pt, NULL);
>
> -       ep_start_scan(ep, 0, false, &txlist);
> +       mutex_lock(&ep->mtx);
> +       ep_start_scan(ep, &txlist);
>
>         /*
>          * We can loop without lock because we are passed a task private list.
> -        * Items cannot vanish during the loop because ep_scan_ready_list() is
> -        * holding "mtx" during this call.
> +        * Items cannot vanish during the loop we are holding ep->mtx.
>          */
> -       lockdep_assert_held(&ep->mtx);
> -
>         list_for_each_entry_safe(epi, tmp, &txlist, rdllink) {
>                 struct wakeup_source *ws;
>                 __poll_t revents;
> @@ -1609,9 +1604,8 @@ static int ep_send_events(struct eventpoll *ep,
>
>                 /*
>                  * If the event mask intersect the caller-requested one,
> -                * deliver the event to userspace. Again, ep_scan_ready_list()
> -                * is holding ep->mtx, so no operations coming from userspace
> -                * can change the item.
> +                * deliver the event to userspace. Again, we are holding ep->mtx,
> +                * so no operations coming from userspace can change the item.
>                  */
>                 revents = ep_item_poll(epi, &pt, 1);
>                 if (!revents)
> @@ -1645,7 +1639,8 @@ static int ep_send_events(struct eventpoll *ep,
>                         ep_pm_stay_awake(epi);
>                 }
>         }
> -       ep_done_scan(ep, 0, false, &txlist);
> +       ep_done_scan(ep, &txlist);
> +       mutex_unlock(&ep->mtx);
>
>         return res;
>  }
>

* Re: [PATCH 0/8] simplify ep_poll
  2020-11-08  4:45   ` Soheil Hassas Yeganeh
@ 2020-11-09 18:59     ` Davidlohr Bueso
  2020-11-09 19:28       ` Soheil Hassas Yeganeh
  2020-11-10 22:05     ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread
From: Davidlohr Bueso @ 2020-11-09 18:59 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: Andrew Morton, Linus Torvalds, Al Viro, linux-fsdevel,
	linux-kernel, Eric Dumazet, Willem de Bruijn, Khazhismel Kumykov,
	Guantao Liu

On Sat, 07 Nov 2020, Soheil Hassas Yeganeh wrote:
>FWIW, I also stress-tested the patch series applied on top of
>linux-next/master for a couple of hours.

Out of curiosity, what exactly did you use for testing?

Thanks,
Davidlohr

* Re: [PATCH 0/8] simplify ep_poll
  2020-11-09 18:59     ` Davidlohr Bueso
@ 2020-11-09 19:28       ` Soheil Hassas Yeganeh
  0 siblings, 0 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-09 19:28 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Linus Torvalds, Al Viro, linux-fsdevel,
	linux-kernel, Eric Dumazet, Willem de Bruijn, Khazhismel Kumykov,
	Guantao Liu

On Mon, Nov 9, 2020 at 2:22 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Sat, 07 Nov 2020, Soheil Hassas Yeganeh wrote:
> >FWIW, I also stress-tested the patch series applied on top of
> >linux-next/master for a couple of hours.
>
> Out of curiosity, what exactly did you use for testing?

That's a good question. I ran two flavors of tests:
- The epoll_wakeup_test selftest in a tight loop on two different
CPUs. This includes the newly added epoll61 test case for the timeout
race.
- Internal production loadtests (i.e., multi-node benchmarks).

Thanks,
Soheil

> Thanks,
> Davidlohr

* Re: [PATCH 1/8] epoll: check for events when removing a timed out thread from the wait queue
  2020-11-06 23:16 ` [PATCH 1/8] epoll: check for events when removing a timed out thread from the wait queue Soheil Hassas Yeganeh
@ 2020-11-10  6:05   ` Davidlohr Bueso
  0 siblings, 0 replies; 17+ messages in thread
From: Davidlohr Bueso @ 2020-11-10  6:05 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: torvalds, viro, linux-fsdevel, linux-kernel, akpm, edumazet,
	willemb, khazhy, guantaol, Soheil Hassas Yeganeh

On Fri, 06 Nov 2020, Soheil Hassas Yeganeh wrote:

>From: Soheil Hassas Yeganeh <soheil@google.com>
>
>After abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2)
>timeout"), we break out of the ep_poll loop upon timeout, without checking
>whether there are any new events available.  Prior to that patch series, we
>always called ep_events_available() after exiting the loop.
>
>This can cause races and missed wakeups.  For example, consider the
>following scenario reported by Guantao Liu:
>
>Suppose we have an eventfd added using EPOLLET to an epollfd.
>
>Thread 1: Sleeps for just below 5ms and then writes to an eventfd.
>Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an
>          event of the eventfd, it will write back on that fd.
>Thread 3: Calls epoll_wait with a negative timeout.
>
>Prior to abc610e01c66, it was guaranteed that Thread 3 would be woken up
>by either Thread 1 or Thread 2.  After abc610e01c66, Thread 3 can be blocked
>indefinitely if Thread 2 sees a timeout right before the write to the
>eventfd by Thread 1.  Thread 2 will be woken up from
>schedule_hrtimeout_range and, with eavail 0, it will not call
>ep_send_events().
>
>To fix this issue:
>1) Simplify the timed_out case as suggested by Linus.
>2) While holding the lock, recheck whether the thread was woken up
>   after its timeout expired.
>
>Note that (2) is different from Linus' original suggestion: it does not
>set "eavail = ep_events_available(ep)" to avoid unnecessary contention
>(when there are too many timed-out threads and a small number of events),
>as well as races mentioned in the discussion thread.
>
>This is the first patch in the series so that the backport to stable
>releases is straightforward.
>
>Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
>Fixes: abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout")
>Tested-by: Guantao Liu <guantaol@google.com>
>Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
>Reported-by: Guantao Liu <guantaol@google.com>
>Reviewed-by: Eric Dumazet <edumazet@google.com>
>Reviewed-by: Willem de Bruijn <willemb@google.com>
>Reviewed-by: Khazhismel Kumykov <khazhy@google.com>

Thanks for providing the fix and a testcase.

Reviewed-by: Davidlohr Bueso <dbueso@suse.de>

* Re: [PATCH 0/8] simplify ep_poll
  2020-11-08  4:45   ` Soheil Hassas Yeganeh
  2020-11-09 18:59     ` Davidlohr Bueso
@ 2020-11-10 22:05     ` Andrew Morton
  2020-11-11  3:37       ` Soheil Hassas Yeganeh
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2020-11-10 22:05 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: Linus Torvalds, Al Viro, linux-fsdevel, linux-kernel,
	Davidlohr Bueso, Eric Dumazet, Willem de Bruijn,
	Khazhismel Kumykov, Guantao Liu

On Sat, 7 Nov 2020 23:45:30 -0500 Soheil Hassas Yeganeh <soheil@google.com> wrote:

> On Sat, Nov 7, 2020 at 8:43 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri,  6 Nov 2020 18:16:27 -0500 Soheil Hassas Yeganeh <soheil.kdev@gmail.com> wrote:
> >
> > > From: Soheil Hassas Yeganeh <soheil@google.com>
> > >
> > > This patch series is a follow up based on the suggestions and feedback by Linus:
> > > https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
> >
> > Al Viro has been playing in here as well - see the below, from
> > linux-next.
> >
> > I think I'll leave it to you folks to sort this out, please.
> 
> Thank you Andrew for pointing that out!  Sorry that I didn't notice Al
> Viro's nice clean ups.
> 
> The changes are all orthogonal and apply cleanly except "epoll: pull
> fatal signal checks into ep_send_events()".   The conflict is trivial
> and the following patch should cleanly apply to linux-next/master (I
> didn't move the initialization of `res = 0` after the new branch to
> keep it simple).
> 
> FWIW, I also stress-tested the patch series applied on top of
> linux-next/master for a couple of hours.
> 
> Could you please let me know whether I should send a V2 of the patch
> series for linux-next? Thanks!

That worked, thanks.  I'll include all this in the next drop for
linux-next.


* Re: [PATCH 0/8] simplify ep_poll
  2020-11-10 22:05     ` Andrew Morton
@ 2020-11-11  3:37       ` Soheil Hassas Yeganeh
  0 siblings, 0 replies; 17+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-11-11  3:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Al Viro, linux-fsdevel, linux-kernel,
	Davidlohr Bueso, Eric Dumazet, Willem de Bruijn,
	Khazhismel Kumykov, Guantao Liu

On Tue, Nov 10, 2020 at 5:05 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sat, 7 Nov 2020 23:45:30 -0500 Soheil Hassas Yeganeh <soheil@google.com> wrote:
>
> > On Sat, Nov 7, 2020 at 8:43 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Fri,  6 Nov 2020 18:16:27 -0500 Soheil Hassas Yeganeh <soheil.kdev@gmail.com> wrote:
> > >
> > > > From: Soheil Hassas Yeganeh <soheil@google.com>
> > > >
> > > > This patch series is a follow up based on the suggestions and feedback by Linus:
> > > > https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
> > >
> > > Al Viro has been playing in here as well - see the below, from
> > > linux-next.
> > >
> > > I think I'll leave it to you folks to sort this out, please.
> >
> > Thank you Andrew for pointing that out!  Sorry that I didn't notice Al
> > Viro's nice clean ups.
> >
> > The changes are all orthogonal and apply cleanly except "epoll: pull
> > fatal signal checks into ep_send_events()".   The conflict is trivial
> > and the following patch should cleanly apply to linux-next/master (I
> > didn't move the initialization of `res = 0` after the new branch to
> > keep it simple).
> >
> > FWIW, I also stress-tested the patch series applied on top of
> > linux-next/master for a couple of hours.
> >
> > Could you please let me know whether I should send a V2 of the patch
> > series for linux-next? Thanks!
>
> That worked, thanks.  I'll include all this in the next drop for
> linux-next.

Awesome! Thanks very much, Andrew!
