[RESEND RFC PATCH] epoll: autoremove wakers even more aggressively

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RESEND RFC PATCH] epoll: autoremove wakers even more aggressively
@ 2022-06-15 21:24 Benjamin Segall
  2022-06-29 23:55 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Benjamin Segall @ 2022-06-15 21:24 UTC (permalink / raw)
  To: Alexander Viro, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Shakeel Butt, Eric Dumazet

If a process is killed or otherwise exits while having active network
connections and many threads waiting on epoll_wait, the threads will all
be woken immediately, but not removed from ep->wq. Then when network
traffic scans ep->wq in wake_up, every wakeup attempt will fail, and
will not remove the entries from the list.

This means that the cost of the wakeup attempt is far higher than usual,
does not decrease, and this also competes with the dying threads trying
to actually make progress and remove themselves from the wq.

Handle this by removing visited epoll wq entries unconditionally, rather
than only when the wakeup succeeds - the structure of ep_poll means that
the only potential loss is the timed_out->eavail heuristic, which now
can race and result in a redundant ep_send_events attempt. (But only
when incoming data and a timeout actually race, not on every timeout)

Signed-off-by: Ben Segall <bsegall@google.com>
---
 fs/eventpoll.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e2daa940ebce..8b56b94e2f56 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1745,10 +1745,25 @@ static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
 	ktime_get_ts64(&now);
 	*to = timespec64_add_safe(now, *to);
 	return to;
 }
 
+/*
+ * autoremove_wake_function, but remove even on failure to wake up, because we
+ * know that default_wake_function/ttwu will only fail if the thread is already
+ * woken, and in that case the ep_poll loop will remove the entry anyways, not
+ * try to reuse it.
+ */
+static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
+				       unsigned int mode, int sync, void *key)
+{
+	int ret = default_wake_function(wq_entry, mode, sync, key);
+
+	list_del_init(&wq_entry->entry);
+	return ret;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
  *
  * @ep: Pointer to the eventpoll context.
@@ -1826,12 +1841,19 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * chance to harvest new event. Otherwise wakeup can be
 		 * lost. This is also good performance-wise, because on
 		 * normal wakeup path no need to call __remove_wait_queue()
 		 * explicitly, thus ep->lock is not taken, which halts the
 		 * event delivery.
+		 *
+		 * In fact, we now use an even more aggressive function that
+		 * unconditionally removes, because we don't reuse the wait
+		 * entry between loop iterations. This lets us also avoid the
+		 * performance issue if a process is killed, causing all of its
+		 * threads to wake up without being removed normally.
 		 */
 		init_wait(&wait);
+		wait.func = ep_autoremove_wake_function;
 
 		write_lock_irq(&ep->lock);
 		/*
 		 * Barrierless variant, waitqueue_active() is called under
 		 * the same lock on wakeup ep_poll_callback() side, so it
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RESEND RFC PATCH] epoll: autoremove wakers even more aggressively
  2022-06-15 21:24 [RESEND RFC PATCH] epoll: autoremove wakers even more aggressively Benjamin Segall
@ 2022-06-29 23:55 ` Andrew Morton
  2022-06-30  1:12   ` Shakeel Butt
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2022-06-29 23:55 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, Linus Torvalds,
	Shakeel Butt, Eric Dumazet, Roman Penyaev, Jason Baron,
	Khazhismel Kumykov, Heiher

On Wed, 15 Jun 2022 14:24:23 -0700 Benjamin Segall <bsegall@google.com> wrote:

> If a process is killed or otherwise exits while having active network
> connections and many threads waiting on epoll_wait, the threads will all
> be woken immediately, but not removed from ep->wq. Then when network
> traffic scans ep->wq in wake_up, every wakeup attempt will fail, and
> will not remove the entries from the list.
> 
> This means that the cost of the wakeup attempt is far higher than usual,
> does not decrease, and this also competes with the dying threads trying
> to actually make progress and remove themselves from the wq.
> 
> Handle this by removing visited epoll wq entries unconditionally, rather
> than only when the wakeup succeeds - the structure of ep_poll means that
> the only potential loss is the timed_out->eavail heuristic, which now
> can race and result in a redundant ep_send_events attempt. (But only
> when incoming data and a timeout actually race, not on every timeout)
>

Thanks.  I added people from 412895f03cbf96 ("epoll: atomically remove
wait entry on wake up") to cc.  Hopefully someone there can help review
and maybe test this.


> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e2daa940ebce..8b56b94e2f56 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1745,10 +1745,25 @@ static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
>  	ktime_get_ts64(&now);
>  	*to = timespec64_add_safe(now, *to);
>  	return to;
>  }
>  
> +/*
> + * autoremove_wake_function, but remove even on failure to wake up, because we
> + * know that default_wake_function/ttwu will only fail if the thread is already
> + * woken, and in that case the ep_poll loop will remove the entry anyways, not
> + * try to reuse it.
> + */
> +static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
> +				       unsigned int mode, int sync, void *key)
> +{
> +	int ret = default_wake_function(wq_entry, mode, sync, key);
> +
> +	list_del_init(&wq_entry->entry);
> +	return ret;
> +}
> +
>  /**
>   * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
>   *           event buffer.
>   *
>   * @ep: Pointer to the eventpoll context.
> @@ -1826,12 +1841,19 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  		 * chance to harvest new event. Otherwise wakeup can be
>  		 * lost. This is also good performance-wise, because on
>  		 * normal wakeup path no need to call __remove_wait_queue()
>  		 * explicitly, thus ep->lock is not taken, which halts the
>  		 * event delivery.
> +		 *
> +		 * In fact, we now use an even more aggressive function that
> +		 * unconditionally removes, because we don't reuse the wait
> +		 * entry between loop iterations. This lets us also avoid the
> +		 * performance issue if a process is killed, causing all of its
> +		 * threads to wake up without being removed normally.
>  		 */
>  		init_wait(&wait);
> +		wait.func = ep_autoremove_wake_function;
>  
>  		write_lock_irq(&ep->lock);
>  		/*
>  		 * Barrierless variant, waitqueue_active() is called under
>  		 * the same lock on wakeup ep_poll_callback() side, so it
> -- 
> 2.36.1.476.g0c4daa206d-goog

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RESEND RFC PATCH] epoll: autoremove wakers even more aggressively
  2022-06-29 23:55 ` Andrew Morton
@ 2022-06-30  1:12   ` Shakeel Butt
  2022-06-30  2:24     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Shakeel Butt @ 2022-06-30  1:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Benjamin Segall, Alexander Viro, linux-fsdevel, LKML,
	Linus Torvalds, Eric Dumazet, Roman Penyaev, Jason Baron,
	Khazhismel Kumykov, Heiher

On Wed, Jun 29, 2022 at 4:55 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 15 Jun 2022 14:24:23 -0700 Benjamin Segall <bsegall@google.com> wrote:
>
> > If a process is killed or otherwise exits while having active network
> > connections and many threads waiting on epoll_wait, the threads will all
> > be woken immediately, but not removed from ep->wq. Then when network
> > traffic scans ep->wq in wake_up, every wakeup attempt will fail, and
> > will not remove the entries from the list.
> >
> > This means that the cost of the wakeup attempt is far higher than usual,
> > does not decrease, and this also competes with the dying threads trying
> > to actually make progress and remove themselves from the wq.
> >
> > Handle this by removing visited epoll wq entries unconditionally, rather
> > than only when the wakeup succeeds - the structure of ep_poll means that
> > the only potential loss is the timed_out->eavail heuristic, which now
> > can race and result in a redundant ep_send_events attempt. (But only
> > when incoming data and a timeout actually race, not on every timeout)
> >
>
> Thanks.  I added people from 412895f03cbf96 ("epoll: atomically remove
> wait entry on wake up") to cc.  Hopefully someone there can help review
> and maybe test this.
>
>

Thanks Andrew. Just wanted to add that we are seeing this issue in
production with real workloads and it has caused hard lockups.
Particularly network heavy workloads with a lot of threads in
epoll_wait() can easily trigger this issue if they get killed
(oom-killed in our case).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RESEND RFC PATCH] epoll: autoremove wakers even more aggressively
  2022-06-30  1:12   ` Shakeel Butt
@ 2022-06-30  2:24     ` Andrew Morton
  2022-06-30 14:59       ` Shakeel Butt
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2022-06-30  2:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Benjamin Segall, Alexander Viro, linux-fsdevel, LKML,
	Linus Torvalds, Eric Dumazet, Roman Penyaev, Jason Baron,
	Khazhismel Kumykov, Heiher

On Wed, 29 Jun 2022 18:12:46 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> On Wed, Jun 29, 2022 at 4:55 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 15 Jun 2022 14:24:23 -0700 Benjamin Segall <bsegall@google.com> wrote:
> >
> > > If a process is killed or otherwise exits while having active network
> > > connections and many threads waiting on epoll_wait, the threads will all
> > > be woken immediately, but not removed from ep->wq. Then when network
> > > traffic scans ep->wq in wake_up, every wakeup attempt will fail, and
> > > will not remove the entries from the list.
> > >
> > > This means that the cost of the wakeup attempt is far higher than usual,
> > > does not decrease, and this also competes with the dying threads trying
> > > to actually make progress and remove themselves from the wq.
> > >
> > > Handle this by removing visited epoll wq entries unconditionally, rather
> > > than only when the wakeup succeeds - the structure of ep_poll means that
> > > the only potential loss is the timed_out->eavail heuristic, which now
> > > can race and result in a redundant ep_send_events attempt. (But only
> > > when incoming data and a timeout actually race, not on every timeout)
> > >
> >
> > Thanks.  I added people from 412895f03cbf96 ("epoll: atomically remove
> > wait entry on wake up") to cc.  Hopefully someone there can help review
> > and maybe test this.
> >
> >
> 
> Thanks Andrew. Just wanted to add that we are seeing this issue in
> production with real workloads and it has caused hard lockups.
> Particularly network heavy workloads with a lot of threads in
> epoll_wait() can easily trigger this issue if they get killed
> (oom-killed in our case).

Hard lockups are undesirable.  Is a cc:stable justified here?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RESEND RFC PATCH] epoll: autoremove wakers even more aggressively
  2022-06-30  2:24     ` Andrew Morton
@ 2022-06-30 14:59       ` Shakeel Butt
  2022-07-16  1:27         ` Shakeel Butt
  0 siblings, 1 reply; 7+ messages in thread
From: Shakeel Butt @ 2022-06-30 14:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Benjamin Segall, Alexander Viro, linux-fsdevel, LKML,
	Linus Torvalds, Eric Dumazet, Roman Penyaev, Jason Baron,
	Khazhismel Kumykov, Heiher

On Wed, Jun 29, 2022 at 7:24 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 29 Jun 2022 18:12:46 -0700 Shakeel Butt <shakeelb@google.com> wrote:
>
> > On Wed, Jun 29, 2022 at 4:55 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Wed, 15 Jun 2022 14:24:23 -0700 Benjamin Segall <bsegall@google.com> wrote:
> > >
> > > > If a process is killed or otherwise exits while having active network
> > > > connections and many threads waiting on epoll_wait, the threads will all
> > > > be woken immediately, but not removed from ep->wq. Then when network
> > > > traffic scans ep->wq in wake_up, every wakeup attempt will fail, and
> > > > will not remove the entries from the list.
> > > >
> > > > This means that the cost of the wakeup attempt is far higher than usual,
> > > > does not decrease, and this also competes with the dying threads trying
> > > > to actually make progress and remove themselves from the wq.
> > > >
> > > > Handle this by removing visited epoll wq entries unconditionally, rather
> > > > than only when the wakeup succeeds - the structure of ep_poll means that
> > > > the only potential loss is the timed_out->eavail heuristic, which now
> > > > can race and result in a redundant ep_send_events attempt. (But only
> > > > when incoming data and a timeout actually race, not on every timeout)
> > > >
> > >
> > > Thanks.  I added people from 412895f03cbf96 ("epoll: atomically remove
> > > wait entry on wake up") to cc.  Hopefully someone there can help review
> > > and maybe test this.
> > >
> > >
> >
> > Thanks Andrew. Just wanted to add that we are seeing this issue in
> > production with real workloads and it has caused hard lockups.
> > Particularly network heavy workloads with a lot of threads in
> > epoll_wait() can easily trigger this issue if they get killed
> > (oom-killed in our case).
>
> Hard lockups are undesirable.  Is a cc:stable justified here?

Not for now as I don't know if we can blame a patch which might be the
source of this behavior.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RESEND RFC PATCH] epoll: autoremove wakers even more aggressively
  2022-06-30 14:59       ` Shakeel Butt
@ 2022-07-16  1:27         ` Shakeel Butt
  2022-07-16  4:55           ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Shakeel Butt @ 2022-07-16  1:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Benjamin Segall, Alexander Viro, linux-fsdevel, LKML,
	Linus Torvalds, Eric Dumazet, Roman Penyaev, Jason Baron,
	Khazhismel Kumykov, Heiher

On Thu, Jun 30, 2022 at 07:59:05AM -0700, Shakeel Butt wrote:
> On Wed, Jun 29, 2022 at 7:24 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 29 Jun 2022 18:12:46 -0700 Shakeel Butt <shakeelb@google.com> wrote:
> >
> > > On Wed, Jun 29, 2022 at 4:55 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Wed, 15 Jun 2022 14:24:23 -0700 Benjamin Segall <bsegall@google.com> wrote:
> > > >
> > > > > If a process is killed or otherwise exits while having active network
> > > > > connections and many threads waiting on epoll_wait, the threads will all
> > > > > be woken immediately, but not removed from ep->wq. Then when network
> > > > > traffic scans ep->wq in wake_up, every wakeup attempt will fail, and
> > > > > will not remove the entries from the list.
> > > > >
> > > > > This means that the cost of the wakeup attempt is far higher than usual,
> > > > > does not decrease, and this also competes with the dying threads trying
> > > > > to actually make progress and remove themselves from the wq.
> > > > >
> > > > > Handle this by removing visited epoll wq entries unconditionally, rather
> > > > > than only when the wakeup succeeds - the structure of ep_poll means that
> > > > > the only potential loss is the timed_out->eavail heuristic, which now
> > > > > can race and result in a redundant ep_send_events attempt. (But only
> > > > > when incoming data and a timeout actually race, not on every timeout)
> > > > >
> > > >
> > > > Thanks.  I added people from 412895f03cbf96 ("epoll: atomically remove
> > > > wait entry on wake up") to cc.  Hopefully someone there can help review
> > > > and maybe test this.
> > > >
> > > >
> > >
> > > Thanks Andrew. Just wanted to add that we are seeing this issue in
> > > production with real workloads and it has caused hard lockups.
> > > Particularly network heavy workloads with a lot of threads in
> > > epoll_wait() can easily trigger this issue if they get killed
> > > (oom-killed in our case).
> >
> > Hard lockups are undesirable.  Is a cc:stable justified here?
> 
> Not for now as I don't know if we can blame a patch which might be the
> source of this behavior.

I am able to repro the epoll hard lockup on next-20220715 with Ben's
patch reverted. The repro is a simple TCP server and tens of clients
communicating over loopback. Though to cause the hard lockup I have to
create a couple thousand threads in epoll_wait() in server and also
reduce the kernel.watchdog_thresh. With Ben's patch the repro does not
cause the hard lockup even with kernel.watchdog.thresh=1.

Please add:

Tested-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RESEND RFC PATCH] epoll: autoremove wakers even more aggressively
  2022-07-16  1:27         ` Shakeel Butt
@ 2022-07-16  4:55           ` Andrew Morton
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2022-07-16  4:55 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Benjamin Segall, Alexander Viro, linux-fsdevel, LKML,
	Linus Torvalds, Eric Dumazet, Roman Penyaev, Jason Baron,
	Khazhismel Kumykov, Heiher

On Sat, 16 Jul 2022 01:27:31 +0000 Shakeel Butt <shakeelb@google.com> wrote:

>
> ...
>
> > > > production with real workloads and it has caused hard lockups.
> > > > Particularly network heavy workloads with a lot of threads in
> > > > epoll_wait() can easily trigger this issue if they get killed
> > > > (oom-killed in our case).
> > >
> > > Hard lockups are undesirable.  Is a cc:stable justified here?
> > 
> > Not for now as I don't know if we can blame a patch which might be the
> > source of this behavior.
> 
> I am able to repro the epoll hard lockup on next-20220715 with Ben's
> patch reverted. The repro is a simple TCP server and tens of clients
> communicating over loopback. Though to cause the hard lockup I have to
> create a couple thousand threads in epoll_wait() in server and also
> reduce the kernel.watchdog_thresh. With Ben's patch the repro does not
> cause the hard lockup even with kernel.watchdog.thresh=1.
> 
> Please add:
> 
> Tested-by: Shakeel Butt <shakeelb@google.com>

OK, thanks.  I added the cc:stable.  No Fixes:, as it has presumably
been there for a long time, perhaps for all time.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-07-16  4:55 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-15 21:24 [RESEND RFC PATCH] epoll: autoremove wakers even more aggressively Benjamin Segall
2022-06-29 23:55 ` Andrew Morton
2022-06-30  1:12   ` Shakeel Butt
2022-06-30  2:24     ` Andrew Morton
2022-06-30 14:59       ` Shakeel Butt
2022-07-16  1:27         ` Shakeel Butt
2022-07-16  4:55           ` Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.