* [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
@ 2023-03-07 18:46 Paolo Abeni
  2023-03-07 21:17 ` Keller, Jacob E
  2023-03-07 21:30 ` Andrew Morton
  0 siblings, 2 replies; 7+ messages in thread
From: Paolo Abeni @ 2023-03-07 18:46 UTC (permalink / raw)
  To: netdev
  Cc: Soheil Hassas Yeganeh, Al Viro, Carlos Maiolino, Eric Biggers,
	Jacob Keller, Andrew Morton, Jens Axboe, Christian Brauner,
	linux-fsdevel

We are observing huge contention on the epmutex during an http
connection/rate test:

 83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
[...]
           |--66.96%--__fput
                      |--60.04%--eventpoll_release_file
                                 |--58.41%--__mutex_lock.isra.6
                                           |--56.56%--osq_lock

The application is multi-threaded, creates a new epoll entry for
each incoming connection, and does not delete it before the
connection shutdown - that is, before the connection's fd close().

Many different threads compete frequently for the epmutex lock,
affecting the overall performance.

To reduce the contention this patch introduces explicit reference counting
for the eventpoll struct. Each registered event acquires a reference,
and references are released at ep_remove() time.

Additionally, this introduces a new 'dying' flag to prevent races between
the EP file close() and the monitored file close().
eventpoll_release_file() marks, under f_lock spinlock, each epitem as before
removing it, while EP file close() does not touch dying epitems.

The eventpoll struct is released by whoever - among EP file close() and
the monitored file close() - drops its last reference.

With all the above in place, we can drop the epmutex usage at disposal time.

Overall this produces a significant performance improvement in the
mentioned connection/rate scenario: the mutex operations disappear from
the topmost offenders in the perf report, and the measured connections/rate
grows by ~60%.

To make the change more readable this additionally renames ep_free() to
ep_clear_and_put(), and moves the actual memory cleanup into a separate
ep_free() helper.

Tested-by: Xiumei Mu <xmu@redhat.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This is a repost of v4, with no changes. Kindly asking if FS maintainers
could have a look.
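
The problematic workload pattern is, in a nutshell, the following
(minimal userspace sketch, error handling omitted; 'listen_fd' and the
helper name are made up for illustration):

	#include <sys/epoll.h>
	#include <sys/socket.h>
	#include <unistd.h>

	static void serve_one_connection(int listen_fd)
	{
		int epfd = epoll_create1(0);
		int conn = accept(listen_fd, NULL, NULL);
		struct epoll_event ev = { .events = EPOLLIN, .data.fd = conn };

		epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
		/* ... event loop serving this connection ... */

		/*
		 * No EPOLL_CTL_DEL before close(): the final fput() of
		 * 'conn' ends up in eventpoll_release_file(), which is
		 * where the epmutex contention shown above comes from.
		 */
		close(conn);
		close(epfd);
	}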

    v4 at:
    https://lore.kernel.org/linux-fsdevel/9d8ad7995e51ad3aecdfe6f7f9e72231b8c9d3b5.1671569682.git.pabeni@redhat.com/

    v3 at:
    https://lore.kernel.org/linux-fsdevel/1aedd7e87097bc4352ba658ac948c585a655785a.1669657846.git.pabeni@redhat.com/

    v2 at:
    https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/

    v1 at:
    https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/

    Previous related effort at:
    https://lore.kernel.org/linux-fsdevel/20190727113542.162213-1-cj.chengjian@huawei.com/
    https://lkml.org/lkml/2017/10/28/81
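
    For review convenience, here is an informal summary of the resulting
    refcount rules (this paraphrases the diff below, it is not extra code):

	ep_alloc():         refcount_set(&ep->refcount, 1) /* the ep file's own ref */
	ep_insert():        ep_get(ep)                     /* one ref per epitem */
	__ep_remove():      drops the epitem's ref, returns true on the last one
	ep_clear_and_put(): removes the non-dying epitems, drops the file's ref
	whoever drops the last reference calls ep_free()
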
---
 fs/eventpoll.c | 185 +++++++++++++++++++++++++++++++------------------
 1 file changed, 116 insertions(+), 69 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 64659b110973..a43ccb02133c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -57,13 +57,7 @@
  * we need a lock that will allow us to sleep. This lock is a
  * mutex (ep->mtx). It is acquired during the event transfer loop,
  * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
- * Then we also need a global mutex to serialize eventpoll_release_file()
- * and ep_free().
- * This mutex is acquired by ep_free() during the epoll file
- * cleanup path and it is also acquired by eventpoll_release_file()
- * if a file has been pushed inside an epoll set and it is then
- * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
- * It is also acquired when inserting an epoll fd onto another epoll
+ * The epmutex is acquired when inserting an epoll fd onto another epoll
  * fd. We do this so that we walk the epoll tree and ensure that this
  * insertion does not create a cycle of epoll file descriptors, which
  * could lead to deadlock. We need a global mutex to prevent two
@@ -153,6 +147,13 @@ struct epitem {
 	/* The file descriptor information this item refers to */
 	struct epoll_filefd ffd;
 
+	/*
+	 * Protected by file->f_lock, true for to-be-released epitem already
+	 * removed from the "struct file" items list; together with
+	 * eventpoll->refcount orchestrates "struct eventpoll" disposal
+	 */
+	bool dying;
+
 	/* List containing poll wait queues */
 	struct eppoll_entry *pwqlist;
 
@@ -217,6 +218,12 @@ struct eventpoll {
 	u64 gen;
 	struct hlist_head refs;
 
+	/*
+	 * usage count, used together with epitem->dying to
+	 * orchestrate the disposal of this struct
+	 */
+	refcount_t refcount;
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	/* used to track busy poll napi_id */
 	unsigned int napi_id;
@@ -240,9 +247,7 @@ struct ep_pqueue {
 /* Maximum number of epoll watched descriptors, per user */
 static long max_user_watches __read_mostly;
 
-/*
- * This mutex is used to serialize ep_free() and eventpoll_release_file().
- */
+/* Used for cycle detection */
 static DEFINE_MUTEX(epmutex);
 
 static u64 loop_check_gen = 0;
@@ -557,8 +562,7 @@ static void ep_remove_wait_queue(struct eppoll_entry *pwq)
 
 /*
  * This function unregisters poll callbacks from the associated file
- * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
- * ep_free).
+ * descriptor.  Must be called with "mtx" held.
  */
 static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 {
@@ -681,11 +685,38 @@ static void epi_rcu_free(struct rcu_head *head)
 	kmem_cache_free(epi_cache, epi);
 }
 
+static void ep_get(struct eventpoll *ep)
+{
+	refcount_inc(&ep->refcount);
+}
+
+/*
+ * Returns true if the event poll can be disposed
+ */
+static bool ep_refcount_dec_and_test(struct eventpoll *ep)
+{
+	if (!refcount_dec_and_test(&ep->refcount))
+		return false;
+
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
+	return true;
+}
+
+static void ep_free(struct eventpoll *ep)
+{
+	mutex_destroy(&ep->mtx);
+	free_uid(ep->user);
+	wakeup_source_unregister(ep->ws);
+	kfree(ep);
+}
+
 /*
  * Removes a "struct epitem" from the eventpoll RB tree and deallocates
  * all the associated resources. Must be called with "mtx" held.
+ * If the dying flag is set, do the removal only if force is true.
+ * Returns true if the eventpoll can be disposed.
  */
-static int ep_remove(struct eventpoll *ep, struct epitem *epi)
+static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
 {
 	struct file *file = epi->ffd.file;
 	struct epitems_head *to_free;
@@ -700,6 +731,11 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 
 	/* Remove the current item from the list of epoll hooks */
 	spin_lock(&file->f_lock);
+	if (epi->dying && !force) {
+		spin_unlock(&file->f_lock);
+		return false;
+	}
+
 	to_free = NULL;
 	head = file->f_ep;
 	if (head->first == &epi->fllink && !epi->fllink.next) {
@@ -733,28 +769,28 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 	call_rcu(&epi->rcu, epi_rcu_free);
 
 	percpu_counter_dec(&ep->user->epoll_watches);
+	return ep_refcount_dec_and_test(ep);
+}
 
-	return 0;
+/*
+ * ep_remove variant for callers owning an additional reference to the ep
+ */
+static void ep_remove_safe(struct eventpoll *ep, struct epitem *epi)
+{
+	WARN_ON_ONCE(__ep_remove(ep, epi, false));
 }
 
-static void ep_free(struct eventpoll *ep)
+static void ep_clear_and_put(struct eventpoll *ep)
 {
 	struct rb_node *rbp;
 	struct epitem *epi;
+	bool dispose;
 
 	/* We need to release all tasks waiting for these file */
 	if (waitqueue_active(&ep->poll_wait))
 		ep_poll_safewake(ep, NULL, 0);
 
-	/*
-	 * We need to lock this because we could be hit by
-	 * eventpoll_release_file() while we're freeing the "struct eventpoll".
-	 * We do not need to hold "ep->mtx" here because the epoll file
-	 * is on the way to be removed and no one has references to it
-	 * anymore. The only hit might come from eventpoll_release_file() but
-	 * holding "epmutex" is sufficient here.
-	 */
-	mutex_lock(&epmutex);
+	mutex_lock(&ep->mtx);
 
 	/*
 	 * Walks through the whole tree by unregistering poll callbacks.
@@ -768,25 +804,21 @@ static void ep_free(struct eventpoll *ep)
 
 	/*
 	 * Walks through the whole tree by freeing each "struct epitem". At this
-	 * point we are sure no poll callbacks will be lingering around, and also by
-	 * holding "epmutex" we can be sure that no file cleanup code will hit
-	 * us during this operation. So we can avoid the lock on "ep->lock".
-	 * We do not need to lock ep->mtx, either, we only do it to prevent
-	 * a lockdep warning.
+	 * point we are sure no poll callbacks will be lingering around.
+	 * Since we still own a reference to the eventpoll struct, the loop can't
+	 * dispose it.
 	 */
-	mutex_lock(&ep->mtx);
 	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
 		epi = rb_entry(rbp, struct epitem, rbn);
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		cond_resched();
 	}
+
+	dispose = ep_refcount_dec_and_test(ep);
 	mutex_unlock(&ep->mtx);
 
-	mutex_unlock(&epmutex);
-	mutex_destroy(&ep->mtx);
-	free_uid(ep->user);
-	wakeup_source_unregister(ep->ws);
-	kfree(ep);
+	if (dispose)
+		ep_free(ep);
 }
 
 static int ep_eventpoll_release(struct inode *inode, struct file *file)
@@ -794,7 +826,7 @@ static int ep_eventpoll_release(struct inode *inode, struct file *file)
 	struct eventpoll *ep = file->private_data;
 
 	if (ep)
-		ep_free(ep);
+		ep_clear_and_put(ep);
 
 	return 0;
 }
@@ -906,33 +938,35 @@ void eventpoll_release_file(struct file *file)
 {
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct hlist_node *next;
+	bool dispose;
 
 	/*
-	 * We don't want to get "file->f_lock" because it is not
-	 * necessary. It is not necessary because we're in the "struct file"
-	 * cleanup path, and this means that no one is using this file anymore.
-	 * So, for example, epoll_ctl() cannot hit here since if we reach this
-	 * point, the file counter already went to zero and fget() would fail.
-	 * The only hit might come from ep_free() but by holding the mutex
-	 * will correctly serialize the operation. We do need to acquire
-	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
-	 * from anywhere but ep_free().
-	 *
-	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
+	 * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
+	 * touching the epitems list before eventpoll_release_file() can access
+	 * the ep->mtx.
 	 */
-	mutex_lock(&epmutex);
-	if (unlikely(!file->f_ep)) {
-		mutex_unlock(&epmutex);
-		return;
-	}
-	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
+again:
+	spin_lock(&file->f_lock);
+	if (file->f_ep && file->f_ep->first) {
+		/* detach from ep tree */
+		epi = hlist_entry(file->f_ep->first, struct epitem, fllink);
+		epi->dying = true;
+		spin_unlock(&file->f_lock);
+
+		/*
+		 * ep access is safe as we still own a reference to the ep
+		 * struct
+		 */
 		ep = epi->ep;
-		mutex_lock_nested(&ep->mtx, 0);
-		ep_remove(ep, epi);
+		mutex_lock(&ep->mtx);
+		dispose = __ep_remove(ep, epi, true);
 		mutex_unlock(&ep->mtx);
+
+		if (dispose)
+			ep_free(ep);
+		goto again;
 	}
-	mutex_unlock(&epmutex);
+	spin_unlock(&file->f_lock);
 }
 
 static int ep_alloc(struct eventpoll **pep)
@@ -955,6 +989,7 @@ static int ep_alloc(struct eventpoll **pep)
 	ep->rbr = RB_ROOT_CACHED;
 	ep->ovflist = EP_UNACTIVE_PTR;
 	ep->user = user;
+	refcount_set(&ep->refcount, 1);
 
 	*pep = ep;
 
@@ -1223,10 +1258,10 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 		 */
 		list_del_init(&wait->entry);
 		/*
-		 * ->whead != NULL protects us from the race with ep_free()
-		 * or ep_remove(), ep_remove_wait_queue() takes whead->lock
-		 * held by the caller. Once we nullify it, nothing protects
-		 * ep/epi or even wait.
+		 * ->whead != NULL protects us from the race with
+		 * ep_clear_and_put() or ep_remove(), ep_remove_wait_queue()
+		 * takes whead->lock held by the caller. Once we nullify it,
+		 * nothing protects ep/epi or even wait.
 		 */
 		smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
 	}
@@ -1496,16 +1531,22 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	if (tep)
 		mutex_unlock(&tep->mtx);
 
+	/*
+	 * ep_remove_safe() calls in the later error paths can't lead to
+	 * ep_free() as the ep file itself still holds an ep reference.
+	 */
+	ep_get(ep);
+
 	/* now check if we've created too many backpaths */
 	if (unlikely(full_check && reverse_path_check())) {
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		return -EINVAL;
 	}
 
 	if (epi->event.events & EPOLLWAKEUP) {
 		error = ep_create_wakeup_source(epi);
 		if (error) {
-			ep_remove(ep, epi);
+			ep_remove_safe(ep, epi);
 			return error;
 		}
 	}
@@ -1529,7 +1570,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	 * high memory pressure.
 	 */
 	if (unlikely(!epq.epi)) {
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		return -ENOMEM;
 	}
 
@@ -2025,7 +2066,7 @@ static int do_epoll_create(int flags)
 out_free_fd:
 	put_unused_fd(fd);
 out_free_ep:
-	ep_free(ep);
+	ep_clear_and_put(ep);
 	return error;
 }
 
@@ -2167,10 +2208,16 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 			error = -EEXIST;
 		break;
 	case EPOLL_CTL_DEL:
-		if (epi)
-			error = ep_remove(ep, epi);
-		else
+		if (epi) {
+			/*
+			 * The eventpoll itself is still alive: the refcount
+			 * can't go to zero here.
+			 */
+			ep_remove_safe(ep, epi);
+			error = 0;
+		} else {
 			error = -ENOENT;
+		}
 		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {
-- 
2.39.2



* RE: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
  2023-03-07 18:46 [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention Paolo Abeni
@ 2023-03-07 21:17 ` Keller, Jacob E
  2023-03-07 21:21   ` Soheil Hassas Yeganeh
  2023-03-07 21:30 ` Andrew Morton
  1 sibling, 1 reply; 7+ messages in thread
From: Keller, Jacob E @ 2023-03-07 21:17 UTC (permalink / raw)
  To: Paolo Abeni, netdev
  Cc: Soheil Hassas Yeganeh, Al Viro, Carlos Maiolino, Eric Biggers,
	Andrew Morton, Jens Axboe, Christian Brauner, linux-fsdevel



> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Tuesday, March 7, 2023 10:47 AM
> To: netdev@vger.kernel.org
> Cc: Soheil Hassas Yeganeh <soheil@google.com>; Al Viro
> <viro@zeniv.linux.org.uk>; Carlos Maiolino <cmaiolino@redhat.com>; Eric
> Biggers <ebiggers@kernel.org>; Keller, Jacob E <jacob.e.keller@intel.com>;
> Andrew Morton <akpm@linux-foundation.org>; Jens Axboe <axboe@kernel.dk>;
> Christian Brauner <brauner@kernel.org>; linux-fsdevel@vger.kernel.org
> Subject: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
> 
> We are observing huge contention on the epmutex during an http
> connection/rate test:
> 
>  83.17% 0.25%  nginx            [kernel.kallsyms]         [k]
> entry_SYSCALL_64_after_hwframe
> [...]
>            |--66.96%--__fput
>                       |--60.04%--eventpoll_release_file
>                                  |--58.41%--__mutex_lock.isra.6
>                                            |--56.56%--osq_lock
> 
> The application is multi-threaded, creates a new epoll entry for
> each incoming connection, and does not delete it before the
> connection shutdown - that is, before the connection's fd close().
> 
> Many different threads compete frequently for the epmutex lock,
> affecting the overall performance.
> 
> To reduce the contention this patch introduces explicit reference counting
> for the eventpoll struct. Each registered event acquires a reference,
> and references are released at ep_remove() time.
> 
> Additionally, this introduces a new 'dying' flag to prevent races between
> the EP file close() and the monitored file close().
> eventpoll_release_file() marks, under f_lock spinlock, each epitem as before
> removing it, while EP file close() does not touch dying epitems.
> 
> The eventpoll struct is released by whoever - among EP file close() and
> the monitored file close() - drops its last reference.
> 
> With all the above in place, we can drop the epmutex usage at disposal time.
> 
> Overall this produces a significant performance improvement in the
> mentioned connection/rate scenario: the mutex operations disappear from
> the topmost offenders in the perf report, and the measured connections/rate
> grows by ~60%.
> 
> To make the change more readable this additionally renames ep_free() to
> ep_clear_and_put(), and moves the actual memory cleanup into a separate
> ep_free() helper.
> 
> Tested-by: Xiumei Mu <xmu@redhat.com>
> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> This is a repost of v4, with no changes. Kindly asking if FS maintainers
> could have a look.

This (still) looks good to me.

Thanks,
Jake


* Re: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
  2023-03-07 21:17 ` Keller, Jacob E
@ 2023-03-07 21:21   ` Soheil Hassas Yeganeh
  0 siblings, 0 replies; 7+ messages in thread
From: Soheil Hassas Yeganeh @ 2023-03-07 21:21 UTC (permalink / raw)
  To: Keller, Jacob E
  Cc: Paolo Abeni, netdev, Al Viro, Carlos Maiolino, Eric Biggers,
	Andrew Morton, Jens Axboe, Christian Brauner, linux-fsdevel

On Tue, Mar 7, 2023 at 4:17 PM Keller, Jacob E <jacob.e.keller@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Paolo Abeni <pabeni@redhat.com>
> > Sent: Tuesday, March 7, 2023 10:47 AM
> > To: netdev@vger.kernel.org
> > Cc: Soheil Hassas Yeganeh <soheil@google.com>; Al Viro
> > <viro@zeniv.linux.org.uk>; Carlos Maiolino <cmaiolino@redhat.com>; Eric
> > Biggers <ebiggers@kernel.org>; Keller, Jacob E <jacob.e.keller@intel.com>;
> > Andrew Morton <akpm@linux-foundation.org>; Jens Axboe <axboe@kernel.dk>;
> > Christian Brauner <brauner@kernel.org>; linux-fsdevel@vger.kernel.org
> > Subject: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
> >
> > We are observing huge contention on the epmutex during an http
> > connection/rate test:
> >
> >  83.17% 0.25%  nginx            [kernel.kallsyms]         [k]
> > entry_SYSCALL_64_after_hwframe
> > [...]
> >            |--66.96%--__fput
> >                       |--60.04%--eventpoll_release_file
> >                                  |--58.41%--__mutex_lock.isra.6
> >                                            |--56.56%--osq_lock
> >
> > The application is multi-threaded, creates a new epoll entry for
> > each incoming connection, and does not delete it before the
> > connection shutdown - that is, before the connection's fd close().
> >
> > Many different threads compete frequently for the epmutex lock,
> > affecting the overall performance.
> >
> > To reduce the contention this patch introduces explicit reference counting
> > for the eventpoll struct. Each registered event acquires a reference,
> > and references are released at ep_remove() time.
> >
> > Additionally, this introduces a new 'dying' flag to prevent races between
> > the EP file close() and the monitored file close().
> > eventpoll_release_file() marks, under f_lock spinlock, each epitem as before
> > removing it, while EP file close() does not touch dying epitems.
> >
> > The eventpoll struct is released by whoever - among EP file close() and
> > the monitored file close() - drops its last reference.
> >
> > With all the above in place, we can drop the epmutex usage at disposal time.
> >
> > Overall this produces a significant performance improvement in the
> > mentioned connection/rate scenario: the mutex operations disappear from
> > the topmost offenders in the perf report, and the measured connections/rate
> > grows by ~60%.
> >
> > To make the change more readable this additionally renames ep_free() to
> > ep_clear_and_put(), and moves the actual memory cleanup into a separate
> > ep_free() helper.
> >
> > Tested-by: Xiumei Mu <xmu@redhat.com>
> > Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
> > Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > This is a repost of v4, with no changes. Kindly asking if FS maintainers
> > could have a look.
>
> This (still) looks good to me.
>
> Thanks,
> Jake

Thank you! Still looks great to me as well.


* Re: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
  2023-03-07 18:46 [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention Paolo Abeni
  2023-03-07 21:17 ` Keller, Jacob E
@ 2023-03-07 21:30 ` Andrew Morton
  2023-03-08  8:55   ` Paolo Abeni
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2023-03-07 21:30 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Soheil Hassas Yeganeh, Al Viro, Carlos Maiolino,
	Eric Biggers, Jacob Keller, Jens Axboe, Christian Brauner,
	linux-fsdevel

On Tue,  7 Mar 2023 19:46:37 +0100 Paolo Abeni <pabeni@redhat.com> wrote:

> We are observing huge contention on the epmutex during an http
> connection/rate test:
> 
>  83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
> [...]
>            |--66.96%--__fput
>                       |--60.04%--eventpoll_release_file
>                                  |--58.41%--__mutex_lock.isra.6
>                                            |--56.56%--osq_lock
> 
> The application is multi-threaded, creates a new epoll entry for
> each incoming connection, and does not delete it before the
> connection shutdown - that is, before the connection's fd close().
> 
> Many different threads compete frequently for the epmutex lock,
> affecting the overall performance.
> 
> To reduce the contention this patch introduces explicit reference counting
> for the eventpoll struct. Each registered event acquires a reference,
> and references are released at ep_remove() time.
> 
> Additionally, this introduces a new 'dying' flag to prevent races between
> the EP file close() and the monitored file close().
> eventpoll_release_file() marks, under f_lock spinlock, each epitem as before

"as dying"?

> removing it, while EP file close() does not touch dying epitems.

The need for this dying flag is somewhat unclear to me.  I mean, if we
have refcounting done correctly, why the need for this flag?  Some
additional description of the dynamics would be helpful.

Methinks this flag is here to cope with the delayed freeing via
hlist_del_rcu(), but that's a guess?

> The eventpoll struct is released by whoever - among EP file close() and
> the monitored file close() - drops its last reference.
> 
> With all the above in place, we can drop the epmutex usage at disposal time.
> 
> Overall this produces a significant performance improvement in the
> mentioned connection/rate scenario: the mutex operations disappear from
> the topmost offenders in the perf report, and the measured connections/rate
> grows by ~60%.
> 
> To make the change more readable this additionally renames ep_free() to
> ep_clear_and_put(), and moves the actual memory cleanup into a separate
> ep_free() helper.
> 
> ...
>
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
>
> ...
>
> +	free_uid(ep->user);
> +	wakeup_source_unregister(ep->ws);
> +	kfree(ep);
> +}
> +
>  /*
>   * Removes a "struct epitem" from the eventpoll RB tree and deallocates
>   * all the associated resources. Must be called with "mtx" held.
> + * If the dying flag is set, do the removal only if force is true.

This comment describes "what" the code does, which is obvious from the
code anyway.  It's better if comments describe "why" the code does what
it does.

> + * Returns true if the eventpoll can be disposed.
>   */
> -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
>  {
>  	struct file *file = epi->ffd.file;
>  	struct epitems_head *to_free;
>
> ...
>
>  	/*
> -	 * We don't want to get "file->f_lock" because it is not
> -	 * necessary. It is not necessary because we're in the "struct file"
> -	 * cleanup path, and this means that no one is using this file anymore.
> -	 * So, for example, epoll_ctl() cannot hit here since if we reach this
> -	 * point, the file counter already went to zero and fget() would fail.
> -	 * The only hit might come from ep_free() but by holding the mutex
> -	 * will correctly serialize the operation. We do need to acquire
> -	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
> -	 * from anywhere but ep_free().
> -	 *
> -	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
> +	 * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from

s/cleat/clear/

> +	 * touching the epitems list before eventpoll_release_file() can access
> +	 * the ep->mtx.
>  	 */
> -	mutex_lock(&epmutex);
> -	if (unlikely(!file->f_ep)) {
> -		mutex_unlock(&epmutex);
> -		return;
> -	}
> -	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> +again:
> +	spin_lock(&file->f_lock);
> +	if (file->f_ep && file->f_ep->first) {
> +		/* detach from ep tree */

Comment appears to be misplaced - the following code doesn't detach
anything?

> +		epi = hlist_entry(file->f_ep->first, struct epitem, fllink);
> +		epi->dying = true;
> +		spin_unlock(&file->f_lock);
> +
> +		/*
> +		 * ep access is safe as we still own a reference to the ep
> +		 * struct
> +		 */
>  		ep = epi->ep;
> -		mutex_lock_nested(&ep->mtx, 0);
> -		ep_remove(ep, epi);
> +		mutex_lock(&ep->mtx);
> +		dispose = __ep_remove(ep, epi, true);
>  		mutex_unlock(&ep->mtx);
> +
> +		if (dispose)
> +			ep_free(ep);
> +		goto again;
>  	}
> ...
>


* Re: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
  2023-03-07 21:30 ` Andrew Morton
@ 2023-03-08  8:55   ` Paolo Abeni
  2023-03-08 18:40     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Paolo Abeni @ 2023-03-08  8:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: netdev, Soheil Hassas Yeganeh, Al Viro, Carlos Maiolino,
	Eric Biggers, Jacob Keller, Jens Axboe, Christian Brauner,
	linux-fsdevel

On Tue, 2023-03-07 at 13:30 -0800, Andrew Morton wrote:
> On Tue,  7 Mar 2023 19:46:37 +0100 Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > We are observing huge contention on the epmutex during an http
> > connection/rate test:
> > 
> >  83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
> > [...]
> >            |--66.96%--__fput
> >                       |--60.04%--eventpoll_release_file
> >                                  |--58.41%--__mutex_lock.isra.6
> >                                            |--56.56%--osq_lock
> > 
> > The application is multi-threaded, creates a new epoll entry for
> > each incoming connection, and does not delete it before the
> > connection shutdown - that is, before the connection's fd close().
> > 
> > Many different threads compete frequently for the epmutex lock,
> > affecting the overall performance.
> > 
> > To reduce the contention this patch introduces explicit reference counting
> > for the eventpoll struct. Each registered event acquires a reference,
> > and references are released at ep_remove() time.
> > 
> > Additionally, this introduces a new 'dying' flag to prevent races between
> > the EP file close() and the monitored file close().
> > eventpoll_release_file() marks, under f_lock spinlock, each epitem as before
> 
> "as dying"?
> 
> > removing it, while EP file close() does not touch dying epitems.
> 
> The need for this dying flag is somewhat unclear to me.  I mean, if we
> have refcounting done correctly, why the need for this flag?  Some
> additional description of the dynamics would be helpful.
> 
> Methinks this flag is here to cope with the delayed freeing via
> hlist_del_rcu(), but that's a guess?

First things first, thanks for the feedback!

Both ep_clear_and_put() and eventpoll_release_file() can release the
eventpoll struct. The latter must acquire the file->f_lock spinlock to
reach the eventpoll pointer. Callers of __ep_remove() need to acquire
the ep->mtx first, so eventpoll_release_file() must release the
spinlock after fetching the pointer and before acquiring the mutex.

Meanwhile, without the 'dying' flag, ep_clear_and_put() could kick in,
possibly on a different CPU, drop all the ep references and free the
struct.
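
Schematically, the use-after-free that the flag prevents would look
like this (simplified sketch; CPU0 closes the monitored file, CPU1
closes the epoll file):

	CPU0: eventpoll_release_file()    CPU1: ep_clear_and_put()
	  spin_lock(&file->f_lock);
	  epi = hlist_entry(...);
	  spin_unlock(&file->f_lock);
	                                    mutex_lock(&ep->mtx);
	                                    /* drops all the epitem refs */
	                                    mutex_unlock(&ep->mtx);
	                                    ep_free(ep);
	  mutex_lock(&ep->mtx);             /* use-after-free */
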
An alternative to the 'dying' flag would be removing the following loop
from ep_clear_and_put():

	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
                epi = rb_entry(rbp, struct epitem, rbn);
                ep_remove_safe(ep, epi);
                cond_resched();
        }

That way ep_clear_and_put() would no longer release all the ep
references. That option has the downside of keeping the ep struct alive
for an unlimited time after ep_clear_and_put(). A previous revision of
this patch implemented a similar behavior, but Eric Biggers noted it
could hurt some users:

https://lore.kernel.org/linux-fsdevel/Y3%2F4FW4mqY3fWRfU@sol.localdomain/

Please let me know if the above is clear enough.

> > The eventpoll struct is released by whoever - among EP file close() and
> > the monitored file close() - drops its last reference.
> > 
> > With all the above in place, we can drop the epmutex usage at disposal time.
> > 
> > Overall this produces a significant performance improvement in the
> > mentioned connection/rate scenario: the mutex operations disappear from
> > the topmost offenders in the perf report, and the measured connections/rate
> > grows by ~60%.
> > 
> > To make the change more readable this additionally renames ep_free() to
> > ep_clear_and_put(), and moves the actual memory cleanup into a separate
> > ep_free() helper.
> > 
> > ...
> > 
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > 
> > ...
> > 
> > +	free_uid(ep->user);
> > +	wakeup_source_unregister(ep->ws);
> > +	kfree(ep);
> > +}
> > +
> >  /*
> >   * Removes a "struct epitem" from the eventpoll RB tree and deallocates
> >   * all the associated resources. Must be called with "mtx" held.
> > + * If the dying flag is set, do the removal only if force is true.
> 
> This comment describes "what" the code does, which is obvious from the
> code anwyay.  It's better if comments describe "why" the code does what
> it does.

What about appending the following?

"""
This prevents ep_clear_and_put() from dropping all the ep references
while running concurrently with eventpoll_release_file().
"""

(I'll keep the 'what' part to hopefully make the 'why' more clear)

> > + * Returns true if the eventpoll can be disposed.
> >   */
> > -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> > +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
> >  {
> >  	struct file *file = epi->ffd.file;
> >  	struct epitems_head *to_free;
> > 
> > ...
> > 
> >  	/*
> > -	 * We don't want to get "file->f_lock" because it is not
> > -	 * necessary. It is not necessary because we're in the "struct file"
> > -	 * cleanup path, and this means that no one is using this file anymore.
> > -	 * So, for example, epoll_ctl() cannot hit here since if we reach this
> > -	 * point, the file counter already went to zero and fget() would fail.
> > -	 * The only hit might come from ep_free() but by holding the mutex
> > -	 * will correctly serialize the operation. We do need to acquire
> > -	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
> > -	 * from anywhere but ep_free().
> > -	 *
> > -	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
> > +	 * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
> 
> s/cleat/clear/
> 
> > +	 * touching the epitems list before eventpoll_release_file() can access
> > +	 * the ep->mtx.
> >  	 */
> > -	mutex_lock(&epmutex);
> > -	if (unlikely(!file->f_ep)) {
> > -		mutex_unlock(&epmutex);
> > -		return;
> > -	}
> > -	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> > +again:
> > +	spin_lock(&file->f_lock);
> > +	if (file->f_ep && file->f_ep->first) {
> > +		/* detach from ep tree */
> 
> Comment appears to be misplaced - the following code doesn't detach
> anything?

Indeed. This is a left-over from a previous revision. Can be dropped.


I have a process question: I understand this is queued for the mm-
nonmm-unstable branch. Should I send a v5 with the above comments
changes or an incremental patch or something completely different?

Thanks!

Paolo



* Re: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
  2023-03-08  8:55   ` Paolo Abeni
@ 2023-03-08 18:40     ` Andrew Morton
  2023-03-08 20:34       ` Paolo Abeni
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2023-03-08 18:40 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Soheil Hassas Yeganeh, Al Viro, Carlos Maiolino,
	Eric Biggers, Jacob Keller, Jens Axboe, Christian Brauner,
	linux-fsdevel

On Wed, 08 Mar 2023 09:55:31 +0100 Paolo Abeni <pabeni@redhat.com> wrote:

> I have a process question: I understand this is queued for the mm-
> nonmm-unstable branch. Should I send a v5 with the above comments
> changes or an incremental patch or something completely different?

Either is OK.  If it's a v5 I'll usually queue a delta so people who
have already reviewed can see what changed.  That delta is later
squashed and I'll use v5's changelog for the whole.



* Re: [PATCH v4 RESEND] epoll: use refcount to reduce ep_mutex contention
  2023-03-08 18:40     ` Andrew Morton
@ 2023-03-08 20:34       ` Paolo Abeni
  0 siblings, 0 replies; 7+ messages in thread
From: Paolo Abeni @ 2023-03-08 20:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: netdev, Soheil Hassas Yeganeh, Al Viro, Carlos Maiolino,
	Eric Biggers, Jacob Keller, Jens Axboe, Christian Brauner,
	linux-fsdevel

On Wed, 2023-03-08 at 10:40 -0800, Andrew Morton wrote:
> On Wed, 08 Mar 2023 09:55:31 +0100 Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > I have a process question: I understand this is queued for the mm-
> > nonmm-unstable branch. Should I send a v5 with the above comments
> > changes or an incremental patch or something completely different?
> 
> Either is OK.  If it's a v5 I'll usually queue a delta so people who
> have already reviewed can see what changed.  That delta is later
> squashed and I'll use v5's changelog for the whole.

Since even the changelog needs some fixup, I'll send a v5.

Thanks,

Paolo


