Re: [PATCH v5] epoll: use refcount to reduce ep_mutex contention

From: Paolo Abeni <pabeni@redhat.com>
To: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org, Soheil Hassas Yeganeh <soheil@google.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Carlos Maiolino <cmaiolino@redhat.com>,
	Eric Biggers <ebiggers@kernel.org>,
	Jacob Keller <jacob.e.keller@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jens Axboe <axboe@kernel.dk>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v5] epoll: use refcount to reduce ep_mutex contention
Date: Thu, 09 Mar 2023 13:47:23 +0100	[thread overview]
Message-ID: <5de48abb24e033a5eb457007d0a5b6b391831fd4.camel@redhat.com> (raw)
In-Reply-To: <20230309111803.2z242amw4f5nwfwu@wittgenstein>

On Thu, 2023-03-09 at 12:18 +0100, Christian Brauner wrote:
> On Wed, Mar 08, 2023 at 10:51:31PM +0100, Paolo Abeni wrote:
> > We are observing huge contention on the epmutex during an http
> > connection/rate test:
> > 
> >  83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
> > [...]
> >            |--66.96%--__fput
> >                       |--60.04%--eventpoll_release_file
> >                                  |--58.41%--__mutex_lock.isra.6
> >                                            |--56.56%--osq_lock
> > 
> > The application is multi-threaded, creates a new epoll entry for
> > each incoming connection, and does not delete it before the
> > connection shutdown - that is, before the connection's fd close().
> > 
> > Many different threads compete frequently for the epmutex lock,
> > affecting the overall performance.
> > 
> > To reduce the contention this patch introduces explicit reference counting
> > for the eventpoll struct. Each registered event acquires a reference,
> > and references are released at ep_remove() time.
> > 
> > The eventpoll struct is released by whoever - among EP file close() and
> > and the monitored file close() drops its last reference.
> > 
> > Additionally, this introduces a new 'dying' flag to prevent races between
> > the EP file close() and the monitored file close().
> > ep_eventpoll_release() marks, under f_lock spinlock, each epitem as dying
> > before removing it, while EP file close() does not touch dying epitems.
> > 
> > The above is needed as both close operations could run concurrently and
> > drop the EP reference acquired via the epitem entry. Without the above
> > flag, the monitored file close() could reach the EP struct via the epitem
> > list while the epitem is still listed and then try to put it after its
> > disposal.
> > 
> > An alternative could be avoiding touching the references acquired via
> > the epitems at EP file close() time, but that could leave the EP struct
> > alive for potentially unlimited time after EP file close(), with nasty
> > side effects.
> > 
> > With all the above in place, we can drop the epmutex usage at disposal time.
> > 
> > Overall this produces a significant performance improvement in the
> > mentioned connection/rate scenario: the mutex operations disappear from
> > the topmost offenders in the perf report, and the measured connections/rate
> > grows by ~60%.
> > 
> > To make the change more readable this additionally renames ep_free() to
> > ep_clear_and_put(), and moves the actual memory cleanup in a separate
> > ep_free() helper.
> > 
> > Tested-by: Xiumei Mu <xmu@redhiat.com>
> 
> Is that a typo "redhiat" in the mail?

Indeed yes! Thanks for noticing. Should I share a new revision to
address that?

[...]

> > @@ -700,6 +733,11 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> >  
> >  	/* Remove the current item from the list of epoll hooks */
> >  	spin_lock(&file->f_lock);
> > +	if (epi->dying && !force) {
> > +		spin_unlock(&file->f_lock);
> > +		return false;
> > +	}
> 
> It's a bit unfortunate that we have to acquire the spinlock just to immediately
> having to drop it. Slighly ugly but workable could be but that depends on how
> likely we find it that we end up with !force and a dying fd...

The concurrent close() of both the EP file and the monitored file
should be a quite rare event, I *think* over-optimizing it should not
be worthy. Additionally, even with the spinlock, the proposed patch
should be considerably faster then the previous code even in that
specific scenario, as before we the epmutex conflicting with a much
larger scope. 

The only doubt I have about the suggested code, should we add a
READ_ONCE() even on the later 'dying' check? My understanding is that
the compiler should be allowed to emit a single read instruction,
before the spinlock itself.

Cheers,

Paolo