Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention

From: Roman Penyaev <rpenyaev@suse.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
Date: Tue, 04 Dec 2018 12:50:58 +0100	[thread overview]
Message-ID: <83edf06ce9db540495b53527eca3248c@suse.de> (raw)
In-Reply-To: <CAHk-=wiQX4HnBpyrxCwJmhBRff0GG65tOhsRnA=2KdYL=PBdyg@mail.gmail.com>

On 2018-12-03 18:34, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 3:03 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>> 
>> Also I'm not quite sure where to put very special lockless variant
>> of adding element to the list (list_add_tail_lockless() in this
>> patch).  Seems keeping it locally is safer.
> 
> That function is scary, and can be mis-used so easily that I
> definitely don't want to see it anywhere else.
> 
> Afaik, it's *really* important that only "add_tail" operations can be
> done in parallel.

True, adding element either to head or to tail can work in parallel,
any mix will corrupt the list.  I tried to reflect this in the comment
of list_add_tail_lockless().  Although not sure has it become clearer
to a reader or not.

> This also ends up making the memory ordering of "xchg()" very very
> important. Yes, we've documented it as being an ordering op, but I'm
> not sure we've relied on it this directly before.

Seems exit_mm() does exactly the same, the following chunk:

		up_read(&mm->mmap_sem);

		self.task = current;
		self.next = xchg(&core_state->dumper.next, &self);

At least code pattern looks similar.

> I also note that now we do more/different locking in the waitqueue
> handling, because the code now takes both that rwlock _and_ the
> waitqueue spinlock for wakeup. That also makes me worried that the
> "waitqueue_active()" games are no no longer reliable. I think they're
> fine (looks like they are only done under the write-lock, so it's
> effectively the same serialization anyway),

The only difference in waking up is that same epollitem waitqueue can be
observed as active from different CPUs, real wake up happens only once
(wake_up() takes wq.lock, so should be fine to call it multiple times),
but 1 is returned for all callers of ep_poll_callback() who has seen
the wq as active.

If epollitem is created with EPOLLEXCLUSIVE flag, then 1, which is 
returned
from ep_poll_callback(), indicates "break the loop, exclusive wake up 
has
happened" (the loop is in __wake_up_common), but even we consider this
exclusive wake up case this seems is totally fine, because wake up 
events
are not lost and epollitem will scan all ready fds and eventually will
observe all of the callers (who has returned 1 from ep_poll_callback())
as ready.  I hope I did not miss anything.

> but the upshoot of all of
> this is that I *really* want others to look at this patch too. A lot
> of small subtle things here.

Would be great if someone can look through, eventpoll.c looks a
bit abandoned.

--
Roman