From: Roman Penyaev <rpenyaev@suse.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
Date: Tue, 04 Dec 2018 12:50:58 +0100
Message-ID: <83edf06ce9db540495b53527eca3248c@suse.de>
In-Reply-To: <CAHk-=wiQX4HnBpyrxCwJmhBRff0GG65tOhsRnA=2KdYL=PBdyg@mail.gmail.com>

On 2018-12-03 18:34, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 3:03 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>> 
>> Also I'm not quite sure where to put very special lockless variant
>> of adding element to the list (list_add_tail_lockless() in this
>> patch).  Seems keeping it locally is safer.
> 
> That function is scary, and can be mis-used so easily that I
> definitely don't want to see it anywhere else.
> 
> Afaik, it's *really* important that only "add_tail" operations can be
> done in parallel.

True, adding elements only to the head or only to the tail can work in
parallel; any mix will corrupt the list.  I tried to reflect this in the
comment of list_add_tail_lockless(), although I am not sure whether it
has become clearer to a reader.
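
For readers without the patch at hand, the tail-only rule can be
illustrated with a minimal user-space sketch in C11.  This is not the
code from the patch: struct node, list_init(), add_tail_sketch() and
list_count() are hypothetical names, and atomic_exchange() merely
stands in for the kernel's xchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space model of a circular doubly linked list node. */
struct node {
	struct node *_Atomic next;
	struct node *_Atomic prev;
};

void list_init(struct node *head)
{
	atomic_store(&head->next, head);
	atomic_store(&head->prev, head);
}

/*
 * Tail-only add.  Assumption: it may race with other tail adds only,
 * never with head adds or removals.
 */
void add_tail_sketch(struct node *new, struct node *head)
{
	struct node *prev;

	assert(new != head);

	/* Set new->next before the node becomes reachable as the tail. */
	atomic_store(&new->next, head);

	/* Sequentially consistent exchange: atomically claim the tail slot. */
	prev = atomic_exchange(&head->prev, new);

	/* No other adder received this 'prev', so these stores do not race. */
	atomic_store(&prev->next, new);
	atomic_store(&new->prev, prev);
}

/* Walk ->next and count; only meaningful once all adders have finished. */
size_t list_count(struct node *head)
{
	size_t n = 0;
	struct node *p;

	for (p = atomic_load(&head->next); p != head;
	     p = atomic_load(&p->next))
		n++;
	return n;
}
```

The exchange hands each concurrent caller a different prev, so no two
callers ever write the same prev->next.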


> This also ends up making the memory ordering of "xchg()" very very
> important. Yes, we've documented it as being an ordering op, but I'm
> not sure we've relied on it this directly before.

It seems exit_mm() does exactly the same thing, in the following chunk:

		up_read(&mm->mmap_sem);

		self.task = current;
		self.next = xchg(&core_state->dumper.next, &self);


At least the code pattern looks similar.
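
The same pattern, reduced to a user-space C11 sketch: each arriving
waiter links itself behind a shared anchor with a single exchange.
struct waiter, struct anchor and the helper names are hypothetical,
and atomic_exchange() again stands in for xchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a waiter record; not the kernel's struct. */
struct waiter {
	int id;
	struct waiter *next;
};

/* Anchor whose ->next is swapped atomically by each arriving waiter. */
struct anchor {
	struct waiter *_Atomic next;
};

void anchor_init(struct anchor *a)
{
	atomic_init(&a->next, NULL);
}

/*
 * Link 'self' right behind the anchor.  The exchange publishes 'self'
 * and returns the previous first element in one atomic step, so
 * concurrent callers each receive a distinct predecessor.
 */
void waiter_enqueue(struct anchor *a, struct waiter *self, int id)
{
	assert(self != NULL);

	self->id = id;                                /* fill in the node */
	self->next = atomic_exchange(&a->next, self); /* then publish it */
}

/* Count waiters; the walker must wait until all enqueuers are done. */
size_t waiter_count(struct anchor *a)
{
	size_t n = 0;
	struct waiter *w;

	for (w = atomic_load(&a->next); w != NULL; w = w->next)
		n++;
	return n;
}
```

Note that self->next is written only after the exchange, so a walker
that starts too early could see a node whose ->next is not yet set.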


> I also note that now we do more/different locking in the waitqueue
> handling, because the code now takes both that rwlock _and_ the
> waitqueue spinlock for wakeup. That also makes me worried that the
> "waitqueue_active()" games are no longer reliable. I think they're
> fine (looks like they are only done under the write-lock, so it's
> effectively the same serialization anyway),


The only difference in waking up is that the same epollitem waitqueue can
be observed as active from different CPUs.  The real wake up happens only
once (wake_up() takes wq.lock, so it should be fine to call it multiple
times), but 1 is returned to all callers of ep_poll_callback() that have
seen the wq as active.

If the epollitem is created with the EPOLLEXCLUSIVE flag, then the 1
returned from ep_poll_callback() indicates "break the loop, an exclusive
wake up has happened" (the loop is in __wake_up_common()).  But even if
we consider this exclusive wake up case, it seems totally fine, because
wake up events are not lost: the epollitem will scan all ready fds and
will eventually observe all of the callers (those for which
ep_poll_callback() returned 1) as ready.  I hope I did not miss anything.
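
The "break the loop" behaviour can be modelled roughly as below.  This
is a simplified user-space sketch, not the kernel's __wake_up_common():
struct wq_entry, WQ_EXCLUSIVE, wake_up_sketch() and the two callbacks
are hypothetical names, and the array stands in for the waitqueue list.

```c
#include <assert.h>
#include <stddef.h>

#define WQ_EXCLUSIVE 0x01 /* hypothetical, modelled on WQ_FLAG_EXCLUSIVE */

struct wq_entry {
	unsigned int flags;
	int (*func)(struct wq_entry *entry); /* nonzero: a wake up happened */
	int calls;                           /* bookkeeping for the demo */
};

/*
 * Walk the entries in order and stop once 'nr_exclusive' exclusive
 * entries have reported a wake up.  Returns the number of callbacks
 * that were invoked.
 */
int wake_up_sketch(struct wq_entry **entries, size_t n, int nr_exclusive)
{
	int invoked = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		struct wq_entry *e = entries[i];
		int ret;

		assert(e->func != NULL);
		ret = e->func(e);
		invoked++;

		if (ret < 0)
			break;
		/* A nonzero return from an exclusive entry uses up one slot. */
		if (ret && (e->flags & WQ_EXCLUSIVE) && !--nr_exclusive)
			break;
	}
	return invoked;
}

/* Callback that reports a wake up, like a callback returning 1. */
int cb_woke(struct wq_entry *e)
{
	e->calls++;
	return 1;
}

/* Callback that reports no wake up. */
int cb_idle(struct wq_entry *e)
{
	e->calls++;
	return 0;
}
```

With nr_exclusive set to 1, the walk stops at the first exclusive entry
whose callback returns nonzero; entries after it are not called.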


> but the upshoot of all of
> this is that I *really* want others to look at this patch too. A lot
> of small subtle things here.

It would be great if someone could look through it; eventpoll.c looks a
bit abandoned.

--
Roman
