linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jason Baron <jbaron@akamai.com>
To: Roman Penyaev <rpenyaev@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: hev <r@hev.cc>,
	linux-fsdevel@vger.kernel.org, Al Viro <viro@zeniv.linux.org.uk>,
	Davide Libenzi <davidel@xmailserver.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Dominik Brodowski <linux@dominikbrodowski.net>,
	Eric Wong <e@80x24.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Sridhar Samudrala <sridhar.samudrala@intel.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode
Date: Thu, 3 Oct 2019 12:13:57 -0400	[thread overview]
Message-ID: <9ca02c9b-85b7-dced-9c82-1fc453c49b8a@akamai.com> (raw)
In-Reply-To: <c0a96dd89d0a361d8061b8c356b57ed2@suse.de>



On 9/30/19 7:55 AM, Roman Penyaev wrote:
> On 2019-09-28 04:29, Andrew Morton wrote:
>> On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@hev.cc> wrote:
>>
>>> From: Heiher <r@hev.cc>
>>>
>>> Take the case where we have:
>>>
>>>         t0
>>>          | (ew)
>>>         e0
>>>          | (et)
>>>         e1
>>>          | (lt)
>>>         s0
>>>
>>> t0: thread 0
>>> e0: epoll fd 0
>>> e1: epoll fd 1
>>> s0: socket fd 0
>>> ew: epoll_wait
>>> et: edge-trigger
>>> lt: level-trigger
>>>
>>> We only need to wakeup nested epoll fds if something has been queued
>>> to the
>>> overflow list, since the ep_poll() traverses the rdllist during
>>> recursive poll
>>> and thus events on the overflow list may not be visible yet.
>>>
>>> Test code:
>>
>> Look sane to me.  Do you have any performance testing results which
>> show a benefit?
>>
>> epoll maintainership isn't exactly a hive of activity nowadays :(
>> Roman, would you please have time to review this?
> 
> So here is my observation: current patch does not fix the described
> problem (double wakeup) for the case, when new event comes exactly
> to the ->ovflist.  According to the patch this is the desired intention:
> 
>    /*
>     * We only need to wakeup nested epoll fds if something has been queued
>     * to the overflow list, since the ep_poll() traverses the rdllist
>     * during recursive poll and thus events on the overflow list may not be
>     * visible yet.
>     */
>     if (nepi != NULL)
>        pwake++;
> 
>     ....
> 
>     if (pwake == 2)
>        ep_poll_safewake(&ep->poll_wait);
> 
> 
> but this actually means that we repeat the same behavior (double wakeup)
> but only for the case, when event comes to the ->ovflist.
> 
> How to reproduce? Can be easily done (ok, not so easy but it is possible
> to try): to the given userspace test we need to add one more socket and
> immediately fire the event:
> 
>     e.events = EPOLLIN;
>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>        goto out;
> 
>     /*
>      * Signal any fd to let epoll_wait() to call ep_scan_ready_list()
>      * in order to "catch" it there and add new event to ->ovflist.
>      */
>      if (write(s2fd[1], "w", 1) != 1)
>         goto out;
> 
> That is done in order to let the following epoll_wait() call to invoke
> ep_scan_ready_list(), where we can "catch" and insert new event exactly
> to the ->ovflist. In order to insert event exactly in the correct list
> I introduce artificial delay.
> 
> Modified test and kernel patch is below.  Here is the output of the
> testing tool with some debug lines from kernel:
> 
>   # ~/devel/test/edge-bug
>   [   59.263178] ### sleep 2
>   >> write to sock
>   [   61.318243] ### done sleep
>   [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
> events_in_rdllist=1, events_in_ovflist=1
>   [   61.321204] ### sleep 2
>   [   63.398325] ### done sleep
>   error: What?! Again?!
> 
> First epoll_wait() call (ep_scan_ready_list()) observes 2 events
> (see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
> wanted to achieve, so eventually ep_poll_safewake() is called again
> which leads to double wakeup.
> 
> In my opinion current patch as it is should be dropped, it does not
> fix the described problem but just hides it.
> 
> -- 

Yes, there are 2 wakeups in the test case you describe, but if the
second event (write to s1fd) gets queued after the first call to
epoll_wait(), we are going to get 2 wakeups anyways. So yes, there may
be a slightly bigger window with this patch for 2 wakeups, but its small
and I tried to be conservative with the patch - I'd rather get an
occasional 2nd wakeup then miss one. Trying to debug missing wakeups
isn't always fun...

That said, the reason for propagating events that end up on the overflow
list was to prevent the race of the wakee not seeing events because they
were still on the overflow list. In the testcase, imagine if there was a
thread doing epoll_wait() on efd[0], and then a write happends on s1fd.
I thought it was possible then that a 2nd thread doing epoll_wait() on
efd[1], wakes up, checks efd[0] and sees no events because they are
still potentially on the overflow list. However, I think that case is
not possible because the thread doing epoll_wait() on efd[0] is going to
have the ep->mtx, and thus when the thread wakes up on efd[1], its going
to have to be ordered because its also grabbing the ep->mtx associated
with efd[0].

So I think its safe to do the following if we want to go further than
the proposed patch, which is what you suggested earlier in the thread
(minus keeping the wakeup on ep->wq).

 diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index c4159bc..61d653d1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -671,7 +671,6 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
                              void *priv, int depth, bool ep_locked)
 {
        __poll_t res;
-       int pwake = 0;
        struct epitem *epi, *nepi;
        LIST_HEAD(txlist);

@@ -741,23 +740,16 @@ static __poll_t ep_scan_ready_list(struct
eventpoll *ep,

        if (!list_empty(&ep->rdllist)) {
                /*
-                * Wake up (if active) both the eventpoll wait list and
-                * the ->poll() wait list (delayed after we release the
lock).
+                * Wake up (if active) the eventpoll wait list
                 */
                if (waitqueue_active(&ep->wq))
                        wake_up(&ep->wq);
-               if (waitqueue_active(&ep->poll_wait))
-                       pwake++;
        }
        write_unlock_irq(&ep->lock);

        if (!ep_locked)
                mutex_unlock(&ep->mtx);

-       /* We have to call this outside the lock */
-       if (pwake)
-               ep_poll_safewake(&ep->poll_wait);
-
        return res;
 }


Thanks,

-Jason



  reply	other threads:[~2019-10-03 17:27 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-25  1:56 [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode hev
2019-09-28  2:29 ` Andrew Morton
2019-09-28 19:07   ` Roman Penyaev
2019-09-30 11:55   ` Roman Penyaev
2019-10-03 16:13     ` Jason Baron [this message]
2019-10-07 10:54       ` Roman Penyaev
2019-10-07 16:42         ` Jason Baron
2019-10-07 18:30           ` Roman Penyaev
2019-10-07 18:43             ` Jason Baron
2019-10-07 19:10               ` Roman Penyaev
2019-10-09  6:03                 ` Heiher
2019-10-08  9:55               ` Roman Penyaev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9ca02c9b-85b7-dced-9c82-1fc453c49b8a@akamai.com \
    --to=jbaron@akamai.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave@stgolabs.net \
    --cc=davidel@xmailserver.org \
    --cc=e@80x24.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@dominikbrodowski.net \
    --cc=r@hev.cc \
    --cc=rpenyaev@suse.de \
    --cc=sridhar.samudrala@intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).