Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll

From: Jason Baron <jbaron@akamai.com>
To: Roman Penyaev <rpenyaev@suse.de>, Heiher <r@hev.cc>
Cc: linux-fsdevel@vger.kernel.org, Eric Wong <e@80x24.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Davide Libenzi <davidel@xmailserver.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Dominik Brodowski <linux@dominikbrodowski.net>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Sridhar Samudrala <sridhar.samudrala@intel.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
Date: Thu, 5 Sep 2019 13:44:00 -0400	[thread overview]
Message-ID: <6fd44437-fdd8-3be3-a2ef-6c3534d4e954@akamai.com> (raw)
In-Reply-To: <d5914273597707b8780d188688fe0ac2@suse.de>



On 9/5/19 1:27 PM, Roman Penyaev wrote:
> On 2019-09-05 11:56, Heiher wrote:
>> Hi,
>>
>> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote:
>>>
>>> Hi,
>>>
>>> I created an epoll wakeup test project, listed some possible cases,
>>> and any other corner cases needs to be added?
>>>
>>> https://github.com/heiher/epoll-wakeup/blob/master/README.md
>>>
>>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote:
>>> >
>>> > Hi,
>>> >
>>> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
>>> > >
>>> > >
>>> > >
>>> > > On 9/4/19 5:57 AM, Roman Penyaev wrote:
>>> > > > On 2019-09-03 23:08, Jason Baron wrote:
>>> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>>> > > >>> Hi,
>>> > > >>>
>>> > > >>> This is indeed a bug. (quick side note: could you please
>>> remove efd[1]
>>> > > >>> from your test, because it is not related to the reproduction
>>> of a
>>> > > >>> current bug).
>>> > > >>>
>>> > > >>> Your patch lacks a good description, what exactly you've
>>> fixed.  Let
>>> > > >>> me speak out loud and please correct me if I'm wrong, my
>>> understanding
>>> > > >>> of epoll internals has become a bit rusty: when epoll fds are
>>> nested
>>> > > >>> an attempt to harvest events (ep_scan_ready_list() call)
>>> produces a
>>> > > >>> second (repeated) event from an internal fd up to an external
>>> fd:
>>> > > >>>
>>> > > >>>      epoll_wait(efd[0], ...):
>>> > > >>>        ep_send_events():
>>> > > >>>           ep_scan_ready_list(depth=0):
>>> > > >>>             ep_send_events_proc():
>>> > > >>>                 ep_item_poll():
>>> > > >>>                   ep_scan_ready_list(depth=1):
>>> > > >>>                     ep_poll_safewake():
>>> > > >>>                       ep_poll_callback()
>>> > > >>>                         list_add_tail(&epi, &epi->rdllist);
>>> > > >>>                         ^^^^^^
>>> > > >>>                         repeated event
>>> > > >>>
>>> > > >>>
>>> > > >>> In your patch you forbid wakeup for the cases, where depth !=
>>> 0, i.e.
>>> > > >>> for all nested cases. That seems clear.  But what if we can
>>> go further
>>> > > >>> and remove the whole chunk, which seems excessive:
>>> > > >>>
>>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
>>> > > >>> eventpoll *ep,
>>> > > >>>
>>> > > >>> -
>>> > > >>> -       if (!list_empty(&ep->rdllist)) {
>>> > > >>> -               /*
>>> > > >>> -                * Wake up (if active) both the eventpoll
>>> wait list and
>>> > > >>> -                * the ->poll() wait list (delayed after we
>>> release the
>>> > > >>> lock).
>>> > > >>> -                */
>>> > > >>> -               if (waitqueue_active(&ep->wq))
>>> > > >>> -                       wake_up(&ep->wq);
>>> > > >>> -               if (waitqueue_active(&ep->poll_wait))
>>> > > >>> -                       pwake++;
>>> > > >>> -       }
>>> > > >>>         write_unlock_irq(&ep->lock);
>>> > > >>>
>>> > > >>>         if (!ep_locked)
>>> > > >>>                 mutex_unlock(&ep->mtx);
>>> > > >>>
>>> > > >>> -       /* We have to call this outside the lock */
>>> > > >>> -       if (pwake)
>>> > > >>> -               ep_poll_safewake(&ep->poll_wait);
>>> > > >>>
>>> > > >>>
>>> > > >>> I reason like that: by the time we've reached the point of
>>> scanning events
>>> > > >>> for readiness all wakeups from ep_poll_callback have been
>>> already fired and
>>> > > >>> new events have been already accounted in ready list
>>> (ep_poll_callback()
>>> > > >>> calls
>>> > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100%
>>> sure and probably
>>> > > >>> missing some corner cases.
>>> > > >>>
>>> > > >>> Thoughts?
>>> > > >>
>>> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up
>>> other
>>> > > >> threads that may be in waiting in epoll_wait(). For example,
>>> there may
>>> > > >> be multiple threads doing epoll_wait() on the same epoll fd,
>>> and the
>>> > > >> logic above seems to say thread 1 may have processed say N
>>> events and
>>> > > >> now its going to to go off to work those, so let's wake up
>>> thread 2 now
>>> > > >> to handle the next chunk.
>>> > > >
>>> > > > Not quite. Thread which calls ep_scan_ready_list() processes
>>> all the
>>> > > > events, and while processing those, removes them one by one
>>> from the
>>> > > > ready list.  But if event mask is !0 and event belongs to
>>> > > > Level Triggered Mode descriptor (let's say default mode) it
>>> tails event
>>> > > > again back to the list (because we are in level mode, so event
>>> should
>>> > > > be there).  So at the end of this traversing loop ready list is
>>> likely
>>> > > > not empty, and if so, wake up again is called for nested epoll
>>> fds.
>>> > > > But, those nested epoll fds should get already all the
>>> notifications
>>> > > > from the main event callback ep_poll_callback(), regardless any
>>> thread
>>> > > > which traverses events.
>>> > > >
>>> > > > I suppose this logic exists for decades, when Davide (the
>>> author) was
>>> > > > reshuffling the code here and there.
>>> > > >
>>> > > > But I do not feel confidence to state that this extra wakeup is
>>> bogus,
>>> > > > I just have a gut feeling that it looks excessive.
>>> > >
>>> > > Note that I was talking about the wakeup done on ep->wq not
>>> ep->poll_wait.
>>> > > The path that I'm concerned about is let's say that there are N
>>> events
>>> > > queued on the ready list. A thread that was woken up in
>>> epoll_wait may
>>> > > decide to only process say N/2 of then. Then it will call wakeup
>>> on ep->wq
>>> > > and this will wakeup another thread to process the remaining N/2.
>>> Without
>>> > > the wakeup, the original thread isn't going to process the events
>>> until
>>> > > it finishes with the original N/2 and gets back to epoll_wait().
>>> So I'm not
>>> > > sure how important that path is but I wanted to at least note the
>>> change
>>> > > here would impact that behavior.
>>> > >
>>> > > Thanks,
>>> > >
>>> > > -Jason
>>> > >
>>> > >
>>> > > >
>>> > > >> So I think removing all that even for the
>>> > > >> depth 0 case is going to change some behavior here. So
>>> perhaps, it
>>> > > >> should be removed for all depths except for 0? And if so, it
>>> may be
>>> > > >> better to make 2 patches here to separate these changes.
>>> > > >>
>>> > > >> For the nested wakeups, I agree that the extra wakeups seem
>>> unnecessary
>>> > > >> and it may make sense to remove them for all depths. I don't
>>> think the
>>> > > >> nested epoll semantics are particularly well spelled out, and
>>> afaict,
>>> > > >> nested epoll() has behaved this way for quite some time. And
>>> the current
>>> > > >> behavior is not bad in the way that a missing wakeup or false
>>> negative
>>> > > >> would be.
>>> > > >
>>> > > > That's 100% true! For edge mode extra wake up is not a bug, not
>>> optimal
>>> > > > for userspace - yes, but that can't lead to any lost wakeups.
>>> > > >
>>> > > > --
>>> > > > Roman
>>> > > >
>>> >
>>> > I tried to remove the whole chunk of code that Roman said, and it
>>> > seems that there
>>> > are no obvious problems with the two test programs below:
>>
>> I recall this message, the test case 9/25/26 of epoll-wakeup (on
>> github) are failed while
>> the whole chunk are removed.
>>
>> Apply the original patch, all tests passed.
> 
> 
> These are failing on my bare 5.2.0-rc2
> 
> TEST  bin/epoll31       FAIL
> TEST  bin/epoll46       FAIL
> TEST  bin/epoll50       FAIL
> TEST  bin/epoll32       FAIL
> TEST  bin/epoll19       FAIL
> TEST  bin/epoll27       FAIL
> TEST  bin/epoll42       FAIL
> TEST  bin/epoll34       FAIL
> TEST  bin/epoll48       FAIL
> TEST  bin/epoll40       FAIL
> TEST  bin/epoll20       FAIL
> TEST  bin/epoll28       FAIL
> TEST  bin/epoll38       FAIL
> TEST  bin/epoll52       FAIL
> TEST  bin/epoll24       FAIL
> TEST  bin/epoll23       FAIL
> 
> 
> These are failing if your patch is applied:
> (my 5.2.0-rc2 is old? broken?)
> 
> TEST  bin/epoll46       FAIL
> TEST  bin/epoll42       FAIL
> TEST  bin/epoll34       FAIL
> TEST  bin/epoll48       FAIL
> TEST  bin/epoll40       FAIL
> TEST  bin/epoll44       FAIL
> TEST  bin/epoll38       FAIL
> 
> These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called,
> but wakeup(&ep->wq); is still invoked:
> 
> TEST  bin/epoll46       FAIL
> TEST  bin/epoll42       FAIL
> TEST  bin/epoll34       FAIL
> TEST  bin/epoll40       FAIL
> TEST  bin/epoll44       FAIL
> TEST  bin/epoll38       FAIL
> 
> So at least 48 has been "fixed".
> 
> These are failing if the whole chunk is removed, like your
> said 9,25,26 are among which do not pass:
> 
> TEST  bin/epoll26       FAIL
> TEST  bin/epoll42       FAIL
> TEST  bin/epoll34       FAIL
> TEST  bin/epoll9        FAIL
> TEST  bin/epoll48       FAIL
> TEST  bin/epoll40       FAIL
> TEST  bin/epoll25       FAIL
> TEST  bin/epoll44       FAIL
> TEST  bin/epoll38       FAIL
> 
> This can be a good test suite, probably can be added to kselftests?
> 
> -- 
> Roman
> 


Indeed, I just tried the same test suite and I am seeing similar
failures - it looks like its a bit timing dependent. It looks like all
the failures are caused by a similar issue. For example, take epoll34:

         t0   t1
     (ew) |    | (ew)
         e0    |
      (lt) \  /
             |
            e1
             | (et)
            s0


The test is trying to assert that an epoll_wait() on e1 and and
epoll_wait() on e0 both return 1 event for EPOLLIN. However, the
epoll_wait on e1 is done in a thread and it can happen before or after
the epoll_wait() is called against e0. If the epoll_wait() on e1 happens
first then because its attached as 'et', it consumes the event. So that
there is no longer an event reported at e0. I think that is reasonable
semantics here. However if the wait on e0 happens after the wait on e1
then the test will pass as both waits will see the event. Thus, a patch
like this will 'fix' this testcase:

--- a/src/epoll34.c
+++ b/src/epoll34.c
@@ -59,15 +59,15 @@ int main(int argc, char *argv[])
        if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
                goto out;

-       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
-               goto out;
-
        if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
                goto out;

        if (epoll_wait(efd[0], &e, 1, 500) == 1)
                count++;

+       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
+               goto out;
+
        if (pthread_join(tw, NULL) < 0)
                goto out;


I found all the other failures to be of similar origin. I suspect Heiher
didn't see failures due to the thread timings here.

I also found that all the testcases pass if we leave the wakeup(&ep->wq)
call for the depth 0 case (and remove the pwake part).

Thanks,

-Jason