Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll

From: Jason Baron <jbaron@akamai.com>
To: Heiher <r@hev.cc>
Cc: Roman Penyaev <rpenyaev@suse.de>,
	linux-fsdevel@vger.kernel.org, Eric Wong <e@80x24.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Davide Libenzi <davidel@xmailserver.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Dominik Brodowski <linux@dominikbrodowski.net>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Sridhar Samudrala <sridhar.samudrala@intel.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
Date: Thu, 12 Sep 2019 17:36:44 -0400	[thread overview]
Message-ID: <7bb91058-eee8-d0c5-c4b3-e3f6cbda8c0b@akamai.com> (raw)
In-Reply-To: <CAHirt9gtCssDsS6NjfxSociPObL6vwL7ygCWjgZZYegsUt4YOg@mail.gmail.com>

On 9/11/19 4:19 AM, Heiher wrote:
> Hi,
> 
> On Fri, Sep 6, 2019 at 1:48 AM Jason Baron <jbaron@akamai.com> wrote:
>>
>>
>>
>> On 9/5/19 1:27 PM, Roman Penyaev wrote:
>>> On 2019-09-05 11:56, Heiher wrote:
>>>> Hi,
>>>>
>>>> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I created an epoll wakeup test project, listed some possible cases,
>>>>> and any other corner cases needs to be added?
>>>>>
>>>>> https://github.com/heiher/epoll-wakeup/blob/master/README.md
>>>>>
>>>>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 9/4/19 5:57 AM, Roman Penyaev wrote:
>>>>>>>> On 2019-09-03 23:08, Jason Baron wrote:
>>>>>>>>> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This is indeed a bug. (quick side note: could you please
>>>>> remove efd[1]
>>>>>>>>>> from your test, because it is not related to the reproduction
>>>>> of a
>>>>>>>>>> current bug).
>>>>>>>>>>
>>>>>>>>>> Your patch lacks a good description, what exactly you've
>>>>> fixed.  Let
>>>>>>>>>> me speak out loud and please correct me if I'm wrong, my
>>>>> understanding
>>>>>>>>>> of epoll internals has become a bit rusty: when epoll fds are
>>>>> nested
>>>>>>>>>> an attempt to harvest events (ep_scan_ready_list() call)
>>>>> produces a
>>>>>>>>>> second (repeated) event from an internal fd up to an external
>>>>> fd:
>>>>>>>>>>
>>>>>>>>>>      epoll_wait(efd[0], ...):
>>>>>>>>>>        ep_send_events():
>>>>>>>>>>           ep_scan_ready_list(depth=0):
>>>>>>>>>>             ep_send_events_proc():
>>>>>>>>>>                 ep_item_poll():
>>>>>>>>>>                   ep_scan_ready_list(depth=1):
>>>>>>>>>>                     ep_poll_safewake():
>>>>>>>>>>                       ep_poll_callback()
>>>>>>>>>>                         list_add_tail(&epi, &epi->rdllist);
>>>>>>>>>>                         ^^^^^^
>>>>>>>>>>                         repeated event
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In your patch you forbid wakeup for the cases, where depth !=
>>>>> 0, i.e.
>>>>>>>>>> for all nested cases. That seems clear.  But what if we can
>>>>> go further
>>>>>>>>>> and remove the whole chunk, which seems excessive:
>>>>>>>>>>
>>>>>>>>>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
>>>>>>>>>> eventpoll *ep,
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> -       if (!list_empty(&ep->rdllist)) {
>>>>>>>>>> -               /*
>>>>>>>>>> -                * Wake up (if active) both the eventpoll
>>>>> wait list and
>>>>>>>>>> -                * the ->poll() wait list (delayed after we
>>>>> release the
>>>>>>>>>> lock).
>>>>>>>>>> -                */
>>>>>>>>>> -               if (waitqueue_active(&ep->wq))
>>>>>>>>>> -                       wake_up(&ep->wq);
>>>>>>>>>> -               if (waitqueue_active(&ep->poll_wait))
>>>>>>>>>> -                       pwake++;
>>>>>>>>>> -       }
>>>>>>>>>>         write_unlock_irq(&ep->lock);
>>>>>>>>>>
>>>>>>>>>>         if (!ep_locked)
>>>>>>>>>>                 mutex_unlock(&ep->mtx);
>>>>>>>>>>
>>>>>>>>>> -       /* We have to call this outside the lock */
>>>>>>>>>> -       if (pwake)
>>>>>>>>>> -               ep_poll_safewake(&ep->poll_wait);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I reason like that: by the time we've reached the point of
>>>>> scanning events
>>>>>>>>>> for readiness all wakeups from ep_poll_callback have been
>>>>> already fired and
>>>>>>>>>> new events have been already accounted in ready list
>>>>> (ep_poll_callback()
>>>>>>>>>> calls
>>>>>>>>>> the same ep_poll_safewake()). Here, frankly, I'm not 100%
>>>>> sure and probably
>>>>>>>>>> missing some corner cases.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> So the: 'wake_up(&ep->wq);' part, I think is about waking up
>>>>> other
>>>>>>>>> threads that may be in waiting in epoll_wait(). For example,
>>>>> there may
>>>>>>>>> be multiple threads doing epoll_wait() on the same epoll fd,
>>>>> and the
>>>>>>>>> logic above seems to say thread 1 may have processed say N
>>>>> events and
>>>>>>>>> now its going to to go off to work those, so let's wake up
>>>>> thread 2 now
>>>>>>>>> to handle the next chunk.
>>>>>>>>
>>>>>>>> Not quite. Thread which calls ep_scan_ready_list() processes
>>>>> all the
>>>>>>>> events, and while processing those, removes them one by one
>>>>> from the
>>>>>>>> ready list.  But if event mask is !0 and event belongs to
>>>>>>>> Level Triggered Mode descriptor (let's say default mode) it
>>>>> tails event
>>>>>>>> again back to the list (because we are in level mode, so event
>>>>> should
>>>>>>>> be there).  So at the end of this traversing loop ready list is
>>>>> likely
>>>>>>>> not empty, and if so, wake up again is called for nested epoll
>>>>> fds.
>>>>>>>> But, those nested epoll fds should get already all the
>>>>> notifications
>>>>>>>> from the main event callback ep_poll_callback(), regardless any
>>>>> thread
>>>>>>>> which traverses events.
>>>>>>>>
>>>>>>>> I suppose this logic exists for decades, when Davide (the
>>>>> author) was
>>>>>>>> reshuffling the code here and there.
>>>>>>>>
>>>>>>>> But I do not feel confidence to state that this extra wakeup is
>>>>> bogus,
>>>>>>>> I just have a gut feeling that it looks excessive.
>>>>>>>
>>>>>>> Note that I was talking about the wakeup done on ep->wq not
>>>>> ep->poll_wait.
>>>>>>> The path that I'm concerned about is let's say that there are N
>>>>> events
>>>>>>> queued on the ready list. A thread that was woken up in
>>>>> epoll_wait may
>>>>>>> decide to only process say N/2 of then. Then it will call wakeup
>>>>> on ep->wq
>>>>>>> and this will wakeup another thread to process the remaining N/2.
>>>>> Without
>>>>>>> the wakeup, the original thread isn't going to process the events
>>>>> until
>>>>>>> it finishes with the original N/2 and gets back to epoll_wait().
>>>>> So I'm not
>>>>>>> sure how important that path is but I wanted to at least note the
>>>>> change
>>>>>>> here would impact that behavior.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -Jason
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> So I think removing all that even for the
>>>>>>>>> depth 0 case is going to change some behavior here. So
>>>>> perhaps, it
>>>>>>>>> should be removed for all depths except for 0? And if so, it
>>>>> may be
>>>>>>>>> better to make 2 patches here to separate these changes.
>>>>>>>>>
>>>>>>>>> For the nested wakeups, I agree that the extra wakeups seem
>>>>> unnecessary
>>>>>>>>> and it may make sense to remove them for all depths. I don't
>>>>> think the
>>>>>>>>> nested epoll semantics are particularly well spelled out, and
>>>>> afaict,
>>>>>>>>> nested epoll() has behaved this way for quite some time. And
>>>>> the current
>>>>>>>>> behavior is not bad in the way that a missing wakeup or false
>>>>> negative
>>>>>>>>> would be.
>>>>>>>>
>>>>>>>> That's 100% true! For edge mode extra wake up is not a bug, not
>>>>> optimal
>>>>>>>> for userspace - yes, but that can't lead to any lost wakeups.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Roman
>>>>>>>>
>>>>>>
>>>>>> I tried to remove the whole chunk of code that Roman said, and it
>>>>>> seems that there
>>>>>> are no obvious problems with the two test programs below:
>>>>
>>>> I recall this message, the test case 9/25/26 of epoll-wakeup (on
>>>> github) are failed while
>>>> the whole chunk are removed.
>>>>
>>>> Apply the original patch, all tests passed.
>>>
>>>
>>> These are failing on my bare 5.2.0-rc2
>>>
>>> TEST  bin/epoll31       FAIL
>>> TEST  bin/epoll46       FAIL
>>> TEST  bin/epoll50       FAIL
>>> TEST  bin/epoll32       FAIL
>>> TEST  bin/epoll19       FAIL
>>> TEST  bin/epoll27       FAIL
>>> TEST  bin/epoll42       FAIL
>>> TEST  bin/epoll34       FAIL
>>> TEST  bin/epoll48       FAIL
>>> TEST  bin/epoll40       FAIL
>>> TEST  bin/epoll20       FAIL
>>> TEST  bin/epoll28       FAIL
>>> TEST  bin/epoll38       FAIL
>>> TEST  bin/epoll52       FAIL
>>> TEST  bin/epoll24       FAIL
>>> TEST  bin/epoll23       FAIL
>>>
>>>
>>> These are failing if your patch is applied:
>>> (my 5.2.0-rc2 is old? broken?)
>>>
>>> TEST  bin/epoll46       FAIL
>>> TEST  bin/epoll42       FAIL
>>> TEST  bin/epoll34       FAIL
>>> TEST  bin/epoll48       FAIL
>>> TEST  bin/epoll40       FAIL
>>> TEST  bin/epoll44       FAIL
>>> TEST  bin/epoll38       FAIL
>>>
>>> These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called,
>>> but wakeup(&ep->wq); is still invoked:
>>>
>>> TEST  bin/epoll46       FAIL
>>> TEST  bin/epoll42       FAIL
>>> TEST  bin/epoll34       FAIL
>>> TEST  bin/epoll40       FAIL
>>> TEST  bin/epoll44       FAIL
>>> TEST  bin/epoll38       FAIL
>>>
>>> So at least 48 has been "fixed".
>>>
>>> These are failing if the whole chunk is removed, like your
>>> said 9,25,26 are among which do not pass:
>>>
>>> TEST  bin/epoll26       FAIL
>>> TEST  bin/epoll42       FAIL
>>> TEST  bin/epoll34       FAIL
>>> TEST  bin/epoll9        FAIL
>>> TEST  bin/epoll48       FAIL
>>> TEST  bin/epoll40       FAIL
>>> TEST  bin/epoll25       FAIL
>>> TEST  bin/epoll44       FAIL
>>> TEST  bin/epoll38       FAIL
>>>
>>> This can be a good test suite, probably can be added to kselftests?
> 
> Thank you, I have updated epoll-tests to fix these issues. I think this is good
> news if we can added to kselftests. ;)
> 
>>>
>>> --
>>> Roman
>>>
>>
>>
>> Indeed, I just tried the same test suite and I am seeing similar
>> failures - it looks like its a bit timing dependent. It looks like all
>> the failures are caused by a similar issue. For example, take epoll34:
>>
>>          t0   t1
>>      (ew) |    | (ew)
>>          e0    |
>>       (lt) \  /
>>              |
>>             e1
>>              | (et)
>>             s0
>>
>>
>> The test is trying to assert that an epoll_wait() on e1 and and
>> epoll_wait() on e0 both return 1 event for EPOLLIN. However, the
>> epoll_wait on e1 is done in a thread and it can happen before or after
>> the epoll_wait() is called against e0. If the epoll_wait() on e1 happens
>> first then because its attached as 'et', it consumes the event. So that
>> there is no longer an event reported at e0. I think that is reasonable
>> semantics here. However if the wait on e0 happens after the wait on e1
>> then the test will pass as both waits will see the event. Thus, a patch
>> like this will 'fix' this testcase:
>>
>> --- a/src/epoll34.c
>> +++ b/src/epoll34.c
>> @@ -59,15 +59,15 @@ int main(int argc, char *argv[])
>>         if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
>>                 goto out;
>>
>> -       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
>> -               goto out;
>> -
>>         if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
>>                 goto out;
>>
>>         if (epoll_wait(efd[0], &e, 1, 500) == 1)
>>                 count++;
>>
>> +       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
>> +               goto out;
>> +
>>         if (pthread_join(tw, NULL) < 0)
>>                 goto out;
>>
>>
>> I found all the other failures to be of similar origin. I suspect Heiher
>> didn't see failures due to the thread timings here.
> 
> Thank you. I also found a multi-threaded concurrent accumulation problem,
> and that has been changed to atomic operations. I think we should allow two
> different behaviors to be passed because they are all correctly.
> 
> thread 2:
> if (epoll_wait(efd[1], &e, 1, 500) == 1)
>     __sync_fetch_and_or(&count, 1);
> 
> thread1:
> if (epoll_wait(efd[0], &e, 1, 500) == 1)
>     __sync_fetch_and_or(&count, 2);
> 
> check:
> if ((count != 1) && (count != 3))
>     goto out;
> 
>>
>> I also found that all the testcases pass if we leave the wakeup(&ep->wq)
>> call for the depth 0 case (and remove the pwake part).
> 
> So, We need to keep the wakeup(&ep-wq) for all depth, and only
> wakeup(&ep->poll_wait)
> for depth 0 and/or ep->rdlist from empty to be not empty?
> 

Yes, so my thinking is yes, leave the wakeup(&ep->wq) for all depths. It
may not strictly be necessary for depths > 0 but I do have some concerns
about it and doesn't affect this specific case. So I would leave it for
all depths and if you want to remove it for depths > 0, its a separate
patch.

I'm now not sure its really safe to remove the wakeup(&ep->poll_wait)
either. Take the case where we have:

       (et)  (lt)
t0---e0---e1----s0

Let's say there is thread t1 also doing epoll_wait() on e1. Now, let's
say s0 fires an event and wakes up t1. While t1 is reaping events from
t1 its calling ep_scan_ready_list() and while its processing the events
it drops the ep->lock and more events get queued to the overflow list.
Now, let's say t0 wakes up from epoll_wait() it will not see any of the
events queued to e1 yet (since they are not on the readylist yet). Thus,
t0 goes back to sleep and I think can miss an event here. So I think
that's what the wakeup(&ep->poll_wait) is about.

What I think we can do to fix the original case, is that perhaps in that
case where events get queued to the overflow list we do in fact do the
wakeup(&ep->poll_wait). So perhaps, we can just have variable that
checks if there is anything in the overflow list, and if so do the
wakeup(&ep->poll_wait) conditional on there being things on the overflow
list. Although I can't say I tried to see if this would work.

That said, I think ep_modify() might also subtlety be dependent on the
current behavior, in that ep_item_poll() uses ep_scan_ready_list() and
when an event is modified it may need to wakeup nested epoll fds to
inform them of events. Thus, perhaps, we would also need an explicit
call to wakeup(&ep->poll_wait) when ep_item_poll() returns events.

Thanks,

-Jason