* [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
@ 2019-09-02  5:20 hev
  2019-09-02 15:36 ` Roman Penyaev
  0 siblings, 1 reply; 12+ messages in thread
From: hev @ 2019-09-02  5:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: e, Heiher, Al Viro, Andrew Morton, Davide Libenzi, Davidlohr Bueso,
      Dominik Brodowski, Jason Baron, Linus Torvalds, Roman Penyaev,
      Sridhar Samudrala, linux-kernel

From: Heiher <r@hev.cc>

The structure of the event polls:

  efd[1]: { efd[2] (EPOLLIN) }    efd[0]: { efd[2] (EPOLLIN | EPOLLET) }
        |                               |
        +----------------+--------------+
                         |
                         v
              efd[2]: { sfd[0] (EPOLLIN) }

When sfd[0] becomes readable:
 * epoll_wait(efd[0], ..., 0) should return efd[2]'s events on the first
   call, and return 0 on subsequent calls, because efd[2] was added in
   edge-triggered mode.
 * epoll_wait(efd[1], ..., 0) should return efd[2]'s events on every call
   until efd[2] is no longer readable (epoll_wait(efd[2], ...) => 0),
   because efd[2] was added to efd[1] in level-triggered mode.
 * epoll_wait(efd[2], ..., 0) should return sfd[0]'s events on every call
   until sfd[0] is no longer readable (read(sfd[0], ...) => EAGAIN),
   because sfd[0] was added in level-triggered mode.
Test code:

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(int argc, char *argv[])
{
	int sfd[2];
	int efd[3];
	int nfds;
	struct epoll_event e;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
		goto out;

	efd[0] = epoll_create(1);
	if (efd[0] < 0)
		goto out;

	efd[1] = epoll_create(1);
	if (efd[1] < 0)
		goto out;

	efd[2] = epoll_create(1);
	if (efd[2] < 0)
		goto out;

	e.events = EPOLLIN;
	if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[0], &e) < 0)
		goto out;

	e.events = EPOLLIN;
	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, efd[2], &e) < 0)
		goto out;

	e.events = EPOLLIN | EPOLLET;
	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0)
		goto out;

	if (write(sfd[1], "w", 1) != 1)
		goto out;

	nfds = epoll_wait(efd[0], &e, 1, 0);
	if (nfds != 1)
		goto out;

	nfds = epoll_wait(efd[0], &e, 1, 0);
	if (nfds != 0)
		goto out;

	nfds = epoll_wait(efd[1], &e, 1, 0);
	if (nfds != 1)
		goto out;

	nfds = epoll_wait(efd[1], &e, 1, 0);
	if (nfds != 1)
		goto out;

	nfds = epoll_wait(efd[2], &e, 1, 0);
	if (nfds != 1)
		goto out;

	nfds = epoll_wait(efd[2], &e, 1, 0);
	if (nfds != 1)
		goto out;

	close(efd[2]);
	close(efd[1]);
	close(efd[0]);
	close(sfd[0]);
	close(sfd[1]);

	printf("PASS\n");
	return 0;

out:
	printf("FAIL\n");
	return -1;
}

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: hev <r@hev.cc>
---
 fs/eventpoll.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d7f1f5011fac..a44cb27c636c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -672,6 +672,7 @@ static __poll_t
ep_scan_ready_list(struct eventpoll *ep,
 {
 	__poll_t res;
 	int pwake = 0;
+	int nwake = 0;
 	struct epitem *epi, *nepi;
 	LIST_HEAD(txlist);

@@ -685,6 +686,9 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	if (!ep_locked)
 		mutex_lock_nested(&ep->mtx, depth);

+	if (!depth || list_empty(&ep->rdllist))
+		nwake = 1;
+
 	/*
 	 * Steal the ready list, and re-init the original one to the
 	 * empty list. Also, set ep->ovflist to NULL so that events
@@ -739,7 +743,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	list_splice(&txlist, &ep->rdllist);
 	__pm_relax(ep->ws);

-	if (!list_empty(&ep->rdllist)) {
+	if (nwake && !list_empty(&ep->rdllist)) {
 	/*
 	 * Wake up (if active) both the eventpoll wait list and
 	 * the ->poll() wait list (delayed after we release the lock).
--
2.23.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-02  5:20 [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll hev
@ 2019-09-02 15:36 ` Roman Penyaev
  2019-09-03 21:08   ` Jason Baron
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Penyaev @ 2019-09-02 15:36 UTC (permalink / raw)
  To: hev
  Cc: linux-fsdevel, e, Al Viro, Andrew Morton, Davide Libenzi,
      Davidlohr Bueso, Dominik Brodowski, Jason Baron, Linus Torvalds,
      Sridhar Samudrala, linux-kernel

Hi,

This is indeed a bug. (Quick side note: could you please remove efd[1]
from your test, because it is not related to the reproduction of the
current bug.)

Your patch lacks a good description of what exactly you've fixed. Let
me speak out loud, and please correct me if I'm wrong; my understanding
of epoll internals has become a bit rusty: when epoll fds are nested,
an attempt to harvest events (the ep_scan_ready_list() call) produces a
second (repeated) event from an internal fd up to an external fd:

  epoll_wait(efd[0], ...):
    ep_send_events():
      ep_scan_ready_list(depth=0):
        ep_send_events_proc():
          ep_item_poll():
            ep_scan_ready_list(depth=1):
              ep_poll_safewake():
                ep_poll_callback()
                  list_add_tail(&epi->rdllink, &ep->rdllist);
                  ^^^^^^
                  repeated event

In your patch you forbid wakeup for the cases where depth != 0, i.e.
for all nested cases. That seems clear. But what if we go further and
remove the whole chunk, which seems excessive:

@@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,

-
-	if (!list_empty(&ep->rdllist)) {
-		/*
-		 * Wake up (if active) both the eventpoll wait list and
-		 * the ->poll() wait list (delayed after we release the lock).
-		 */
-		if (waitqueue_active(&ep->wq))
-			wake_up(&ep->wq);
-		if (waitqueue_active(&ep->poll_wait))
-			pwake++;
-	}
 	write_unlock_irq(&ep->lock);

 	if (!ep_locked)
 		mutex_unlock(&ep->mtx);

-	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);

My reasoning: by the time we've reached the point of scanning events
for readiness, all wakeups from ep_poll_callback() have already been
fired and new events have already been accounted in the ready list
(ep_poll_callback() calls the same ep_poll_safewake()). Here, frankly,
I'm not 100% sure and am probably missing some corner cases.

Thoughts?

PS. You call list_empty(&ep->rdllist) without ep->lock taken; that is
fine, but you should be _careful_ about it, so use the
list_empty_careful(&ep->rdllist) call instead.

--
Roman

On 2019-09-02 07:20, hev wrote:
> From: Heiher <r@hev.cc>
>
> [...]
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-02 15:36 ` Roman Penyaev
@ 2019-09-03 21:08   ` Jason Baron
  2019-09-04  9:57     ` Roman Penyaev
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Baron @ 2019-09-03 21:08 UTC (permalink / raw)
  To: Roman Penyaev, hev
  Cc: linux-fsdevel, e, Al Viro, Andrew Morton, Davide Libenzi,
      Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
      Sridhar Samudrala, linux-kernel

On 9/2/19 11:36 AM, Roman Penyaev wrote:
> Hi,
>
> This is indeed a bug. [...]
>
> In your patch you forbid wakeup for the cases where depth != 0, i.e.
> for all nested cases. That seems clear. But what if we go further and
> remove the whole chunk, which seems excessive:
>
> [diff removing the wakeup chunk from ep_scan_ready_list(), see above]
>
> My reasoning: by the time we've reached the point of scanning events
> for readiness, all wakeups from ep_poll_callback() have already been
> fired and new events have already been accounted in the ready list
> (ep_poll_callback() calls the same ep_poll_safewake()). Here, frankly,
> I'm not 100% sure and am probably missing some corner cases.
>
> Thoughts?

So the 'wake_up(&ep->wq);' part, I think, is about waking up other
threads that may be waiting in epoll_wait(). For example, there may be
multiple threads doing epoll_wait() on the same epoll fd, and the logic
above seems to say thread 1 may have processed say N events and is now
going to go off to work on those, so let's wake up thread 2 now to
handle the next chunk. So I think removing all that even for the depth 0
case is going to change some behavior here. So perhaps it should be
removed for all depths except 0? And if so, it may be better to make 2
patches here to separate these changes.

For the nested wakeups, I agree that the extra wakeups seem unnecessary
and it may make sense to remove them for all depths. I don't think the
nested epoll semantics are particularly well spelled out, and afaict,
nested epoll has behaved this way for quite some time. And the current
behavior is not bad in the way that a missing wakeup or false negative
would be. It would be good to better understand the use-case here and
to try to spell out the nested semantics more clearly.

Thanks,

-Jason
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-03 21:08   ` Jason Baron
@ 2019-09-04  9:57     ` Roman Penyaev
  2019-09-04 12:02       ` Jason Baron
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Penyaev @ 2019-09-04  9:57 UTC (permalink / raw)
  To: Jason Baron
  Cc: hev, linux-fsdevel, e, Al Viro, Andrew Morton, Davide Libenzi,
      Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
      Sridhar Samudrala, linux-kernel

On 2019-09-03 23:08, Jason Baron wrote:
> So the 'wake_up(&ep->wq);' part, I think, is about waking up other
> threads that may be waiting in epoll_wait(). For example, there may be
> multiple threads doing epoll_wait() on the same epoll fd, and the
> logic above seems to say thread 1 may have processed say N events and
> is now going to go off to work on those, so let's wake up thread 2 now
> to handle the next chunk.

Not quite. The thread which calls ep_scan_ready_list() processes all
the events, and while processing those, removes them one by one from
the ready list. But if the event mask is !0 and the event belongs to a
level-triggered descriptor (let's say default mode), it tails the event
back onto the list (because we are in level mode, so the event should
stay there). So at the end of this traversing loop the ready list is
likely not empty, and if so, wakeup is called again for nested epoll
fds. But those nested epoll fds should have already gotten all the
notifications from the main event callback ep_poll_callback(),
regardless of any thread which traverses events.

I suppose this logic has existed for decades, since Davide (the author)
was reshuffling the code here and there.

But I am not confident enough to state that this extra wakeup is bogus;
I just have a gut feeling that it looks excessive.

> So I think removing all that even for the depth 0 case is going to
> change some behavior here. So perhaps it should be removed for all
> depths except 0? And if so, it may be better to make 2 patches here to
> separate these changes.
>
> For the nested wakeups, I agree that the extra wakeups seem
> unnecessary and it may make sense to remove them for all depths. I
> don't think the nested epoll semantics are particularly well spelled
> out, and afaict, nested epoll has behaved this way for quite some
> time. And the current behavior is not bad in the way that a missing
> wakeup or false negative would be.

That's 100% true! For edge mode the extra wakeup is not a bug; not
optimal for userspace, yes, but it can't lead to any lost wakeups.

--
Roman

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-04  9:57     ` Roman Penyaev
@ 2019-09-04 12:02       ` Jason Baron
  2019-09-04 14:02         ` Heiher
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Baron @ 2019-09-04 12:02 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: hev, linux-fsdevel, e, Al Viro, Andrew Morton, Davide Libenzi,
      Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
      Sridhar Samudrala, linux-kernel

On 9/4/19 5:57 AM, Roman Penyaev wrote:
> Not quite. The thread which calls ep_scan_ready_list() processes all
> the events, and while processing those, removes them one by one from
> the ready list. [...]
>
> But I am not confident enough to state that this extra wakeup is
> bogus; I just have a gut feeling that it looks excessive.

Note that I was talking about the wakeup done on ep->wq, not
ep->poll_wait. The path that I'm concerned about is: let's say that
there are N events queued on the ready list. A thread that was woken up
in epoll_wait() may decide to only process, say, N/2 of them. Then it
will call wakeup on ep->wq, and this will wake up another thread to
process the remaining N/2. Without the wakeup, the original thread
isn't going to process those events until it finishes with the original
N/2 and gets back to epoll_wait(). So I'm not sure how important that
path is, but I wanted to at least note that the change here would
impact that behavior.

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-04 12:02       ` Jason Baron
@ 2019-09-04 14:02         ` Heiher
  2019-09-05  2:53           ` Heiher
  0 siblings, 1 reply; 12+ messages in thread
From: Heiher @ 2019-09-04 14:02 UTC (permalink / raw)
  To: Jason Baron, linux-fsdevel, Roman Penyaev
  Cc: Eric Wong, Al Viro, Andrew Morton, Davide Libenzi, Davidlohr
      Bueso, Dominik Brodowski, Linus Torvalds, Sridhar Samudrala,
      linux-kernel

Hi,

On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
> Note that I was talking about the wakeup done on ep->wq, not
> ep->poll_wait. [...] So I'm not sure how important that path is, but I
> wanted to at least note that the change here would impact that
> behavior.
>
> > That's 100% true!
For edge mode extra wake up is not a bug, not optimal > > for userspace - yes, but that can't lead to any lost wakeups. > > > > -- > > Roman > > I tried to remove the whole chunk of code that Roman said, and it seems that there are no obvious problems with the two test programs below: Test case 1: t0 | e0 | e1 (et) | s0 (lt) When s0 is readable, the thread 0 can only read once event from e0. #include <stdio.h> #include <unistd.h> #include <sys/epoll.h> #include <sys/socket.h> int main(int argc, char *argv[]) { int sfd[2]; int efd[2]; int nfds; struct epoll_event e; if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0) goto out; efd[0] = epoll_create(1); if (efd[0] < 0) goto out; efd[1] = epoll_create(1); if (efd[1] < 0) goto out; e.events = EPOLLIN; if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0) goto out; e.events = EPOLLIN | EPOLLET; if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0) goto out; if (write(sfd[1], "w", 1) != 1) goto out; nfds = epoll_wait(efd[0], &e, 1, 0); if (nfds != 1) goto out; nfds = epoll_wait(efd[0], &e, 1, 0); if (nfds != 0) goto out; nfds = epoll_wait(efd[1], &e, 1, 0); if (nfds != 1) goto out; nfds = epoll_wait(efd[1], &e, 1, 0); if (nfds != 1) goto out; close(efd[1]); close(efd[0]); close(sfd[0]); close(sfd[1]); printf("PASS\n"); return 0; out: printf("FAIL\n"); return -1; } Test case 2: t0 t1 \ / e0 / \ (et) e1 e2 (et) | | (lt) s0 s2 (lt) When s0 and s2 are readable, both thread 0 and thread 1 can read an event from e0. 
#include <stdio.h> #include <unistd.h> #include <pthread.h> #include <sys/epoll.h> #include <sys/socket.h> static int efd[3]; static int sfd[4]; static int count; static void * thread_handler(void *data) { struct epoll_event e; if (epoll_wait(efd[0], &e, 1, -1) == 1) count++; return NULL; } static void * emit_handler(void *data) { usleep (100000); write(sfd[1], "w", 1); write(sfd[3], "w", 1); return NULL; } int main(int argc, char *argv[]) { struct epoll_event e; pthread_t tw, te; if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[0]) < 0) goto out; if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[2]) < 0) goto out; efd[0] = epoll_create(1); if (efd[0] < 0) goto out; efd[1] = epoll_create(1); if (efd[1] < 0) goto out; efd[2] = epoll_create(1); if (efd[2] < 0) goto out; e.events = EPOLLIN; if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0) goto out; e.events = EPOLLIN; if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[2], &e) < 0) goto out; e.events = EPOLLIN | EPOLLET; if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0) goto out; e.events = EPOLLIN | EPOLLET; if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0) goto out; if (pthread_create(&tw, NULL, thread_handler, NULL) < 0) goto out; if (pthread_create(&te, NULL, emit_handler, NULL) < 0) goto out; if (epoll_wait(efd[0], &e, 1, -1) == 1) count++; if (pthread_join(tw, NULL) < 0) goto out; if (count != 2) goto out; close(efd[0]); close(efd[1]); close(efd[2]); close(sfd[0]); close(sfd[1]); close(sfd[2]); close(sfd[3]); printf ("PASS\n"); return 0; out: printf("FAIL\n"); return -1; } t0: thread0 t1: thread1 e0: epoll0 (efd[0]) e1: epoll1 (efd[1]) e2: epoll2 (efd[2]) s0: socket0 (sfd[0]) s2: socket2 (sfd[2]) Is it possible to prove that this modification is correct, or any other corner cases are missing? -- Best regards! Hev https://hev.cc ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll 2019-09-04 14:02 ` Heiher @ 2019-09-05 2:53 ` Heiher 2019-09-05 9:56 ` Heiher 0 siblings, 1 reply; 12+ messages in thread From: Heiher @ 2019-09-05 2:53 UTC (permalink / raw) To: Jason Baron, linux-fsdevel, Roman Penyaev Cc: Eric Wong, Al Viro, Andrew Morton, Davide Libenzi, Davidlohr Bueso, Dominik Brodowski, Linus Torvalds, Sridhar Samudrala, linux-kernel Hi, I created an epoll wakeup test project, listed some possible cases, and any other corner cases needs to be added? https://github.com/heiher/epoll-wakeup/blob/master/README.md On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote: > > Hi, > > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote: > > > > > > > > On 9/4/19 5:57 AM, Roman Penyaev wrote: > > > On 2019-09-03 23:08, Jason Baron wrote: > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote: > > >>> Hi, > > >>> > > >>> This is indeed a bug. (quick side note: could you please remove efd[1] > > >>> from your test, because it is not related to the reproduction of a > > >>> current bug). > > >>> > > >>> Your patch lacks a good description, what exactly you've fixed. Let > > >>> me speak out loud and please correct me if I'm wrong, my understanding > > >>> of epoll internals has become a bit rusty: when epoll fds are nested > > >>> an attempt to harvest events (ep_scan_ready_list() call) produces a > > >>> second (repeated) event from an internal fd up to an external fd: > > >>> > > >>> epoll_wait(efd[0], ...): > > >>> ep_send_events(): > > >>> ep_scan_ready_list(depth=0): > > >>> ep_send_events_proc(): > > >>> ep_item_poll(): > > >>> ep_scan_ready_list(depth=1): > > >>> ep_poll_safewake(): > > >>> ep_poll_callback() > > >>> list_add_tail(&epi, &epi->rdllist); > > >>> ^^^^^^ > > >>> repeated event > > >>> > > >>> > > >>> In your patch you forbid wakeup for the cases, where depth != 0, i.e. > > >>> for all nested cases. That seems clear. 
But what if we can go further > > >>> and remove the whole chunk, which seems excessive: > > >>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct > > >>> eventpoll *ep, > > >>> > > >>> - > > >>> - if (!list_empty(&ep->rdllist)) { > > >>> - /* > > >>> - * Wake up (if active) both the eventpoll wait list and > > >>> - * the ->poll() wait list (delayed after we release the > > >>> lock). > > >>> - */ > > >>> - if (waitqueue_active(&ep->wq)) > > >>> - wake_up(&ep->wq); > > >>> - if (waitqueue_active(&ep->poll_wait)) > > >>> - pwake++; > > >>> - } > > >>> write_unlock_irq(&ep->lock); > > >>> > > >>> if (!ep_locked) > > >>> mutex_unlock(&ep->mtx); > > >>> > > >>> - /* We have to call this outside the lock */ > > >>> - if (pwake) > > >>> - ep_poll_safewake(&ep->poll_wait); > > >>> > > >>> > > >>> I reason like that: by the time we've reached the point of scanning events > > >>> for readiness all wakeups from ep_poll_callback have been already fired and > > >>> new events have been already accounted in ready list (ep_poll_callback() > > >>> calls > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and probably > > >>> missing some corner cases. > > >>> > > >>> Thoughts? > > >> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up other > > >> threads that may be in waiting in epoll_wait(). For example, there may > > >> be multiple threads doing epoll_wait() on the same epoll fd, and the > > >> logic above seems to say thread 1 may have processed say N events and > > >> now its going to to go off to work those, so let's wake up thread 2 now > > >> to handle the next chunk. > > > > > > Not quite. Thread which calls ep_scan_ready_list() processes all the > > > events, and while processing those, removes them one by one from the > > > ready list. 
But if event mask is !0 and event belongs to > > > Level Triggered Mode descriptor (let's say default mode) it tails event > > > again back to the list (because we are in level mode, so event should > > > be there). So at the end of this traversing loop ready list is likely > > > not empty, and if so, wake up again is called for nested epoll fds. > > > But, those nested epoll fds should get already all the notifications > > > from the main event callback ep_poll_callback(), regardless any thread > > > which traverses events. > > > > > > I suppose this logic exists for decades, when Davide (the author) was > > > reshuffling the code here and there. > > > > > > But I do not feel confidence to state that this extra wakeup is bogus, > > > I just have a gut feeling that it looks excessive. > > > > Note that I was talking about the wakeup done on ep->wq not ep->poll_wait. > > The path that I'm concerned about is let's say that there are N events > > queued on the ready list. A thread that was woken up in epoll_wait may > > decide to only process say N/2 of then. Then it will call wakeup on ep->wq > > and this will wakeup another thread to process the remaining N/2. Without > > the wakeup, the original thread isn't going to process the events until > > it finishes with the original N/2 and gets back to epoll_wait(). So I'm not > > sure how important that path is but I wanted to at least note the change > > here would impact that behavior. > > > > Thanks, > > > > -Jason > > > > > > > > > >> So I think removing all that even for the > > >> depth 0 case is going to change some behavior here. So perhaps, it > > >> should be removed for all depths except for 0? And if so, it may be > > >> better to make 2 patches here to separate these changes. > > >> > > >> For the nested wakeups, I agree that the extra wakeups seem unnecessary > > >> and it may make sense to remove them for all depths. 
I don't think the > > >> nested epoll semantics are particularly well spelled out, and afaict, > > >> nested epoll() has behaved this way for quite some time. And the current > > >> behavior is not bad in the way that a missing wakeup or false negative > > >> would be. > > > > > > That's 100% true! For edge mode extra wake up is not a bug, not optimal > > > for userspace - yes, but that can't lead to any lost wakeups. > > > > > > -- > > > Roman > > > > > I tried to remove the whole chunk of code that Roman said, and it > seems that there > are no obvious problems with the two test programs below: > > Test case 1: > t0 > | > e0 > | > e1 (et) > | > s0 (lt) > > When s0 is readable, the thread 0 can only read once event from e0. > > #include <stdio.h> > #include <unistd.h> > #include <sys/epoll.h> > #include <sys/socket.h> > > int main(int argc, char *argv[]) > { > int sfd[2]; > int efd[2]; > int nfds; > struct epoll_event e; > > if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0) > goto out; > > efd[0] = epoll_create(1); > if (efd[0] < 0) > goto out; > > efd[1] = epoll_create(1); > if (efd[1] < 0) > goto out; > > e.events = EPOLLIN; > if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0) > goto out; > > e.events = EPOLLIN | EPOLLET; > if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0) > goto out; > > if (write(sfd[1], "w", 1) != 1) > goto out; > > nfds = epoll_wait(efd[0], &e, 1, 0); > if (nfds != 1) > goto out; > > nfds = epoll_wait(efd[0], &e, 1, 0); > if (nfds != 0) > goto out; > > nfds = epoll_wait(efd[1], &e, 1, 0); > if (nfds != 1) > goto out; > > nfds = epoll_wait(efd[1], &e, 1, 0); > if (nfds != 1) > goto out; > > close(efd[1]); > close(efd[0]); > close(sfd[0]); > close(sfd[1]); > > printf("PASS\n"); > return 0; > > out: > printf("FAIL\n"); > return -1; > } > > Test case 2: > t0 t1 > \ / > e0 > / \ > (et) e1 e2 (et) > | | > (lt) s0 s2 (lt) > > When s0 and s2 are readable, both thread 0 and thread 1 can read an > event from e0. 
> > #include <stdio.h> > #include <unistd.h> > #include <pthread.h> > #include <sys/epoll.h> > #include <sys/socket.h> > > static int efd[3]; > static int sfd[4]; > static int count; > > static void * > thread_handler(void *data) > { > struct epoll_event e; > > if (epoll_wait(efd[0], &e, 1, -1) == 1) > count++; > > return NULL; > } > > static void * > emit_handler(void *data) > { > usleep (100000); > > write(sfd[1], "w", 1); > write(sfd[3], "w", 1); > > return NULL; > } > > int main(int argc, char *argv[]) > { > struct epoll_event e; > pthread_t tw, te; > > if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[0]) < 0) > goto out; > > if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[2]) < 0) > goto out; > > efd[0] = epoll_create(1); > if (efd[0] < 0) > goto out; > > efd[1] = epoll_create(1); > if (efd[1] < 0) > goto out; > > efd[2] = epoll_create(1); > if (efd[2] < 0) > goto out; > > e.events = EPOLLIN; > if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0) > goto out; > > e.events = EPOLLIN; > if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[2], &e) < 0) > goto out; > > e.events = EPOLLIN | EPOLLET; > if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0) > goto out; > > e.events = EPOLLIN | EPOLLET; > if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0) > goto out; > > if (pthread_create(&tw, NULL, thread_handler, NULL) < 0) > goto out; > > if (pthread_create(&te, NULL, emit_handler, NULL) < 0) > goto out; > > if (epoll_wait(efd[0], &e, 1, -1) == 1) > count++; > > if (pthread_join(tw, NULL) < 0) > goto out; > > if (count != 2) > goto out; > > close(efd[0]); > close(efd[1]); > close(efd[2]); > close(sfd[0]); > close(sfd[1]); > close(sfd[2]); > close(sfd[3]); > > printf ("PASS\n"); > return 0; > > out: > printf("FAIL\n"); > return -1; > } > > t0: thread0 > t1: thread1 > e0: epoll0 (efd[0]) > e1: epoll1 (efd[1]) > e2: epoll2 (efd[2]) > s0: socket0 (sfd[0]) > s2: socket2 (sfd[2]) > > Is it possible to prove that this modification is correct, or any > other corner cases are missing? 
> > -- > Best regards! > Hev > https://hev.cc -- Best regards! Hev https://hev.cc ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll 2019-09-05 2:53 ` Heiher @ 2019-09-05 9:56 ` Heiher 2019-09-05 17:27 ` Roman Penyaev 0 siblings, 1 reply; 12+ messages in thread From: Heiher @ 2019-09-05 9:56 UTC (permalink / raw) To: Jason Baron, linux-fsdevel, Roman Penyaev Cc: Eric Wong, Al Viro, Andrew Morton, Davide Libenzi, Davidlohr Bueso, Dominik Brodowski, Linus Torvalds, Sridhar Samudrala, linux-kernel Hi, On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote: > > Hi, > > I created an epoll wakeup test project, listed some possible cases, > and any other corner cases needs to be added? > > https://github.com/heiher/epoll-wakeup/blob/master/README.md > > On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote: > > > > Hi, > > > > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote: > > > > > > > > > > > > On 9/4/19 5:57 AM, Roman Penyaev wrote: > > > > On 2019-09-03 23:08, Jason Baron wrote: > > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote: > > > >>> Hi, > > > >>> > > > >>> This is indeed a bug. (quick side note: could you please remove efd[1] > > > >>> from your test, because it is not related to the reproduction of a > > > >>> current bug). > > > >>> > > > >>> Your patch lacks a good description, what exactly you've fixed. 
Let > > > >>> me speak out loud and please correct me if I'm wrong, my understanding > > > >>> of epoll internals has become a bit rusty: when epoll fds are nested > > > >>> an attempt to harvest events (ep_scan_ready_list() call) produces a > > > >>> second (repeated) event from an internal fd up to an external fd: > > > >>> > > > >>> epoll_wait(efd[0], ...): > > > >>> ep_send_events(): > > > >>> ep_scan_ready_list(depth=0): > > > >>> ep_send_events_proc(): > > > >>> ep_item_poll(): > > > >>> ep_scan_ready_list(depth=1): > > > >>> ep_poll_safewake(): > > > >>> ep_poll_callback() > > > >>> list_add_tail(&epi, &epi->rdllist); > > > >>> ^^^^^^ > > > >>> repeated event > > > >>> > > > >>> > > > >>> In your patch you forbid wakeup for the cases, where depth != 0, i.e. > > > >>> for all nested cases. That seems clear. But what if we can go further > > > >>> and remove the whole chunk, which seems excessive: > > > >>> > > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct > > > >>> eventpoll *ep, > > > >>> > > > >>> - > > > >>> - if (!list_empty(&ep->rdllist)) { > > > >>> - /* > > > >>> - * Wake up (if active) both the eventpoll wait list and > > > >>> - * the ->poll() wait list (delayed after we release the > > > >>> lock). 
> > > >>> - */ > > > >>> - if (waitqueue_active(&ep->wq)) > > > >>> - wake_up(&ep->wq); > > > >>> - if (waitqueue_active(&ep->poll_wait)) > > > >>> - pwake++; > > > >>> - } > > > >>> write_unlock_irq(&ep->lock); > > > >>> > > > >>> if (!ep_locked) > > > >>> mutex_unlock(&ep->mtx); > > > >>> > > > >>> - /* We have to call this outside the lock */ > > > >>> - if (pwake) > > > >>> - ep_poll_safewake(&ep->poll_wait); > > > >>> > > > >>> > > > >>> I reason like that: by the time we've reached the point of scanning events > > > >>> for readiness all wakeups from ep_poll_callback have been already fired and > > > >>> new events have been already accounted in ready list (ep_poll_callback() > > > >>> calls > > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and probably > > > >>> missing some corner cases. > > > >>> > > > >>> Thoughts? > > > >> > > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up other > > > >> threads that may be in waiting in epoll_wait(). For example, there may > > > >> be multiple threads doing epoll_wait() on the same epoll fd, and the > > > >> logic above seems to say thread 1 may have processed say N events and > > > >> now its going to to go off to work those, so let's wake up thread 2 now > > > >> to handle the next chunk. > > > > > > > > Not quite. Thread which calls ep_scan_ready_list() processes all the > > > > events, and while processing those, removes them one by one from the > > > > ready list. But if event mask is !0 and event belongs to > > > > Level Triggered Mode descriptor (let's say default mode) it tails event > > > > again back to the list (because we are in level mode, so event should > > > > be there). So at the end of this traversing loop ready list is likely > > > > not empty, and if so, wake up again is called for nested epoll fds. 
> > > > But, those nested epoll fds should get already all the notifications > > > > from the main event callback ep_poll_callback(), regardless any thread > > > > which traverses events. > > > > > > > > I suppose this logic exists for decades, when Davide (the author) was > > > > reshuffling the code here and there. > > > > > > > > But I do not feel confidence to state that this extra wakeup is bogus, > > > > I just have a gut feeling that it looks excessive. > > > > > > Note that I was talking about the wakeup done on ep->wq not ep->poll_wait. > > > The path that I'm concerned about is let's say that there are N events > > > queued on the ready list. A thread that was woken up in epoll_wait may > > > decide to only process say N/2 of then. Then it will call wakeup on ep->wq > > > and this will wakeup another thread to process the remaining N/2. Without > > > the wakeup, the original thread isn't going to process the events until > > > it finishes with the original N/2 and gets back to epoll_wait(). So I'm not > > > sure how important that path is but I wanted to at least note the change > > > here would impact that behavior. > > > > > > Thanks, > > > > > > -Jason > > > > > > > > > > > > > >> So I think removing all that even for the > > > >> depth 0 case is going to change some behavior here. So perhaps, it > > > >> should be removed for all depths except for 0? And if so, it may be > > > >> better to make 2 patches here to separate these changes. > > > >> > > > >> For the nested wakeups, I agree that the extra wakeups seem unnecessary > > > >> and it may make sense to remove them for all depths. I don't think the > > > >> nested epoll semantics are particularly well spelled out, and afaict, > > > >> nested epoll() has behaved this way for quite some time. And the current > > > >> behavior is not bad in the way that a missing wakeup or false negative > > > >> would be. > > > > > > > > That's 100% true! 
For edge mode extra wake up is not a bug, not optimal > > > > for userspace - yes, but that can't lead to any lost wakeups. > > > > > > > > -- > > > > Roman > > > > > > > > I tried to remove the whole chunk of code that Roman said, and it > > seems that there > > are no obvious problems with the two test programs below: I recall this message, the test case 9/25/26 of epoll-wakeup (on github) are failed while the whole chunk are removed. Apply the original patch, all tests passed. > > > > Test case 1: > > t0 > > | > > e0 > > | > > e1 (et) > > | > > s0 (lt) > > > > When s0 is readable, the thread 0 can only read once event from e0. > > > > #include <stdio.h> > > #include <unistd.h> > > #include <sys/epoll.h> > > #include <sys/socket.h> > > > > int main(int argc, char *argv[]) > > { > > int sfd[2]; > > int efd[2]; > > int nfds; > > struct epoll_event e; > > > > if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0) > > goto out; > > > > efd[0] = epoll_create(1); > > if (efd[0] < 0) > > goto out; > > > > efd[1] = epoll_create(1); > > if (efd[1] < 0) > > goto out; > > > > e.events = EPOLLIN; > > if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0) > > goto out; > > > > e.events = EPOLLIN | EPOLLET; > > if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0) > > goto out; > > > > if (write(sfd[1], "w", 1) != 1) > > goto out; > > > > nfds = epoll_wait(efd[0], &e, 1, 0); > > if (nfds != 1) > > goto out; > > > > nfds = epoll_wait(efd[0], &e, 1, 0); > > if (nfds != 0) > > goto out; > > > > nfds = epoll_wait(efd[1], &e, 1, 0); > > if (nfds != 1) > > goto out; > > > > nfds = epoll_wait(efd[1], &e, 1, 0); > > if (nfds != 1) > > goto out; > > > > close(efd[1]); > > close(efd[0]); > > close(sfd[0]); > > close(sfd[1]); > > > > printf("PASS\n"); > > return 0; > > > > out: > > printf("FAIL\n"); > > return -1; > > } > > > > Test case 2: > > t0 t1 > > \ / > > e0 > > / \ > > (et) e1 e2 (et) > > | | > > (lt) s0 s2 (lt) > > > > When s0 and s2 are readable, both thread 0 and thread 1 can read 
an > > event from e0. > > > > #include <stdio.h> > > #include <unistd.h> > > #include <pthread.h> > > #include <sys/epoll.h> > > #include <sys/socket.h> > > > > static int efd[3]; > > static int sfd[4]; > > static int count; > > > > static void * > > thread_handler(void *data) > > { > > struct epoll_event e; > > > > if (epoll_wait(efd[0], &e, 1, -1) == 1) > > count++; > > > > return NULL; > > } > > > > static void * > > emit_handler(void *data) > > { > > usleep (100000); > > > > write(sfd[1], "w", 1); > > write(sfd[3], "w", 1); > > > > return NULL; > > } > > > > int main(int argc, char *argv[]) > > { > > struct epoll_event e; > > pthread_t tw, te; > > > > if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[0]) < 0) > > goto out; > > > > if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[2]) < 0) > > goto out; > > > > efd[0] = epoll_create(1); > > if (efd[0] < 0) > > goto out; > > > > efd[1] = epoll_create(1); > > if (efd[1] < 0) > > goto out; > > > > efd[2] = epoll_create(1); > > if (efd[2] < 0) > > goto out; > > > > e.events = EPOLLIN; > > if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0) > > goto out; > > > > e.events = EPOLLIN; > > if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[2], &e) < 0) > > goto out; > > > > e.events = EPOLLIN | EPOLLET; > > if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0) > > goto out; > > > > e.events = EPOLLIN | EPOLLET; > > if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0) > > goto out; > > > > if (pthread_create(&tw, NULL, thread_handler, NULL) < 0) > > goto out; > > > > if (pthread_create(&te, NULL, emit_handler, NULL) < 0) > > goto out; > > > > if (epoll_wait(efd[0], &e, 1, -1) == 1) > > count++; > > > > if (pthread_join(tw, NULL) < 0) > > goto out; > > > > if (count != 2) > > goto out; > > > > close(efd[0]); > > close(efd[1]); > > close(efd[2]); > > close(sfd[0]); > > close(sfd[1]); > > close(sfd[2]); > > close(sfd[3]); > > > > printf ("PASS\n"); > > return 0; > > > > out: > > printf("FAIL\n"); > > return -1; > > } > > > > t0: thread0 > > 
t1: thread1 > > e0: epoll0 (efd[0]) > > e1: epoll1 (efd[1]) > > e2: epoll2 (efd[2]) > > s0: socket0 (sfd[0]) > > s2: socket2 (sfd[2]) > > > > Is it possible to prove that this modification is correct, or any > > other corner cases are missing? > > > > -- > > Best regards! > > Hev > > https://hev.cc > > > > -- > Best regards! > Hev > https://hev.cc -- Best regards! Hev https://hev.cc ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll 2019-09-05 9:56 ` Heiher @ 2019-09-05 17:27 ` Roman Penyaev 2019-09-05 17:44 ` Jason Baron 0 siblings, 1 reply; 12+ messages in thread From: Roman Penyaev @ 2019-09-05 17:27 UTC (permalink / raw) To: Heiher Cc: Jason Baron, linux-fsdevel, Eric Wong, Al Viro, Andrew Morton, Davide Libenzi, Davidlohr Bueso, Dominik Brodowski, Linus Torvalds, Sridhar Samudrala, linux-kernel On 2019-09-05 11:56, Heiher wrote: > Hi, > > On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote: >> >> Hi, >> >> I created an epoll wakeup test project, listed some possible cases, >> and any other corner cases needs to be added? >> >> https://github.com/heiher/epoll-wakeup/blob/master/README.md >> >> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote: >> > >> > Hi, >> > >> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote: >> > > >> > > >> > > >> > > On 9/4/19 5:57 AM, Roman Penyaev wrote: >> > > > On 2019-09-03 23:08, Jason Baron wrote: >> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote: >> > > >>> Hi, >> > > >>> >> > > >>> This is indeed a bug. (quick side note: could you please remove efd[1] >> > > >>> from your test, because it is not related to the reproduction of a >> > > >>> current bug). >> > > >>> >> > > >>> Your patch lacks a good description, what exactly you've fixed. 
Let >> > > >>> me speak out loud and please correct me if I'm wrong, my understanding >> > > >>> of epoll internals has become a bit rusty: when epoll fds are nested >> > > >>> an attempt to harvest events (ep_scan_ready_list() call) produces a >> > > >>> second (repeated) event from an internal fd up to an external fd: >> > > >>> >> > > >>> epoll_wait(efd[0], ...): >> > > >>> ep_send_events(): >> > > >>> ep_scan_ready_list(depth=0): >> > > >>> ep_send_events_proc(): >> > > >>> ep_item_poll(): >> > > >>> ep_scan_ready_list(depth=1): >> > > >>> ep_poll_safewake(): >> > > >>> ep_poll_callback() >> > > >>> list_add_tail(&epi, &epi->rdllist); >> > > >>> ^^^^^^ >> > > >>> repeated event >> > > >>> >> > > >>> >> > > >>> In your patch you forbid wakeup for the cases, where depth != 0, i.e. >> > > >>> for all nested cases. That seems clear. But what if we can go further >> > > >>> and remove the whole chunk, which seems excessive: >> > > >>> >> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct >> > > >>> eventpoll *ep, >> > > >>> >> > > >>> - >> > > >>> - if (!list_empty(&ep->rdllist)) { >> > > >>> - /* >> > > >>> - * Wake up (if active) both the eventpoll wait list and >> > > >>> - * the ->poll() wait list (delayed after we release the >> > > >>> lock). 
>> > > >>> - */ >> > > >>> - if (waitqueue_active(&ep->wq)) >> > > >>> - wake_up(&ep->wq); >> > > >>> - if (waitqueue_active(&ep->poll_wait)) >> > > >>> - pwake++; >> > > >>> - } >> > > >>> write_unlock_irq(&ep->lock); >> > > >>> >> > > >>> if (!ep_locked) >> > > >>> mutex_unlock(&ep->mtx); >> > > >>> >> > > >>> - /* We have to call this outside the lock */ >> > > >>> - if (pwake) >> > > >>> - ep_poll_safewake(&ep->poll_wait); >> > > >>> >> > > >>> >> > > >>> I reason like that: by the time we've reached the point of scanning events >> > > >>> for readiness all wakeups from ep_poll_callback have been already fired and >> > > >>> new events have been already accounted in ready list (ep_poll_callback() >> > > >>> calls >> > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and probably >> > > >>> missing some corner cases. >> > > >>> >> > > >>> Thoughts? >> > > >> >> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up other >> > > >> threads that may be in waiting in epoll_wait(). For example, there may >> > > >> be multiple threads doing epoll_wait() on the same epoll fd, and the >> > > >> logic above seems to say thread 1 may have processed say N events and >> > > >> now its going to to go off to work those, so let's wake up thread 2 now >> > > >> to handle the next chunk. >> > > > >> > > > Not quite. Thread which calls ep_scan_ready_list() processes all the >> > > > events, and while processing those, removes them one by one from the >> > > > ready list. But if event mask is !0 and event belongs to >> > > > Level Triggered Mode descriptor (let's say default mode) it tails event >> > > > again back to the list (because we are in level mode, so event should >> > > > be there). So at the end of this traversing loop ready list is likely >> > > > not empty, and if so, wake up again is called for nested epoll fds. 
>> > > > But, those nested epoll fds should get already all the notifications >> > > > from the main event callback ep_poll_callback(), regardless any thread >> > > > which traverses events. >> > > > >> > > > I suppose this logic exists for decades, when Davide (the author) was >> > > > reshuffling the code here and there. >> > > > >> > > > But I do not feel confidence to state that this extra wakeup is bogus, >> > > > I just have a gut feeling that it looks excessive. >> > > >> > > Note that I was talking about the wakeup done on ep->wq not ep->poll_wait. >> > > The path that I'm concerned about is let's say that there are N events >> > > queued on the ready list. A thread that was woken up in epoll_wait may >> > > decide to only process say N/2 of then. Then it will call wakeup on ep->wq >> > > and this will wakeup another thread to process the remaining N/2. Without >> > > the wakeup, the original thread isn't going to process the events until >> > > it finishes with the original N/2 and gets back to epoll_wait(). So I'm not >> > > sure how important that path is but I wanted to at least note the change >> > > here would impact that behavior. >> > > >> > > Thanks, >> > > >> > > -Jason >> > > >> > > >> > > > >> > > >> So I think removing all that even for the >> > > >> depth 0 case is going to change some behavior here. So perhaps, it >> > > >> should be removed for all depths except for 0? And if so, it may be >> > > >> better to make 2 patches here to separate these changes. >> > > >> >> > > >> For the nested wakeups, I agree that the extra wakeups seem unnecessary >> > > >> and it may make sense to remove them for all depths. I don't think the >> > > >> nested epoll semantics are particularly well spelled out, and afaict, >> > > >> nested epoll() has behaved this way for quite some time. And the current >> > > >> behavior is not bad in the way that a missing wakeup or false negative >> > > >> would be. >> > > > >> > > > That's 100% true! 
For edge mode extra wake up is not a bug, not optimal
>> > > > for userspace - yes, but that can't lead to any lost wakeups.
>> > > >
>> > > > --
>> > > > Roman
>> > > >
>> >
>> > I tried to remove the whole chunk of code that Roman said, and it
>> > seems that there
>> > are no obvious problems with the two test programs below:
>
> I recall this message, the test case 9/25/26 of epoll-wakeup (on
> github) are failed while
> the whole chunk are removed.
>
> Apply the original patch, all tests passed.

These are failing on my bare 5.2.0-rc2:

TEST bin/epoll31 FAIL
TEST bin/epoll46 FAIL
TEST bin/epoll50 FAIL
TEST bin/epoll32 FAIL
TEST bin/epoll19 FAIL
TEST bin/epoll27 FAIL
TEST bin/epoll42 FAIL
TEST bin/epoll34 FAIL
TEST bin/epoll48 FAIL
TEST bin/epoll40 FAIL
TEST bin/epoll20 FAIL
TEST bin/epoll28 FAIL
TEST bin/epoll38 FAIL
TEST bin/epoll52 FAIL
TEST bin/epoll24 FAIL
TEST bin/epoll23 FAIL

These are failing if your patch is applied (my 5.2.0-rc2 is old? broken?):

TEST bin/epoll46 FAIL
TEST bin/epoll42 FAIL
TEST bin/epoll34 FAIL
TEST bin/epoll48 FAIL
TEST bin/epoll40 FAIL
TEST bin/epoll44 FAIL
TEST bin/epoll38 FAIL

These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called,
but wake_up(&ep->wq) is still invoked:

TEST bin/epoll46 FAIL
TEST bin/epoll42 FAIL
TEST bin/epoll34 FAIL
TEST bin/epoll40 FAIL
TEST bin/epoll44 FAIL
TEST bin/epoll38 FAIL

So at least 48 has been "fixed".

These are failing if the whole chunk is removed; as you said, 9, 25 and
26 are among those that do not pass:

TEST bin/epoll26 FAIL
TEST bin/epoll42 FAIL
TEST bin/epoll34 FAIL
TEST bin/epoll9 FAIL
TEST bin/epoll48 FAIL
TEST bin/epoll40 FAIL
TEST bin/epoll25 FAIL
TEST bin/epoll44 FAIL
TEST bin/epoll38 FAIL

This could be a good test suite; perhaps it can be added to kselftests?

--
Roman

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll 2019-09-05 17:27 ` Roman Penyaev @ 2019-09-05 17:44 ` Jason Baron 2019-09-11 8:19 ` Heiher 0 siblings, 1 reply; 12+ messages in thread From: Jason Baron @ 2019-09-05 17:44 UTC (permalink / raw) To: Roman Penyaev, Heiher Cc: linux-fsdevel, Eric Wong, Al Viro, Andrew Morton, Davide Libenzi, Davidlohr Bueso, Dominik Brodowski, Linus Torvalds, Sridhar Samudrala, linux-kernel On 9/5/19 1:27 PM, Roman Penyaev wrote: > On 2019-09-05 11:56, Heiher wrote: >> Hi, >> >> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote: >>> >>> Hi, >>> >>> I created an epoll wakeup test project, listed some possible cases, >>> and any other corner cases needs to be added? >>> >>> https://github.com/heiher/epoll-wakeup/blob/master/README.md >>> >>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote: >>> > >>> > Hi, >>> > >>> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote: >>> > > >>> > > >>> > > >>> > > On 9/4/19 5:57 AM, Roman Penyaev wrote: >>> > > > On 2019-09-03 23:08, Jason Baron wrote: >>> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote: >>> > > >>> Hi, >>> > > >>> >>> > > >>> This is indeed a bug. (quick side note: could you please >>> remove efd[1] >>> > > >>> from your test, because it is not related to the reproduction >>> of a >>> > > >>> current bug). >>> > > >>> >>> > > >>> Your patch lacks a good description, what exactly you've >>> fixed. 
Let >>> > > >>> me speak out loud and please correct me if I'm wrong, my >>> understanding >>> > > >>> of epoll internals has become a bit rusty: when epoll fds are >>> nested >>> > > >>> an attempt to harvest events (ep_scan_ready_list() call) >>> produces a >>> > > >>> second (repeated) event from an internal fd up to an external >>> fd: >>> > > >>> >>> > > >>> epoll_wait(efd[0], ...): >>> > > >>> ep_send_events(): >>> > > >>> ep_scan_ready_list(depth=0): >>> > > >>> ep_send_events_proc(): >>> > > >>> ep_item_poll(): >>> > > >>> ep_scan_ready_list(depth=1): >>> > > >>> ep_poll_safewake(): >>> > > >>> ep_poll_callback() >>> > > >>> list_add_tail(&epi, &epi->rdllist); >>> > > >>> ^^^^^^ >>> > > >>> repeated event >>> > > >>> >>> > > >>> >>> > > >>> In your patch you forbid wakeup for the cases, where depth != >>> 0, i.e. >>> > > >>> for all nested cases. That seems clear. But what if we can >>> go further >>> > > >>> and remove the whole chunk, which seems excessive: >>> > > >>> >>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct >>> > > >>> eventpoll *ep, >>> > > >>> >>> > > >>> - >>> > > >>> - if (!list_empty(&ep->rdllist)) { >>> > > >>> - /* >>> > > >>> - * Wake up (if active) both the eventpoll >>> wait list and >>> > > >>> - * the ->poll() wait list (delayed after we >>> release the >>> > > >>> lock). 
>>> > > >>> - */ >>> > > >>> - if (waitqueue_active(&ep->wq)) >>> > > >>> - wake_up(&ep->wq); >>> > > >>> - if (waitqueue_active(&ep->poll_wait)) >>> > > >>> - pwake++; >>> > > >>> - } >>> > > >>> write_unlock_irq(&ep->lock); >>> > > >>> >>> > > >>> if (!ep_locked) >>> > > >>> mutex_unlock(&ep->mtx); >>> > > >>> >>> > > >>> - /* We have to call this outside the lock */ >>> > > >>> - if (pwake) >>> > > >>> - ep_poll_safewake(&ep->poll_wait); >>> > > >>> >>> > > >>> >>> > > >>> I reason like that: by the time we've reached the point of >>> scanning events >>> > > >>> for readiness all wakeups from ep_poll_callback have been >>> already fired and >>> > > >>> new events have been already accounted in ready list >>> (ep_poll_callback() >>> > > >>> calls >>> > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% >>> sure and probably >>> > > >>> missing some corner cases. >>> > > >>> >>> > > >>> Thoughts? >>> > > >> >>> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up >>> other >>> > > >> threads that may be in waiting in epoll_wait(). For example, >>> there may >>> > > >> be multiple threads doing epoll_wait() on the same epoll fd, >>> and the >>> > > >> logic above seems to say thread 1 may have processed say N >>> events and >>> > > >> now its going to to go off to work those, so let's wake up >>> thread 2 now >>> > > >> to handle the next chunk. >>> > > > >>> > > > Not quite. Thread which calls ep_scan_ready_list() processes >>> all the >>> > > > events, and while processing those, removes them one by one >>> from the >>> > > > ready list. But if event mask is !0 and event belongs to >>> > > > Level Triggered Mode descriptor (let's say default mode) it >>> tails event >>> > > > again back to the list (because we are in level mode, so event >>> should >>> > > > be there). So at the end of this traversing loop ready list is >>> likely >>> > > > not empty, and if so, wake up again is called for nested epoll >>> fds. 
>>> > > > But, those nested epoll fds should get already all the >>> notifications >>> > > > from the main event callback ep_poll_callback(), regardless any >>> thread >>> > > > which traverses events. >>> > > > >>> > > > I suppose this logic exists for decades, when Davide (the >>> author) was >>> > > > reshuffling the code here and there. >>> > > > >>> > > > But I do not feel confidence to state that this extra wakeup is >>> bogus, >>> > > > I just have a gut feeling that it looks excessive. >>> > > >>> > > Note that I was talking about the wakeup done on ep->wq not >>> ep->poll_wait. >>> > > The path that I'm concerned about is let's say that there are N >>> events >>> > > queued on the ready list. A thread that was woken up in >>> epoll_wait may >>> > > decide to only process say N/2 of then. Then it will call wakeup >>> on ep->wq >>> > > and this will wakeup another thread to process the remaining N/2. >>> Without >>> > > the wakeup, the original thread isn't going to process the events >>> until >>> > > it finishes with the original N/2 and gets back to epoll_wait(). >>> So I'm not >>> > > sure how important that path is but I wanted to at least note the >>> change >>> > > here would impact that behavior. >>> > > >>> > > Thanks, >>> > > >>> > > -Jason >>> > > >>> > > >>> > > > >>> > > >> So I think removing all that even for the >>> > > >> depth 0 case is going to change some behavior here. So >>> perhaps, it >>> > > >> should be removed for all depths except for 0? And if so, it >>> may be >>> > > >> better to make 2 patches here to separate these changes. >>> > > >> >>> > > >> For the nested wakeups, I agree that the extra wakeups seem >>> unnecessary >>> > > >> and it may make sense to remove them for all depths. I don't >>> think the >>> > > >> nested epoll semantics are particularly well spelled out, and >>> afaict, >>> > > >> nested epoll() has behaved this way for quite some time. 
And >>> the current >>> > > >> behavior is not bad in the way that a missing wakeup or false >>> negative >>> > > >> would be. >>> > > > >>> > > > That's 100% true! For edge mode extra wake up is not a bug, not >>> optimal >>> > > > for userspace - yes, but that can't lead to any lost wakeups. >>> > > > >>> > > > -- >>> > > > Roman >>> > > > >>> > >>> > I tried to remove the whole chunk of code that Roman said, and it >>> > seems that there >>> > are no obvious problems with the two test programs below: >> >> I recall this message, the test case 9/25/26 of epoll-wakeup (on >> github) are failed while >> the whole chunk are removed. >> >> Apply the original patch, all tests passed. > > > These are failing on my bare 5.2.0-rc2 > > TEST bin/epoll31 FAIL > TEST bin/epoll46 FAIL > TEST bin/epoll50 FAIL > TEST bin/epoll32 FAIL > TEST bin/epoll19 FAIL > TEST bin/epoll27 FAIL > TEST bin/epoll42 FAIL > TEST bin/epoll34 FAIL > TEST bin/epoll48 FAIL > TEST bin/epoll40 FAIL > TEST bin/epoll20 FAIL > TEST bin/epoll28 FAIL > TEST bin/epoll38 FAIL > TEST bin/epoll52 FAIL > TEST bin/epoll24 FAIL > TEST bin/epoll23 FAIL > > > These are failing if your patch is applied: > (my 5.2.0-rc2 is old? broken?) > > TEST bin/epoll46 FAIL > TEST bin/epoll42 FAIL > TEST bin/epoll34 FAIL > TEST bin/epoll48 FAIL > TEST bin/epoll40 FAIL > TEST bin/epoll44 FAIL > TEST bin/epoll38 FAIL > > These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called, > but wakeup(&ep->wq); is still invoked: > > TEST bin/epoll46 FAIL > TEST bin/epoll42 FAIL > TEST bin/epoll34 FAIL > TEST bin/epoll40 FAIL > TEST bin/epoll44 FAIL > TEST bin/epoll38 FAIL > > So at least 48 has been "fixed". 
>
> These are failing if the whole chunk is removed, like your
> said 9,25,26 are among which do not pass:
>
> TEST bin/epoll26 FAIL
> TEST bin/epoll42 FAIL
> TEST bin/epoll34 FAIL
> TEST bin/epoll9 FAIL
> TEST bin/epoll48 FAIL
> TEST bin/epoll40 FAIL
> TEST bin/epoll25 FAIL
> TEST bin/epoll44 FAIL
> TEST bin/epoll38 FAIL
>
> This can be a good test suite, probably can be added to kselftests?
>
> --
> Roman
>

Indeed, I just tried the same test suite and I am seeing similar
failures - it looks like it's a bit timing-dependent. It looks like all
the failures are caused by a similar issue. For example, take epoll34:

      t0    t1
 (ew) |     | (ew)
      e0    |
 (lt)  \   /
        |
        e1
        | (et)
        s0

The test is trying to assert that an epoll_wait() on e1 and an
epoll_wait() on e0 both return 1 event for EPOLLIN. However, the
epoll_wait() on e1 is done in a thread and it can happen before or after
the epoll_wait() is called against e0. If the epoll_wait() on e1 happens
first, then because it's attached as 'et', it consumes the event, so
there is no longer an event reported at e0. I think that is reasonable
semantics here. However, if the wait on e0 happens before the wait on
e1, the test will pass, as both waits will see the event. Thus, a patch
like this will 'fix' this testcase:

--- a/src/epoll34.c
+++ b/src/epoll34.c
@@ -59,15 +59,15 @@ int main(int argc, char *argv[])
 	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
 		goto out;
 
-	if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
-		goto out;
-
 	if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
 		goto out;
 
 	if (epoll_wait(efd[0], &e, 1, 500) == 1)
 		count++;
 
+	if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
+		goto out;
+
 	if (pthread_join(tw, NULL) < 0)
 		goto out;

I found all the other failures to be of similar origin. I suspect Heiher
didn't see failures due to the thread timings here.

I also found that all the testcases pass if we leave the wakeup(&ep->wq)
call for the depth 0 case (and remove the pwake part).
Thanks,

-Jason
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll 2019-09-05 17:44 ` Jason Baron @ 2019-09-11 8:19 ` Heiher 2019-09-12 21:36 ` Jason Baron 0 siblings, 1 reply; 12+ messages in thread From: Heiher @ 2019-09-11 8:19 UTC (permalink / raw) To: Jason Baron Cc: Roman Penyaev, linux-fsdevel, Eric Wong, Al Viro, Andrew Morton, Davide Libenzi, Davidlohr Bueso, Dominik Brodowski, Linus Torvalds, Sridhar Samudrala, linux-kernel Hi, On Fri, Sep 6, 2019 at 1:48 AM Jason Baron <jbaron@akamai.com> wrote: > > > > On 9/5/19 1:27 PM, Roman Penyaev wrote: > > On 2019-09-05 11:56, Heiher wrote: > >> Hi, > >> > >> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote: > >>> > >>> Hi, > >>> > >>> I created an epoll wakeup test project, listed some possible cases, > >>> and any other corner cases needs to be added? > >>> > >>> https://github.com/heiher/epoll-wakeup/blob/master/README.md > >>> > >>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote: > >>> > > >>> > Hi, > >>> > > >>> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote: > >>> > > > >>> > > > >>> > > > >>> > > On 9/4/19 5:57 AM, Roman Penyaev wrote: > >>> > > > On 2019-09-03 23:08, Jason Baron wrote: > >>> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote: > >>> > > >>> Hi, > >>> > > >>> > >>> > > >>> This is indeed a bug. (quick side note: could you please > >>> remove efd[1] > >>> > > >>> from your test, because it is not related to the reproduction > >>> of a > >>> > > >>> current bug). > >>> > > >>> > >>> > > >>> Your patch lacks a good description, what exactly you've > >>> fixed. 
Let > >>> > > >>> me speak out loud and please correct me if I'm wrong, my > >>> understanding > >>> > > >>> of epoll internals has become a bit rusty: when epoll fds are > >>> nested > >>> > > >>> an attempt to harvest events (ep_scan_ready_list() call) > >>> produces a > >>> > > >>> second (repeated) event from an internal fd up to an external > >>> fd: > >>> > > >>> > >>> > > >>> epoll_wait(efd[0], ...): > >>> > > >>> ep_send_events(): > >>> > > >>> ep_scan_ready_list(depth=0): > >>> > > >>> ep_send_events_proc(): > >>> > > >>> ep_item_poll(): > >>> > > >>> ep_scan_ready_list(depth=1): > >>> > > >>> ep_poll_safewake(): > >>> > > >>> ep_poll_callback() > >>> > > >>> list_add_tail(&epi, &epi->rdllist); > >>> > > >>> ^^^^^^ > >>> > > >>> repeated event > >>> > > >>> > >>> > > >>> > >>> > > >>> In your patch you forbid wakeup for the cases, where depth != > >>> 0, i.e. > >>> > > >>> for all nested cases. That seems clear. But what if we can > >>> go further > >>> > > >>> and remove the whole chunk, which seems excessive: > >>> > > >>> > >>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct > >>> > > >>> eventpoll *ep, > >>> > > >>> > >>> > > >>> - > >>> > > >>> - if (!list_empty(&ep->rdllist)) { > >>> > > >>> - /* > >>> > > >>> - * Wake up (if active) both the eventpoll > >>> wait list and > >>> > > >>> - * the ->poll() wait list (delayed after we > >>> release the > >>> > > >>> lock). 
> >>> > > >>> - */ > >>> > > >>> - if (waitqueue_active(&ep->wq)) > >>> > > >>> - wake_up(&ep->wq); > >>> > > >>> - if (waitqueue_active(&ep->poll_wait)) > >>> > > >>> - pwake++; > >>> > > >>> - } > >>> > > >>> write_unlock_irq(&ep->lock); > >>> > > >>> > >>> > > >>> if (!ep_locked) > >>> > > >>> mutex_unlock(&ep->mtx); > >>> > > >>> > >>> > > >>> - /* We have to call this outside the lock */ > >>> > > >>> - if (pwake) > >>> > > >>> - ep_poll_safewake(&ep->poll_wait); > >>> > > >>> > >>> > > >>> > >>> > > >>> I reason like that: by the time we've reached the point of > >>> scanning events > >>> > > >>> for readiness all wakeups from ep_poll_callback have been > >>> already fired and > >>> > > >>> new events have been already accounted in ready list > >>> (ep_poll_callback() > >>> > > >>> calls > >>> > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% > >>> sure and probably > >>> > > >>> missing some corner cases. > >>> > > >>> > >>> > > >>> Thoughts? > >>> > > >> > >>> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up > >>> other > >>> > > >> threads that may be in waiting in epoll_wait(). For example, > >>> there may > >>> > > >> be multiple threads doing epoll_wait() on the same epoll fd, > >>> and the > >>> > > >> logic above seems to say thread 1 may have processed say N > >>> events and > >>> > > >> now its going to to go off to work those, so let's wake up > >>> thread 2 now > >>> > > >> to handle the next chunk. > >>> > > > > >>> > > > Not quite. Thread which calls ep_scan_ready_list() processes > >>> all the > >>> > > > events, and while processing those, removes them one by one > >>> from the > >>> > > > ready list. But if event mask is !0 and event belongs to > >>> > > > Level Triggered Mode descriptor (let's say default mode) it > >>> tails event > >>> > > > again back to the list (because we are in level mode, so event > >>> should > >>> > > > be there). 
So at the end of this traversing loop ready list is > >>> likely > >>> > > > not empty, and if so, wake up again is called for nested epoll > >>> fds. > >>> > > > But, those nested epoll fds should get already all the > >>> notifications > >>> > > > from the main event callback ep_poll_callback(), regardless any > >>> thread > >>> > > > which traverses events. > >>> > > > > >>> > > > I suppose this logic exists for decades, when Davide (the > >>> author) was > >>> > > > reshuffling the code here and there. > >>> > > > > >>> > > > But I do not feel confidence to state that this extra wakeup is > >>> bogus, > >>> > > > I just have a gut feeling that it looks excessive. > >>> > > > >>> > > Note that I was talking about the wakeup done on ep->wq not > >>> ep->poll_wait. > >>> > > The path that I'm concerned about is let's say that there are N > >>> events > >>> > > queued on the ready list. A thread that was woken up in > >>> epoll_wait may > >>> > > decide to only process say N/2 of then. Then it will call wakeup > >>> on ep->wq > >>> > > and this will wakeup another thread to process the remaining N/2. > >>> Without > >>> > > the wakeup, the original thread isn't going to process the events > >>> until > >>> > > it finishes with the original N/2 and gets back to epoll_wait(). > >>> So I'm not > >>> > > sure how important that path is but I wanted to at least note the > >>> change > >>> > > here would impact that behavior. > >>> > > > >>> > > Thanks, > >>> > > > >>> > > -Jason > >>> > > > >>> > > > >>> > > > > >>> > > >> So I think removing all that even for the > >>> > > >> depth 0 case is going to change some behavior here. So > >>> perhaps, it > >>> > > >> should be removed for all depths except for 0? And if so, it > >>> may be > >>> > > >> better to make 2 patches here to separate these changes. > >>> > > >> > >>> > > >> For the nested wakeups, I agree that the extra wakeups seem > >>> unnecessary > >>> > > >> and it may make sense to remove them for all depths. 
I don't > >>> think the > >>> > > >> nested epoll semantics are particularly well spelled out, and > >>> afaict, > >>> > > >> nested epoll() has behaved this way for quite some time. And > >>> the current > >>> > > >> behavior is not bad in the way that a missing wakeup or false > >>> negative > >>> > > >> would be. > >>> > > > > >>> > > > That's 100% true! For edge mode extra wake up is not a bug, not > >>> optimal > >>> > > > for userspace - yes, but that can't lead to any lost wakeups. > >>> > > > > >>> > > > -- > >>> > > > Roman > >>> > > > > >>> > > >>> > I tried to remove the whole chunk of code that Roman said, and it > >>> > seems that there > >>> > are no obvious problems with the two test programs below: > >> > >> I recall this message, the test case 9/25/26 of epoll-wakeup (on > >> github) are failed while > >> the whole chunk are removed. > >> > >> Apply the original patch, all tests passed. > > > > > > These are failing on my bare 5.2.0-rc2 > > > > TEST bin/epoll31 FAIL > > TEST bin/epoll46 FAIL > > TEST bin/epoll50 FAIL > > TEST bin/epoll32 FAIL > > TEST bin/epoll19 FAIL > > TEST bin/epoll27 FAIL > > TEST bin/epoll42 FAIL > > TEST bin/epoll34 FAIL > > TEST bin/epoll48 FAIL > > TEST bin/epoll40 FAIL > > TEST bin/epoll20 FAIL > > TEST bin/epoll28 FAIL > > TEST bin/epoll38 FAIL > > TEST bin/epoll52 FAIL > > TEST bin/epoll24 FAIL > > TEST bin/epoll23 FAIL > > > > > > These are failing if your patch is applied: > > (my 5.2.0-rc2 is old? broken?) > > > > TEST bin/epoll46 FAIL > > TEST bin/epoll42 FAIL > > TEST bin/epoll34 FAIL > > TEST bin/epoll48 FAIL > > TEST bin/epoll40 FAIL > > TEST bin/epoll44 FAIL > > TEST bin/epoll38 FAIL > > > > These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called, > > but wakeup(&ep->wq); is still invoked: > > > > TEST bin/epoll46 FAIL > > TEST bin/epoll42 FAIL > > TEST bin/epoll34 FAIL > > TEST bin/epoll40 FAIL > > TEST bin/epoll44 FAIL > > TEST bin/epoll38 FAIL > > > > So at least 48 has been "fixed". 
> > > > These are failing if the whole chunk is removed, like your > > said 9,25,26 are among which do not pass: > > > > TEST bin/epoll26 FAIL > > TEST bin/epoll42 FAIL > > TEST bin/epoll34 FAIL > > TEST bin/epoll9 FAIL > > TEST bin/epoll48 FAIL > > TEST bin/epoll40 FAIL > > TEST bin/epoll25 FAIL > > TEST bin/epoll44 FAIL > > TEST bin/epoll38 FAIL > > > > This can be a good test suite, probably can be added to kselftests? Thank you, I have updated epoll-tests to fix these issues. I think this is good news if we can added to kselftests. ;) > > > > -- > > Roman > > > > > Indeed, I just tried the same test suite and I am seeing similar > failures - it looks like its a bit timing dependent. It looks like all > the failures are caused by a similar issue. For example, take epoll34: > > t0 t1 > (ew) | | (ew) > e0 | > (lt) \ / > | > e1 > | (et) > s0 > > > The test is trying to assert that an epoll_wait() on e1 and and > epoll_wait() on e0 both return 1 event for EPOLLIN. However, the > epoll_wait on e1 is done in a thread and it can happen before or after > the epoll_wait() is called against e0. If the epoll_wait() on e1 happens > first then because its attached as 'et', it consumes the event. So that > there is no longer an event reported at e0. I think that is reasonable > semantics here. However if the wait on e0 happens after the wait on e1 > then the test will pass as both waits will see the event. 
Thus, a patch
> like this will 'fix' this testcase:
>
> --- a/src/epoll34.c
> +++ b/src/epoll34.c
> @@ -59,15 +59,15 @@ int main(int argc, char *argv[])
>  	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
>  		goto out;
>
> -	if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
> -		goto out;
> -
>  	if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
>  		goto out;
>
>  	if (epoll_wait(efd[0], &e, 1, 500) == 1)
>  		count++;
>
> +	if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
> +		goto out;
> +
>  	if (pthread_join(tw, NULL) < 0)
>  		goto out;
>
> I found all the other failures to be of similar origin. I suspect Heiher
> didn't see failures due to the thread timings here.

Thank you. I also found a multi-threaded concurrent accumulation problem,
and that has been changed to atomic operations. I think we should allow
the two different behaviors to both pass, because both are correct.

thread 2:
	if (epoll_wait(efd[1], &e, 1, 500) == 1)
		__sync_fetch_and_or(&count, 1);

thread 1:
	if (epoll_wait(efd[0], &e, 1, 500) == 1)
		__sync_fetch_and_or(&count, 2);

check:
	if ((count != 1) && (count != 3))
		goto out;

> I also found that all the testcases pass if we leave the wakeup(&ep->wq)
> call for the depth 0 case (and remove the pwake part).

So, we need to keep the wakeup(&ep->wq) for all depths, and only do the
wakeup(&ep->poll_wait) for depth 0 and/or when ep->rdllist goes from
empty to non-empty?

> Thanks,
>
> -Jason

--
Best regards!
Hev
https://hev.cc
* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll 2019-09-11 8:19 ` Heiher @ 2019-09-12 21:36 ` Jason Baron 0 siblings, 0 replies; 12+ messages in thread From: Jason Baron @ 2019-09-12 21:36 UTC (permalink / raw) To: Heiher Cc: Roman Penyaev, linux-fsdevel, Eric Wong, Al Viro, Andrew Morton, Davide Libenzi, Davidlohr Bueso, Dominik Brodowski, Linus Torvalds, Sridhar Samudrala, linux-kernel On 9/11/19 4:19 AM, Heiher wrote: > Hi, > > On Fri, Sep 6, 2019 at 1:48 AM Jason Baron <jbaron@akamai.com> wrote: >> >> >> >> On 9/5/19 1:27 PM, Roman Penyaev wrote: >>> On 2019-09-05 11:56, Heiher wrote: >>>> Hi, >>>> >>>> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote: >>>>> >>>>> Hi, >>>>> >>>>> I created an epoll wakeup test project, listed some possible cases, >>>>> and any other corner cases needs to be added? >>>>> >>>>> https://github.com/heiher/epoll-wakeup/blob/master/README.md >>>>> >>>>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 9/4/19 5:57 AM, Roman Penyaev wrote: >>>>>>>> On 2019-09-03 23:08, Jason Baron wrote: >>>>>>>>> On 9/2/19 11:36 AM, Roman Penyaev wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> This is indeed a bug. (quick side note: could you please >>>>> remove efd[1] >>>>>>>>>> from your test, because it is not related to the reproduction >>>>> of a >>>>>>>>>> current bug). >>>>>>>>>> >>>>>>>>>> Your patch lacks a good description, what exactly you've >>>>> fixed. 
Let >>>>>>>>>> me speak out loud and please correct me if I'm wrong, my >>>>> understanding >>>>>>>>>> of epoll internals has become a bit rusty: when epoll fds are >>>>> nested >>>>>>>>>> an attempt to harvest events (ep_scan_ready_list() call) >>>>> produces a >>>>>>>>>> second (repeated) event from an internal fd up to an external >>>>> fd: >>>>>>>>>> >>>>>>>>>> epoll_wait(efd[0], ...): >>>>>>>>>> ep_send_events(): >>>>>>>>>> ep_scan_ready_list(depth=0): >>>>>>>>>> ep_send_events_proc(): >>>>>>>>>> ep_item_poll(): >>>>>>>>>> ep_scan_ready_list(depth=1): >>>>>>>>>> ep_poll_safewake(): >>>>>>>>>> ep_poll_callback() >>>>>>>>>> list_add_tail(&epi, &epi->rdllist); >>>>>>>>>> ^^^^^^ >>>>>>>>>> repeated event >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> In your patch you forbid wakeup for the cases, where depth != >>>>> 0, i.e. >>>>>>>>>> for all nested cases. That seems clear. But what if we can >>>>> go further >>>>>>>>>> and remove the whole chunk, which seems excessive: >>>>>>>>>> >>>>>>>>>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct >>>>>>>>>> eventpoll *ep, >>>>>>>>>> >>>>>>>>>> - >>>>>>>>>> - if (!list_empty(&ep->rdllist)) { >>>>>>>>>> - /* >>>>>>>>>> - * Wake up (if active) both the eventpoll >>>>> wait list and >>>>>>>>>> - * the ->poll() wait list (delayed after we >>>>> release the >>>>>>>>>> lock). 
>>>>>>>>>> - */ >>>>>>>>>> - if (waitqueue_active(&ep->wq)) >>>>>>>>>> - wake_up(&ep->wq); >>>>>>>>>> - if (waitqueue_active(&ep->poll_wait)) >>>>>>>>>> - pwake++; >>>>>>>>>> - } >>>>>>>>>> write_unlock_irq(&ep->lock); >>>>>>>>>> >>>>>>>>>> if (!ep_locked) >>>>>>>>>> mutex_unlock(&ep->mtx); >>>>>>>>>> >>>>>>>>>> - /* We have to call this outside the lock */ >>>>>>>>>> - if (pwake) >>>>>>>>>> - ep_poll_safewake(&ep->poll_wait); >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I reason like that: by the time we've reached the point of >>>>> scanning events >>>>>>>>>> for readiness all wakeups from ep_poll_callback have been >>>>> already fired and >>>>>>>>>> new events have been already accounted in ready list >>>>> (ep_poll_callback() >>>>>>>>>> calls >>>>>>>>>> the same ep_poll_safewake()). Here, frankly, I'm not 100% >>>>> sure and probably >>>>>>>>>> missing some corner cases. >>>>>>>>>> >>>>>>>>>> Thoughts? >>>>>>>>> >>>>>>>>> So the: 'wake_up(&ep->wq);' part, I think is about waking up >>>>> other >>>>>>>>> threads that may be in waiting in epoll_wait(). For example, >>>>> there may >>>>>>>>> be multiple threads doing epoll_wait() on the same epoll fd, >>>>> and the >>>>>>>>> logic above seems to say thread 1 may have processed say N >>>>> events and >>>>>>>>> now its going to to go off to work those, so let's wake up >>>>> thread 2 now >>>>>>>>> to handle the next chunk. >>>>>>>> >>>>>>>> Not quite. Thread which calls ep_scan_ready_list() processes >>>>> all the >>>>>>>> events, and while processing those, removes them one by one >>>>> from the >>>>>>>> ready list. But if event mask is !0 and event belongs to >>>>>>>> Level Triggered Mode descriptor (let's say default mode) it >>>>> tails event >>>>>>>> again back to the list (because we are in level mode, so event >>>>> should >>>>>>>> be there). So at the end of this traversing loop ready list is >>>>> likely >>>>>>>> not empty, and if so, wake up again is called for nested epoll >>>>> fds. 
>>>>>>>> But, those nested epoll fds should get already all the >>>>> notifications >>>>>>>> from the main event callback ep_poll_callback(), regardless any >>>>> thread >>>>>>>> which traverses events. >>>>>>>> >>>>>>>> I suppose this logic exists for decades, when Davide (the >>>>> author) was >>>>>>>> reshuffling the code here and there. >>>>>>>> >>>>>>>> But I do not feel confidence to state that this extra wakeup is >>>>> bogus, >>>>>>>> I just have a gut feeling that it looks excessive. >>>>>>> >>>>>>> Note that I was talking about the wakeup done on ep->wq not >>>>> ep->poll_wait. >>>>>>> The path that I'm concerned about is let's say that there are N >>>>> events >>>>>>> queued on the ready list. A thread that was woken up in >>>>> epoll_wait may >>>>>>> decide to only process say N/2 of then. Then it will call wakeup >>>>> on ep->wq >>>>>>> and this will wakeup another thread to process the remaining N/2. >>>>> Without >>>>>>> the wakeup, the original thread isn't going to process the events >>>>> until >>>>>>> it finishes with the original N/2 and gets back to epoll_wait(). >>>>> So I'm not >>>>>>> sure how important that path is but I wanted to at least note the >>>>> change >>>>>>> here would impact that behavior. >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> -Jason >>>>>>> >>>>>>> >>>>>>>> >>>>>>>>> So I think removing all that even for the >>>>>>>>> depth 0 case is going to change some behavior here. So >>>>> perhaps, it >>>>>>>>> should be removed for all depths except for 0? And if so, it >>>>> may be >>>>>>>>> better to make 2 patches here to separate these changes. >>>>>>>>> >>>>>>>>> For the nested wakeups, I agree that the extra wakeups seem >>>>> unnecessary >>>>>>>>> and it may make sense to remove them for all depths. I don't >>>>> think the >>>>>>>>> nested epoll semantics are particularly well spelled out, and >>>>> afaict, >>>>>>>>> nested epoll() has behaved this way for quite some time. 
And >>>>> the current >>>>>>>>> behavior is not bad in the way that a missing wakeup or false >>>>> negative >>>>>>>>> would be. >>>>>>>> >>>>>>>> That's 100% true! For edge mode extra wake up is not a bug, not >>>>> optimal >>>>>>>> for userspace - yes, but that can't lead to any lost wakeups. >>>>>>>> >>>>>>>> -- >>>>>>>> Roman >>>>>>>> >>>>>> >>>>>> I tried to remove the whole chunk of code that Roman said, and it >>>>>> seems that there >>>>>> are no obvious problems with the two test programs below: >>>> >>>> I recall this message, the test case 9/25/26 of epoll-wakeup (on >>>> github) are failed while >>>> the whole chunk are removed. >>>> >>>> Apply the original patch, all tests passed. >>> >>> >>> These are failing on my bare 5.2.0-rc2 >>> >>> TEST bin/epoll31 FAIL >>> TEST bin/epoll46 FAIL >>> TEST bin/epoll50 FAIL >>> TEST bin/epoll32 FAIL >>> TEST bin/epoll19 FAIL >>> TEST bin/epoll27 FAIL >>> TEST bin/epoll42 FAIL >>> TEST bin/epoll34 FAIL >>> TEST bin/epoll48 FAIL >>> TEST bin/epoll40 FAIL >>> TEST bin/epoll20 FAIL >>> TEST bin/epoll28 FAIL >>> TEST bin/epoll38 FAIL >>> TEST bin/epoll52 FAIL >>> TEST bin/epoll24 FAIL >>> TEST bin/epoll23 FAIL >>> >>> >>> These are failing if your patch is applied: >>> (my 5.2.0-rc2 is old? broken?) >>> >>> TEST bin/epoll46 FAIL >>> TEST bin/epoll42 FAIL >>> TEST bin/epoll34 FAIL >>> TEST bin/epoll48 FAIL >>> TEST bin/epoll40 FAIL >>> TEST bin/epoll44 FAIL >>> TEST bin/epoll38 FAIL >>> >>> These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called, >>> but wakeup(&ep->wq); is still invoked: >>> >>> TEST bin/epoll46 FAIL >>> TEST bin/epoll42 FAIL >>> TEST bin/epoll34 FAIL >>> TEST bin/epoll40 FAIL >>> TEST bin/epoll44 FAIL >>> TEST bin/epoll38 FAIL >>> >>> So at least 48 has been "fixed". 
>>> >>> These are failing if the whole chunk is removed, like your >>> said 9,25,26 are among which do not pass: >>> >>> TEST bin/epoll26 FAIL >>> TEST bin/epoll42 FAIL >>> TEST bin/epoll34 FAIL >>> TEST bin/epoll9 FAIL >>> TEST bin/epoll48 FAIL >>> TEST bin/epoll40 FAIL >>> TEST bin/epoll25 FAIL >>> TEST bin/epoll44 FAIL >>> TEST bin/epoll38 FAIL >>> >>> This can be a good test suite, probably can be added to kselftests? > > Thank you, I have updated epoll-tests to fix these issues. I think this is good > news if we can added to kselftests. ;) > >>> >>> -- >>> Roman >>> >> >> >> Indeed, I just tried the same test suite and I am seeing similar >> failures - it looks like its a bit timing dependent. It looks like all >> the failures are caused by a similar issue. For example, take epoll34: >> >> t0 t1 >> (ew) | | (ew) >> e0 | >> (lt) \ / >> | >> e1 >> | (et) >> s0 >> >> >> The test is trying to assert that an epoll_wait() on e1 and and >> epoll_wait() on e0 both return 1 event for EPOLLIN. However, the >> epoll_wait on e1 is done in a thread and it can happen before or after >> the epoll_wait() is called against e0. If the epoll_wait() on e1 happens >> first then because its attached as 'et', it consumes the event. So that >> there is no longer an event reported at e0. I think that is reasonable >> semantics here. However if the wait on e0 happens after the wait on e1 >> then the test will pass as both waits will see the event. 
>> Thus, a patch like this will 'fix' this testcase:
>>
>> --- a/src/epoll34.c
>> +++ b/src/epoll34.c
>> @@ -59,15 +59,15 @@ int main(int argc, char *argv[])
>>          if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
>>                  goto out;
>>
>> -        if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
>> -                goto out;
>> -
>>          if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
>>                  goto out;
>>
>>          if (epoll_wait(efd[0], &e, 1, 500) == 1)
>>                  count++;
>>
>> +        if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
>> +                goto out;
>> +
>>          if (pthread_join(tw, NULL) < 0)
>>                  goto out;
>>
>> I found all the other failures to be of similar origin. I suspect
>> Heiher didn't see failures due to the thread timings here.
>
> Thank you. I also found a multi-threaded concurrent accumulation
> problem, and that has been changed to atomic operations. I think we
> should allow two different behaviors to pass, because they are both
> correct:
>
> thread 2:
>     if (epoll_wait(efd[1], &e, 1, 500) == 1)
>         __sync_fetch_and_or(&count, 1);
>
> thread 1:
>     if (epoll_wait(efd[0], &e, 1, 500) == 1)
>         __sync_fetch_and_or(&count, 2);
>
> check:
>     if ((count != 1) && (count != 3))
>         goto out;
>
>>
>> I also found that all the testcases pass if we leave the wakeup(&ep->wq)
>> call for the depth 0 case (and remove the pwake part).
>
> So, we need to keep the wakeup(&ep->wq) for all depths, and only do
> wakeup(&ep->poll_wait) for depth 0 and/or when ep->rdllist goes from
> empty to non-empty?

Yes, my thinking is: leave the wakeup(&ep->wq) for all depths. It may
not strictly be necessary for depths > 0, but I do have some concerns
about it, and it doesn't affect this specific case. So I would leave it
for all depths, and if you want to remove it for depths > 0, that's a
separate patch.

I'm now not sure it's really safe to remove the wakeup(&ep->poll_wait)
either. Take the case where we have:

     (et)  (lt)
 t0---e0---e1----s0

Let's say there is a thread t1 also doing epoll_wait() on e1.
Now, let's say s0 fires an event and wakes up t1. While t1 is reaping
events from e1, it's calling ep_scan_ready_list(), and while it's
processing the events it drops the ep->lock and more events get queued
to the overflow list. Now, let's say t0 wakes up from epoll_wait(): it
will not see any of the events queued to e1 yet (since they are not on
the ready list yet). Thus, t0 goes back to sleep, and I think it can
miss an event here. So I think that's what the wakeup(&ep->poll_wait)
is about.

What I think we can do to fix the original case is that, in the case
where events get queued to the overflow list, we do in fact do the
wakeup(&ep->poll_wait). So perhaps we can just have a variable that
tracks whether there is anything in the overflow list, and do the
wakeup(&ep->poll_wait) conditional on there being things on the
overflow list. Although I can't say I've tried to see if this would
work.

That said, I think ep_modify() might also subtly depend on the current
behavior, in that ep_item_poll() uses ep_scan_ready_list(), and when an
event is modified it may need to wake up nested epoll fds to inform
them of events. Thus, perhaps, we would also need an explicit call to
wakeup(&ep->poll_wait) when ep_item_poll() returns events.

Thanks,

-Jason
Thread overview: 12+ messages

2019-09-02  5:20 [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll hev
2019-09-02 15:36 ` Roman Penyaev
2019-09-03 21:08 ` Jason Baron
2019-09-04  9:57 ` Roman Penyaev
2019-09-04 12:02 ` Jason Baron
2019-09-04 14:02 ` Heiher
2019-09-05  2:53 ` Heiher
2019-09-05  9:56 ` Heiher
2019-09-05 17:27 ` Roman Penyaev
2019-09-05 17:44 ` Jason Baron
2019-09-11  8:19 ` Heiher
2019-09-12 21:36 ` Jason Baron