All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
@ 2019-09-02  5:20 hev
  2019-09-02 15:36 ` Roman Penyaev
  0 siblings, 1 reply; 12+ messages in thread
From: hev @ 2019-09-02  5:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: e, Heiher, Al Viro, Andrew Morton, Davide Libenzi,
	Davidlohr Bueso, Dominik Brodowski, Jason Baron, Linus Torvalds,
	Roman Penyaev, Sridhar Samudrala, linux-kernel

From: Heiher <r@hev.cc>

The structure of event pools:
 efd[1]: { efd[2] (EPOLLIN) }        efd[0]: { efd[2] (EPOLLIN | EPOLLET) }
               |                                   |
               +-----------------+-----------------+
                                 |
                                 v
                             efd[2]: { sfd[0] (EPOLLIN) }

When sfd[0] to be readable:
 * the epoll_wait(efd[0], ..., 0) should return efd[2]'s events on first call,
   and returns 0 on next calls, because efd[2] is added in edge-triggered mode.
 * the epoll_wait(efd[1], ..., 0) should returns efd[2]'s events on every calls
   until efd[2] is not readable (epoll_wait(efd[2], ...) => 0), because efd[1]
   is added in level-triggered mode.
 * the epoll_wait(efd[2], ..., 0) should returns sfd[0]'s events on every calls
   until sfd[0] is not readable (read(sfd[0], ...) => EAGAIN), because sfd[0]
   is added in level-triggered mode.

Test code:
 #include <stdio.h>
 #include <unistd.h>
 #include <sys/epoll.h>
 #include <sys/socket.h>

 int main(int argc, char *argv[])
 {
 	int sfd[2];
 	int efd[3];
 	int nfds;
 	struct epoll_event e;

 	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
 		goto out;

 	efd[0] = epoll_create(1);
 	if (efd[0] < 0)
 		goto out;

 	efd[1] = epoll_create(1);
 	if (efd[1] < 0)
 		goto out;

 	efd[2] = epoll_create(1);
 	if (efd[2] < 0)
 		goto out;

 	e.events = EPOLLIN;
 	if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[0], &e) < 0)
 		goto out;

 	e.events = EPOLLIN;
 	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, efd[2], &e) < 0)
 		goto out;

 	e.events = EPOLLIN | EPOLLET;
 	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0)
 		goto out;

 	if (write(sfd[1], "w", 1) != 1)
 		goto out;

 	nfds = epoll_wait(efd[0], &e, 1, 0);
 	if (nfds != 1)
 		goto out;

 	nfds = epoll_wait(efd[0], &e, 1, 0);
 	if (nfds != 0)
 		goto out;

 	nfds = epoll_wait(efd[1], &e, 1, 0);
 	if (nfds != 1)
 		goto out;

 	nfds = epoll_wait(efd[1], &e, 1, 0);
 	if (nfds != 1)
 		goto out;

 	nfds = epoll_wait(efd[2], &e, 1, 0);
 	if (nfds != 1)
 		goto out;

 	nfds = epoll_wait(efd[2], &e, 1, 0);
 	if (nfds != 1)
 		goto out;

 	close(efd[2]);
 	close(efd[1]);
 	close(efd[0]);
 	close(sfd[0]);
 	close(sfd[1]);

 	printf("PASS\n");
 	return 0;

 out:
 	printf("FAIL\n");
 	return -1;
 }

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: hev <r@hev.cc>
---
 fs/eventpoll.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d7f1f5011fac..a44cb27c636c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -672,6 +672,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 {
 	__poll_t res;
 	int pwake = 0;
+	int nwake = 0;
 	struct epitem *epi, *nepi;
 	LIST_HEAD(txlist);
 
@@ -685,6 +686,9 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	if (!ep_locked)
 		mutex_lock_nested(&ep->mtx, depth);
 
+	if (!depth || list_empty(&ep->rdllist))
+		nwake = 1;
+
 	/*
 	 * Steal the ready list, and re-init the original one to the
 	 * empty list. Also, set ep->ovflist to NULL so that events
@@ -739,7 +743,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	list_splice(&txlist, &ep->rdllist);
 	__pm_relax(ep->ws);
 
-	if (!list_empty(&ep->rdllist)) {
+	if (nwake && !list_empty(&ep->rdllist)) {
 		/*
 		 * Wake up (if active) both the eventpoll wait list and
 		 * the ->poll() wait list (delayed after we release the lock).
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-02  5:20 [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll hev
@ 2019-09-02 15:36 ` Roman Penyaev
  2019-09-03 21:08   ` Jason Baron
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Penyaev @ 2019-09-02 15:36 UTC (permalink / raw)
  To: hev
  Cc: linux-fsdevel, e, Al Viro, Andrew Morton, Davide Libenzi,
	Davidlohr Bueso, Dominik Brodowski, Jason Baron, Linus Torvalds,
	Sridhar Samudrala, linux-kernel

Hi,

This is indeed a bug. (quick side note: could you please remove efd[1]
from your test, because it is not related to the reproduction of a
current bug).

Your patch lacks a good description, what exactly you've fixed.  Let
me speak out loud and please correct me if I'm wrong, my understanding
of epoll internals has become a bit rusty: when epoll fds are nested
an attempt to harvest events (ep_scan_ready_list() call) produces a
second (repeated) event from an internal fd up to an external fd:

      epoll_wait(efd[0], ...):
        ep_send_events():
           ep_scan_ready_list(depth=0):
             ep_send_events_proc():
                 ep_item_poll():
                   ep_scan_ready_list(depth=1):
                     ep_poll_safewake():
                       ep_poll_callback()
                         list_add_tail(&epi, &epi->rdllist);
                         ^^^^^^
                         repeated event


In your patch you forbid wakeup for the cases, where depth != 0, i.e.
for all nested cases. That seems clear.  But what if we can go further
and remove the whole chunk, which seems excessive:

@@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct 
eventpoll *ep,

-
-       if (!list_empty(&ep->rdllist)) {
-               /*
-                * Wake up (if active) both the eventpoll wait list and
-                * the ->poll() wait list (delayed after we release the 
lock).
-                */
-               if (waitqueue_active(&ep->wq))
-                       wake_up(&ep->wq);
-               if (waitqueue_active(&ep->poll_wait))
-                       pwake++;
-       }
         write_unlock_irq(&ep->lock);

         if (!ep_locked)
                 mutex_unlock(&ep->mtx);

-       /* We have to call this outside the lock */
-       if (pwake)
-               ep_poll_safewake(&ep->poll_wait);


I reason like that: by the time we've reached the point of scanning 
events
for readiness all wakeups from ep_poll_callback have been already fired 
and
new events have been already accounted in ready list (ep_poll_callback() 
calls
the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and 
probably
missing some corner cases.

Thoughts?

PS.  You call list_empty(&ep->rdllist) without ep->lock taken, that is 
fine,
      but you should be _careful_, so list_empty_careful(&ep->rdllist) 
call
      instead.

--
Roman



On 2019-09-02 07:20, hev wrote:
> From: Heiher <r@hev.cc>
> 
> The structure of event pools:
>  efd[1]: { efd[2] (EPOLLIN) }        efd[0]: { efd[2] (EPOLLIN | 
> EPOLLET) }
>                |                                   |
>                +-----------------+-----------------+
>                                  |
>                                  v
>                              efd[2]: { sfd[0] (EPOLLIN) }
> 
> When sfd[0] to be readable:
>  * the epoll_wait(efd[0], ..., 0) should return efd[2]'s events on 
> first call,
>    and returns 0 on next calls, because efd[2] is added in 
> edge-triggered mode.
>  * the epoll_wait(efd[1], ..., 0) should returns efd[2]'s events on 
> every calls
>    until efd[2] is not readable (epoll_wait(efd[2], ...) => 0), because 
> efd[1]
>    is added in level-triggered mode.
>  * the epoll_wait(efd[2], ..., 0) should returns sfd[0]'s events on 
> every calls
>    until sfd[0] is not readable (read(sfd[0], ...) => EAGAIN), because 
> sfd[0]
>    is added in level-triggered mode.
> 
> Test code:
>  #include <stdio.h>
>  #include <unistd.h>
>  #include <sys/epoll.h>
>  #include <sys/socket.h>
> 
>  int main(int argc, char *argv[])
>  {
>  	int sfd[2];
>  	int efd[3];
>  	int nfds;
>  	struct epoll_event e;
> 
>  	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
>  		goto out;
> 
>  	efd[0] = epoll_create(1);
>  	if (efd[0] < 0)
>  		goto out;
> 
>  	efd[1] = epoll_create(1);
>  	if (efd[1] < 0)
>  		goto out;
> 
>  	efd[2] = epoll_create(1);
>  	if (efd[2] < 0)
>  		goto out;
> 
>  	e.events = EPOLLIN;
>  	if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[0], &e) < 0)
>  		goto out;
> 
>  	e.events = EPOLLIN;
>  	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, efd[2], &e) < 0)
>  		goto out;
> 
>  	e.events = EPOLLIN | EPOLLET;
>  	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0)
>  		goto out;
> 
>  	if (write(sfd[1], "w", 1) != 1)
>  		goto out;
> 
>  	nfds = epoll_wait(efd[0], &e, 1, 0);
>  	if (nfds != 1)
>  		goto out;
> 
>  	nfds = epoll_wait(efd[0], &e, 1, 0);
>  	if (nfds != 0)
>  		goto out;
> 
>  	nfds = epoll_wait(efd[1], &e, 1, 0);
>  	if (nfds != 1)
>  		goto out;
> 
>  	nfds = epoll_wait(efd[1], &e, 1, 0);
>  	if (nfds != 1)
>  		goto out;
> 
>  	nfds = epoll_wait(efd[2], &e, 1, 0);
>  	if (nfds != 1)
>  		goto out;
> 
>  	nfds = epoll_wait(efd[2], &e, 1, 0);
>  	if (nfds != 1)
>  		goto out;
> 
>  	close(efd[2]);
>  	close(efd[1]);
>  	close(efd[0]);
>  	close(sfd[0]);
>  	close(sfd[1]);
> 
>  	printf("PASS\n");
>  	return 0;
> 
>  out:
>  	printf("FAIL\n");
>  	return -1;
>  }
> 
> Cc: Al Viro <viro@ZenIV.linux.org.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Dominik Brodowski <linux@dominikbrodowski.net>
> Cc: Eric Wong <e@80x24.org>
> Cc: Jason Baron <jbaron@akamai.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Roman Penyaev <rpenyaev@suse.de>
> Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: hev <r@hev.cc>
> ---
>  fs/eventpoll.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index d7f1f5011fac..a44cb27c636c 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -672,6 +672,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll 
> *ep,
>  {
>  	__poll_t res;
>  	int pwake = 0;
> +	int nwake = 0;
>  	struct epitem *epi, *nepi;
>  	LIST_HEAD(txlist);
> 
> @@ -685,6 +686,9 @@ static __poll_t ep_scan_ready_list(struct eventpoll 
> *ep,
>  	if (!ep_locked)
>  		mutex_lock_nested(&ep->mtx, depth);
> 
> +	if (!depth || list_empty(&ep->rdllist))
> +		nwake = 1;
> +
>  	/*
>  	 * Steal the ready list, and re-init the original one to the
>  	 * empty list. Also, set ep->ovflist to NULL so that events
> @@ -739,7 +743,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll 
> *ep,
>  	list_splice(&txlist, &ep->rdllist);
>  	__pm_relax(ep->ws);
> 
> -	if (!list_empty(&ep->rdllist)) {
> +	if (nwake && !list_empty(&ep->rdllist)) {
>  		/*
>  		 * Wake up (if active) both the eventpoll wait list and
>  		 * the ->poll() wait list (delayed after we release the lock).


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-02 15:36 ` Roman Penyaev
@ 2019-09-03 21:08   ` Jason Baron
  2019-09-04  9:57     ` Roman Penyaev
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Baron @ 2019-09-03 21:08 UTC (permalink / raw)
  To: Roman Penyaev, hev
  Cc: linux-fsdevel, e, Al Viro, Andrew Morton, Davide Libenzi,
	Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
	Sridhar Samudrala, linux-kernel



On 9/2/19 11:36 AM, Roman Penyaev wrote:
> Hi,
> 
> This is indeed a bug. (quick side note: could you please remove efd[1]
> from your test, because it is not related to the reproduction of a
> current bug).
> 
> Your patch lacks a good description, what exactly you've fixed.  Let
> me speak out loud and please correct me if I'm wrong, my understanding
> of epoll internals has become a bit rusty: when epoll fds are nested
> an attempt to harvest events (ep_scan_ready_list() call) produces a
> second (repeated) event from an internal fd up to an external fd:
> 
>      epoll_wait(efd[0], ...):
>        ep_send_events():
>           ep_scan_ready_list(depth=0):
>             ep_send_events_proc():
>                 ep_item_poll():
>                   ep_scan_ready_list(depth=1):
>                     ep_poll_safewake():
>                       ep_poll_callback()
>                         list_add_tail(&epi, &epi->rdllist);
>                         ^^^^^^
>                         repeated event
> 
> 
> In your patch you forbid wakeup for the cases, where depth != 0, i.e.
> for all nested cases. That seems clear.  But what if we can go further
> and remove the whole chunk, which seems excessive:
> 
> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
> eventpoll *ep,
> 
> -
> -       if (!list_empty(&ep->rdllist)) {
> -               /*
> -                * Wake up (if active) both the eventpoll wait list and
> -                * the ->poll() wait list (delayed after we release the
> lock).
> -                */
> -               if (waitqueue_active(&ep->wq))
> -                       wake_up(&ep->wq);
> -               if (waitqueue_active(&ep->poll_wait))
> -                       pwake++;
> -       }
>         write_unlock_irq(&ep->lock);
> 
>         if (!ep_locked)
>                 mutex_unlock(&ep->mtx);
> 
> -       /* We have to call this outside the lock */
> -       if (pwake)
> -               ep_poll_safewake(&ep->poll_wait);
> 
> 
> I reason like that: by the time we've reached the point of scanning events
> for readiness all wakeups from ep_poll_callback have been already fired and
> new events have been already accounted in ready list (ep_poll_callback()
> calls
> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and probably
> missing some corner cases.
> 
> Thoughts?

So the: 'wake_up(&ep->wq);' part, I think is about waking up other
threads that may be in waiting in epoll_wait(). For example, there may
be multiple threads doing epoll_wait() on the same epoll fd, and the
logic above seems to say thread 1 may have processed say N events and
now its going to to go off to work those, so let's wake up thread 2 now
to handle the next chunk. So I think removing all that even for the
depth 0 case is going to change some behavior here. So perhaps, it
should be removed for all depths except for 0? And if so, it may be
better to make 2 patches here to separate these changes.

For the nested wakeups, I agree that the extra wakeups seem unnecessary
and it may make sense to remove them for all depths. I don't think the
nested epoll semantics are particularly well spelled out, and afaict,
nested epoll() has behaved this way for quite some time. And the current
behavior is not bad in the way that a missing wakeup or false negative
would be. It woulbe be good to better understand the use-case more here
and to try and spell out the nested semantics more clearly?

Thanks,

-Jason


> 
> PS.  You call list_empty(&ep->rdllist) without ep->lock taken, that is
> fine,
>      but you should be _careful_, so list_empty_careful(&ep->rdllist) call
>      instead.
> 
> -- 
> Roman
> 
> 
> 
> On 2019-09-02 07:20, hev wrote:
>> From: Heiher <r@hev.cc>
>>
>> The structure of event pools:
>>  efd[1]: { efd[2] (EPOLLIN) }        efd[0]: { efd[2] (EPOLLIN |
>> EPOLLET) }
>>                |                                   |
>>                +-----------------+-----------------+
>>                                  |
>>                                  v
>>                              efd[2]: { sfd[0] (EPOLLIN) }
>>
>> When sfd[0] to be readable:
>>  * the epoll_wait(efd[0], ..., 0) should return efd[2]'s events on
>> first call,
>>    and returns 0 on next calls, because efd[2] is added in
>> edge-triggered mode.
>>  * the epoll_wait(efd[1], ..., 0) should returns efd[2]'s events on
>> every calls
>>    until efd[2] is not readable (epoll_wait(efd[2], ...) => 0),
>> because efd[1]
>>    is added in level-triggered mode.
>>  * the epoll_wait(efd[2], ..., 0) should returns sfd[0]'s events on
>> every calls
>>    until sfd[0] is not readable (read(sfd[0], ...) => EAGAIN), because
>> sfd[0]
>>    is added in level-triggered mode.
>>
>> Test code:
>>  #include <stdio.h>
>>  #include <unistd.h>
>>  #include <sys/epoll.h>
>>  #include <sys/socket.h>
>>
>>  int main(int argc, char *argv[])
>>  {
>>      int sfd[2];
>>      int efd[3];
>>      int nfds;
>>      struct epoll_event e;
>>
>>      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
>>          goto out;
>>
>>      efd[0] = epoll_create(1);
>>      if (efd[0] < 0)
>>          goto out;
>>
>>      efd[1] = epoll_create(1);
>>      if (efd[1] < 0)
>>          goto out;
>>
>>      efd[2] = epoll_create(1);
>>      if (efd[2] < 0)
>>          goto out;
>>
>>      e.events = EPOLLIN;
>>      if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[0], &e) < 0)
>>          goto out;
>>
>>      e.events = EPOLLIN;
>>      if (epoll_ctl(efd[1], EPOLL_CTL_ADD, efd[2], &e) < 0)
>>          goto out;
>>
>>      e.events = EPOLLIN | EPOLLET;
>>      if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0)
>>          goto out;
>>
>>      if (write(sfd[1], "w", 1) != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[0], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[0], &e, 1, 0);
>>      if (nfds != 0)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[1], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[1], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[2], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      nfds = epoll_wait(efd[2], &e, 1, 0);
>>      if (nfds != 1)
>>          goto out;
>>
>>      close(efd[2]);
>>      close(efd[1]);
>>      close(efd[0]);
>>      close(sfd[0]);
>>      close(sfd[1]);
>>
>>      printf("PASS\n");
>>      return 0;
>>
>>  out:
>>      printf("FAIL\n");
>>      return -1;
>>  }
>>
>> Cc: Al Viro <viro@ZenIV.linux.org.uk>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Davide Libenzi <davidel@xmailserver.org>
>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>> Cc: Dominik Brodowski <linux@dominikbrodowski.net>
>> Cc: Eric Wong <e@80x24.org>
>> Cc: Jason Baron <jbaron@akamai.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Roman Penyaev <rpenyaev@suse.de>
>> Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-fsdevel@vger.kernel.org
>> Signed-off-by: hev <r@hev.cc>
>> ---
>>  fs/eventpoll.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index d7f1f5011fac..a44cb27c636c 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -672,6 +672,7 @@ static __poll_t ep_scan_ready_list(struct
>> eventpoll *ep,
>>  {
>>      __poll_t res;
>>      int pwake = 0;
>> +    int nwake = 0;
>>      struct epitem *epi, *nepi;
>>      LIST_HEAD(txlist);
>>
>> @@ -685,6 +686,9 @@ static __poll_t ep_scan_ready_list(struct
>> eventpoll *ep,
>>      if (!ep_locked)
>>          mutex_lock_nested(&ep->mtx, depth);
>>
>> +    if (!depth || list_empty(&ep->rdllist))
>> +        nwake = 1;
>> +
>>      /*
>>       * Steal the ready list, and re-init the original one to the
>>       * empty list. Also, set ep->ovflist to NULL so that events
>> @@ -739,7 +743,7 @@ static __poll_t ep_scan_ready_list(struct
>> eventpoll *ep,
>>      list_splice(&txlist, &ep->rdllist);
>>      __pm_relax(ep->ws);
>>
>> -    if (!list_empty(&ep->rdllist)) {
>> +    if (nwake && !list_empty(&ep->rdllist)) {
>>          /*
>>           * Wake up (if active) both the eventpoll wait list and
>>           * the ->poll() wait list (delayed after we release the lock).
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-03 21:08   ` Jason Baron
@ 2019-09-04  9:57     ` Roman Penyaev
  2019-09-04 12:02       ` Jason Baron
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Penyaev @ 2019-09-04  9:57 UTC (permalink / raw)
  To: Jason Baron
  Cc: hev, linux-fsdevel, e, Al Viro, Andrew Morton, Davide Libenzi,
	Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
	Sridhar Samudrala, linux-kernel

On 2019-09-03 23:08, Jason Baron wrote:
> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>> Hi,
>> 
>> This is indeed a bug. (quick side note: could you please remove efd[1]
>> from your test, because it is not related to the reproduction of a
>> current bug).
>> 
>> Your patch lacks a good description, what exactly you've fixed.  Let
>> me speak out loud and please correct me if I'm wrong, my understanding
>> of epoll internals has become a bit rusty: when epoll fds are nested
>> an attempt to harvest events (ep_scan_ready_list() call) produces a
>> second (repeated) event from an internal fd up to an external fd:
>> 
>>      epoll_wait(efd[0], ...):
>>        ep_send_events():
>>           ep_scan_ready_list(depth=0):
>>             ep_send_events_proc():
>>                 ep_item_poll():
>>                   ep_scan_ready_list(depth=1):
>>                     ep_poll_safewake():
>>                       ep_poll_callback()
>>                         list_add_tail(&epi, &epi->rdllist);
>>                         ^^^^^^
>>                         repeated event
>> 
>> 
>> In your patch you forbid wakeup for the cases, where depth != 0, i.e.
>> for all nested cases. That seems clear.  But what if we can go further
>> and remove the whole chunk, which seems excessive:
>> 
>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
>> eventpoll *ep,
>> 
>> -
>> -       if (!list_empty(&ep->rdllist)) {
>> -               /*
>> -                * Wake up (if active) both the eventpoll wait list 
>> and
>> -                * the ->poll() wait list (delayed after we release 
>> the
>> lock).
>> -                */
>> -               if (waitqueue_active(&ep->wq))
>> -                       wake_up(&ep->wq);
>> -               if (waitqueue_active(&ep->poll_wait))
>> -                       pwake++;
>> -       }
>>         write_unlock_irq(&ep->lock);
>> 
>>         if (!ep_locked)
>>                 mutex_unlock(&ep->mtx);
>> 
>> -       /* We have to call this outside the lock */
>> -       if (pwake)
>> -               ep_poll_safewake(&ep->poll_wait);
>> 
>> 
>> I reason like that: by the time we've reached the point of scanning 
>> events
>> for readiness all wakeups from ep_poll_callback have been already 
>> fired and
>> new events have been already accounted in ready list 
>> (ep_poll_callback()
>> calls
>> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and 
>> probably
>> missing some corner cases.
>> 
>> Thoughts?
> 
> So the: 'wake_up(&ep->wq);' part, I think is about waking up other
> threads that may be in waiting in epoll_wait(). For example, there may
> be multiple threads doing epoll_wait() on the same epoll fd, and the
> logic above seems to say thread 1 may have processed say N events and
> now its going to to go off to work those, so let's wake up thread 2 now
> to handle the next chunk.

Not quite. Thread which calls ep_scan_ready_list() processes all the
events, and while processing those, removes them one by one from the
ready list.  But if event mask is !0 and event belongs to
Level Triggered Mode descriptor (let's say default mode) it tails event
again back to the list (because we are in level mode, so event should
be there).  So at the end of this traversing loop ready list is likely
not empty, and if so, wake up again is called for nested epoll fds.
But, those nested epoll fds should get already all the notifications
from the main event callback ep_poll_callback(), regardless any thread
which traverses events.

I suppose this logic exists for decades, when Davide (the author) was
reshuffling the code here and there.

But I do not feel confidence to state that this extra wakeup is bogus,
I just have a gut feeling that it looks excessive.

> So I think removing all that even for the
> depth 0 case is going to change some behavior here. So perhaps, it
> should be removed for all depths except for 0? And if so, it may be
> better to make 2 patches here to separate these changes.
> 
> For the nested wakeups, I agree that the extra wakeups seem unnecessary
> and it may make sense to remove them for all depths. I don't think the
> nested epoll semantics are particularly well spelled out, and afaict,
> nested epoll() has behaved this way for quite some time. And the 
> current
> behavior is not bad in the way that a missing wakeup or false negative
> would be.

That's 100% true! For edge mode extra wake up is not a bug, not optimal
for userspace - yes, but that can't lead to any lost wakeups.

--
Roman


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-04  9:57     ` Roman Penyaev
@ 2019-09-04 12:02       ` Jason Baron
  2019-09-04 14:02         ` Heiher
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Baron @ 2019-09-04 12:02 UTC (permalink / raw)
  To: Roman Penyaev
  Cc: hev, linux-fsdevel, e, Al Viro, Andrew Morton, Davide Libenzi,
	Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
	Sridhar Samudrala, linux-kernel



On 9/4/19 5:57 AM, Roman Penyaev wrote:
> On 2019-09-03 23:08, Jason Baron wrote:
>> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>>> Hi,
>>>
>>> This is indeed a bug. (quick side note: could you please remove efd[1]
>>> from your test, because it is not related to the reproduction of a
>>> current bug).
>>>
>>> Your patch lacks a good description, what exactly you've fixed.  Let
>>> me speak out loud and please correct me if I'm wrong, my understanding
>>> of epoll internals has become a bit rusty: when epoll fds are nested
>>> an attempt to harvest events (ep_scan_ready_list() call) produces a
>>> second (repeated) event from an internal fd up to an external fd:
>>>
>>>      epoll_wait(efd[0], ...):
>>>        ep_send_events():
>>>           ep_scan_ready_list(depth=0):
>>>             ep_send_events_proc():
>>>                 ep_item_poll():
>>>                   ep_scan_ready_list(depth=1):
>>>                     ep_poll_safewake():
>>>                       ep_poll_callback()
>>>                         list_add_tail(&epi, &epi->rdllist);
>>>                         ^^^^^^
>>>                         repeated event
>>>
>>>
>>> In your patch you forbid wakeup for the cases, where depth != 0, i.e.
>>> for all nested cases. That seems clear.  But what if we can go further
>>> and remove the whole chunk, which seems excessive:
>>>
>>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
>>> eventpoll *ep,
>>>
>>> -
>>> -       if (!list_empty(&ep->rdllist)) {
>>> -               /*
>>> -                * Wake up (if active) both the eventpoll wait list and
>>> -                * the ->poll() wait list (delayed after we release the
>>> lock).
>>> -                */
>>> -               if (waitqueue_active(&ep->wq))
>>> -                       wake_up(&ep->wq);
>>> -               if (waitqueue_active(&ep->poll_wait))
>>> -                       pwake++;
>>> -       }
>>>         write_unlock_irq(&ep->lock);
>>>
>>>         if (!ep_locked)
>>>                 mutex_unlock(&ep->mtx);
>>>
>>> -       /* We have to call this outside the lock */
>>> -       if (pwake)
>>> -               ep_poll_safewake(&ep->poll_wait);
>>>
>>>
>>> I reason like that: by the time we've reached the point of scanning events
>>> for readiness all wakeups from ep_poll_callback have been already fired and
>>> new events have been already accounted in ready list (ep_poll_callback()
>>> calls
>>> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and probably
>>> missing some corner cases.
>>>
>>> Thoughts?
>>
>> So the: 'wake_up(&ep->wq);' part, I think is about waking up other
>> threads that may be in waiting in epoll_wait(). For example, there may
>> be multiple threads doing epoll_wait() on the same epoll fd, and the
>> logic above seems to say thread 1 may have processed say N events and
>> now its going to to go off to work those, so let's wake up thread 2 now
>> to handle the next chunk.
> 
> Not quite. Thread which calls ep_scan_ready_list() processes all the
> events, and while processing those, removes them one by one from the
> ready list.  But if event mask is !0 and event belongs to
> Level Triggered Mode descriptor (let's say default mode) it tails event
> again back to the list (because we are in level mode, so event should
> be there).  So at the end of this traversing loop ready list is likely
> not empty, and if so, wake up again is called for nested epoll fds.
> But, those nested epoll fds should get already all the notifications
> from the main event callback ep_poll_callback(), regardless any thread
> which traverses events.
> 
> I suppose this logic exists for decades, when Davide (the author) was
> reshuffling the code here and there.
> 
> But I do not feel confidence to state that this extra wakeup is bogus,
> I just have a gut feeling that it looks excessive.

Note that I was talking about the wakeup done on ep->wq not ep->poll_wait.
The path that I'm concerned about is let's say that there are N events
queued on the ready list. A thread that was woken up in epoll_wait may
decide to only process say N/2 of then. Then it will call wakeup on ep->wq
and this will wakeup another thread to process the remaining N/2. Without
the wakeup, the original thread isn't going to process the events until
it finishes with the original N/2 and gets back to epoll_wait(). So I'm not
sure how important that path is but I wanted to at least note the change
here would impact that behavior.

Thanks,

-Jason


> 
>> So I think removing all that even for the
>> depth 0 case is going to change some behavior here. So perhaps, it
>> should be removed for all depths except for 0? And if so, it may be
>> better to make 2 patches here to separate these changes.
>>
>> For the nested wakeups, I agree that the extra wakeups seem unnecessary
>> and it may make sense to remove them for all depths. I don't think the
>> nested epoll semantics are particularly well spelled out, and afaict,
>> nested epoll() has behaved this way for quite some time. And the current
>> behavior is not bad in the way that a missing wakeup or false negative
>> would be.
> 
> That's 100% true! For edge mode extra wake up is not a bug, not optimal
> for userspace - yes, but that can't lead to any lost wakeups.
> 
> -- 
> Roman
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-04 12:02       ` Jason Baron
@ 2019-09-04 14:02         ` Heiher
  2019-09-05  2:53           ` Heiher
  0 siblings, 1 reply; 12+ messages in thread
From: Heiher @ 2019-09-04 14:02 UTC (permalink / raw)
  To: Jason Baron, linux-fsdevel, Roman Penyaev
  Cc: Eric Wong, Al Viro, Andrew Morton, Davide Libenzi,
	Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
	Sridhar Samudrala, linux-kernel

Hi,

On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
>
>
>
> On 9/4/19 5:57 AM, Roman Penyaev wrote:
> > On 2019-09-03 23:08, Jason Baron wrote:
> >> On 9/2/19 11:36 AM, Roman Penyaev wrote:
> >>> Hi,
> >>>
> >>> This is indeed a bug. (quick side note: could you please remove efd[1]
> >>> from your test, because it is not related to the reproduction of a
> >>> current bug).
> >>>
> >>> Your patch lacks a good description, what exactly you've fixed.  Let
> >>> me speak out loud and please correct me if I'm wrong, my understanding
> >>> of epoll internals has become a bit rusty: when epoll fds are nested
> >>> an attempt to harvest events (ep_scan_ready_list() call) produces a
> >>> second (repeated) event from an internal fd up to an external fd:
> >>>
> >>>      epoll_wait(efd[0], ...):
> >>>        ep_send_events():
> >>>           ep_scan_ready_list(depth=0):
> >>>             ep_send_events_proc():
> >>>                 ep_item_poll():
> >>>                   ep_scan_ready_list(depth=1):
> >>>                     ep_poll_safewake():
> >>>                       ep_poll_callback()
> >>>                         list_add_tail(&epi, &epi->rdllist);
> >>>                         ^^^^^^
> >>>                         repeated event
> >>>
> >>>
> >>> In your patch you forbid wakeup for the cases, where depth != 0, i.e.
> >>> for all nested cases. That seems clear.  But what if we can go further
> >>> and remove the whole chunk, which seems excessive:
> >>>
> >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
> >>> eventpoll *ep,
> >>>
> >>> -
> >>> -       if (!list_empty(&ep->rdllist)) {
> >>> -               /*
> >>> -                * Wake up (if active) both the eventpoll wait list and
> >>> -                * the ->poll() wait list (delayed after we release the
> >>> lock).
> >>> -                */
> >>> -               if (waitqueue_active(&ep->wq))
> >>> -                       wake_up(&ep->wq);
> >>> -               if (waitqueue_active(&ep->poll_wait))
> >>> -                       pwake++;
> >>> -       }
> >>>         write_unlock_irq(&ep->lock);
> >>>
> >>>         if (!ep_locked)
> >>>                 mutex_unlock(&ep->mtx);
> >>>
> >>> -       /* We have to call this outside the lock */
> >>> -       if (pwake)
> >>> -               ep_poll_safewake(&ep->poll_wait);
> >>>
> >>>
> >>> I reason like that: by the time we've reached the point of scanning events
> >>> for readiness all wakeups from ep_poll_callback have been already fired and
> >>> new events have been already accounted in ready list (ep_poll_callback()
> >>> calls
> >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and probably
> >>> missing some corner cases.
> >>>
> >>> Thoughts?
> >>
> >> So the: 'wake_up(&ep->wq);' part, I think is about waking up other
> >> threads that may be in waiting in epoll_wait(). For example, there may
> >> be multiple threads doing epoll_wait() on the same epoll fd, and the
> >> logic above seems to say thread 1 may have processed say N events and
> >> now its going to to go off to work those, so let's wake up thread 2 now
> >> to handle the next chunk.
> >
> > Not quite. Thread which calls ep_scan_ready_list() processes all the
> > events, and while processing those, removes them one by one from the
> > ready list.  But if event mask is !0 and event belongs to
> > Level Triggered Mode descriptor (let's say default mode) it tails event
> > again back to the list (because we are in level mode, so event should
> > be there).  So at the end of this traversing loop ready list is likely
> > not empty, and if so, wake up again is called for nested epoll fds.
> > But, those nested epoll fds should get already all the notifications
> > from the main event callback ep_poll_callback(), regardless any thread
> > which traverses events.
> >
> > I suppose this logic exists for decades, when Davide (the author) was
> > reshuffling the code here and there.
> >
> > But I do not feel confidence to state that this extra wakeup is bogus,
> > I just have a gut feeling that it looks excessive.
>
> Note that I was talking about the wakeup done on ep->wq not ep->poll_wait.
> The path that I'm concerned about is let's say that there are N events
> queued on the ready list. A thread that was woken up in epoll_wait may
> decide to only process say N/2 of then. Then it will call wakeup on ep->wq
> and this will wakeup another thread to process the remaining N/2. Without
> the wakeup, the original thread isn't going to process the events until
> it finishes with the original N/2 and gets back to epoll_wait(). So I'm not
> sure how important that path is but I wanted to at least note the change
> here would impact that behavior.
>
> Thanks,
>
> -Jason
>
>
> >
> >> So I think removing all that even for the
> >> depth 0 case is going to change some behavior here. So perhaps, it
> >> should be removed for all depths except for 0? And if so, it may be
> >> better to make 2 patches here to separate these changes.
> >>
> >> For the nested wakeups, I agree that the extra wakeups seem unnecessary
> >> and it may make sense to remove them for all depths. I don't think the
> >> nested epoll semantics are particularly well spelled out, and afaict,
> >> nested epoll() has behaved this way for quite some time. And the current
> >> behavior is not bad in the way that a missing wakeup or false negative
> >> would be.
> >
> > That's 100% true! For edge mode extra wake up is not a bug, not optimal
> > for userspace - yes, but that can't lead to any lost wakeups.
> >
> > --
> > Roman
> >

I tried to remove the whole chunk of code that Roman said, and it
seems that there
are no obvious problems with the two test programs below:

Test case 1:
           t0
            |
           e0
            |
           e1 (et)
            |
           s0 (lt)

When s0 is readable, the thread 0 can only read once event from e0.

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(int argc, char *argv[])
{
    int sfd[2];
    int efd[2];
    int nfds;
    struct epoll_event e;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
        goto out;

    efd[0] = epoll_create(1);
    if (efd[0] < 0)
        goto out;

    efd[1] = epoll_create(1);
    if (efd[1] < 0)
        goto out;

    e.events = EPOLLIN;
    if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
        goto out;

    e.events = EPOLLIN | EPOLLET;
    if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
        goto out;

    if (write(sfd[1], "w", 1) != 1)
        goto out;

    nfds = epoll_wait(efd[0], &e, 1, 0);
    if (nfds != 1)
        goto out;

    nfds = epoll_wait(efd[0], &e, 1, 0);
    if (nfds != 0)
        goto out;

    nfds = epoll_wait(efd[1], &e, 1, 0);
    if (nfds != 1)
        goto out;

    nfds = epoll_wait(efd[1], &e, 1, 0);
    if (nfds != 1)
        goto out;

    close(efd[1]);
    close(efd[0]);
    close(sfd[0]);
    close(sfd[1]);

    printf("PASS\n");
    return 0;

out:
    printf("FAIL\n");
    return -1;
}

Test case 2:
           t0  t1
            \   /
             e0
            /   \
    (et) e1   e2 (et)
           |      |
     (lt) s0    s2 (lt)

When s0 and s2 are readable, both thread 0 and thread 1 can read an
event from e0.

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/epoll.h>
#include <sys/socket.h>

static int efd[3];
static int sfd[4];
static int count;

static void *
thread_handler(void *data)
{
    struct epoll_event e;

    if (epoll_wait(efd[0], &e, 1, -1) == 1)
        count++;

    return NULL;
}

static void *
emit_handler(void *data)
{
    usleep (100000);

    write(sfd[1], "w", 1);
    write(sfd[3], "w", 1);

    return NULL;
}

int main(int argc, char *argv[])
{
    struct epoll_event e;
    pthread_t tw, te;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[0]) < 0)
        goto out;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[2]) < 0)
        goto out;

    efd[0] = epoll_create(1);
    if (efd[0] < 0)
        goto out;

    efd[1] = epoll_create(1);
    if (efd[1] < 0)
        goto out;

    efd[2] = epoll_create(1);
    if (efd[2] < 0)
        goto out;

    e.events = EPOLLIN;
    if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
        goto out;

    e.events = EPOLLIN;
    if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[2], &e) < 0)
        goto out;

    e.events = EPOLLIN | EPOLLET;
    if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
        goto out;

    e.events = EPOLLIN | EPOLLET;
    if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0)
        goto out;

    if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
        goto out;

    if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
        goto out;

    if (epoll_wait(efd[0], &e, 1, -1) == 1)
        count++;

    if (pthread_join(tw, NULL) < 0)
        goto out;

    if (count != 2)
        goto out;

    close(efd[0]);
    close(efd[1]);
    close(efd[2]);
    close(sfd[0]);
    close(sfd[1]);
    close(sfd[2]);
    close(sfd[3]);

    printf ("PASS\n");
    return 0;

out:
    printf("FAIL\n");
    return -1;
}

t0: thread0
t1: thread1
e0: epoll0 (efd[0])
e1: epoll1 (efd[1])
e2: epoll2 (efd[2])
s0: socket0 (sfd[0])
s2: socket2 (sfd[2])

Is it possible to prove that this modification is correct, or any
other corner cases are missing?

-- 
Best regards!
Hev
https://hev.cc

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-04 14:02         ` Heiher
@ 2019-09-05  2:53           ` Heiher
  2019-09-05  9:56             ` Heiher
  0 siblings, 1 reply; 12+ messages in thread
From: Heiher @ 2019-09-05  2:53 UTC (permalink / raw)
  To: Jason Baron, linux-fsdevel, Roman Penyaev
  Cc: Eric Wong, Al Viro, Andrew Morton, Davide Libenzi,
	Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
	Sridhar Samudrala, linux-kernel

Hi,

I created an epoll wakeup test project, listed some possible cases,
and any other corner cases needs to be added?

https://github.com/heiher/epoll-wakeup/blob/master/README.md

On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote:
>
> Hi,
>
> On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
> >
> >
> >
> > On 9/4/19 5:57 AM, Roman Penyaev wrote:
> > > On 2019-09-03 23:08, Jason Baron wrote:
> > >> On 9/2/19 11:36 AM, Roman Penyaev wrote:
> > >>> Hi,
> > >>>
> > >>> This is indeed a bug. (quick side note: could you please remove efd[1]
> > >>> from your test, because it is not related to the reproduction of a
> > >>> current bug).
> > >>>
> > >>> Your patch lacks a good description, what exactly you've fixed.  Let
> > >>> me speak out loud and please correct me if I'm wrong, my understanding
> > >>> of epoll internals has become a bit rusty: when epoll fds are nested
> > >>> an attempt to harvest events (ep_scan_ready_list() call) produces a
> > >>> second (repeated) event from an internal fd up to an external fd:
> > >>>
> > >>>      epoll_wait(efd[0], ...):
> > >>>        ep_send_events():
> > >>>           ep_scan_ready_list(depth=0):
> > >>>             ep_send_events_proc():
> > >>>                 ep_item_poll():
> > >>>                   ep_scan_ready_list(depth=1):
> > >>>                     ep_poll_safewake():
> > >>>                       ep_poll_callback()
> > >>>                         list_add_tail(&epi, &epi->rdllist);
> > >>>                         ^^^^^^
> > >>>                         repeated event
> > >>>
> > >>>
> > >>> In your patch you forbid wakeup for the cases, where depth != 0, i.e.
> > >>> for all nested cases. That seems clear.  But what if we can go further
> > >>> and remove the whole chunk, which seems excessive:
> > >>>
> > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
> > >>> eventpoll *ep,
> > >>>
> > >>> -
> > >>> -       if (!list_empty(&ep->rdllist)) {
> > >>> -               /*
> > >>> -                * Wake up (if active) both the eventpoll wait list and
> > >>> -                * the ->poll() wait list (delayed after we release the
> > >>> lock).
> > >>> -                */
> > >>> -               if (waitqueue_active(&ep->wq))
> > >>> -                       wake_up(&ep->wq);
> > >>> -               if (waitqueue_active(&ep->poll_wait))
> > >>> -                       pwake++;
> > >>> -       }
> > >>>         write_unlock_irq(&ep->lock);
> > >>>
> > >>>         if (!ep_locked)
> > >>>                 mutex_unlock(&ep->mtx);
> > >>>
> > >>> -       /* We have to call this outside the lock */
> > >>> -       if (pwake)
> > >>> -               ep_poll_safewake(&ep->poll_wait);
> > >>>
> > >>>
> > >>> I reason like that: by the time we've reached the point of scanning events
> > >>> for readiness all wakeups from ep_poll_callback have been already fired and
> > >>> new events have been already accounted in ready list (ep_poll_callback()
> > >>> calls
> > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and probably
> > >>> missing some corner cases.
> > >>>
> > >>> Thoughts?
> > >>
> > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up other
> > >> threads that may be in waiting in epoll_wait(). For example, there may
> > >> be multiple threads doing epoll_wait() on the same epoll fd, and the
> > >> logic above seems to say thread 1 may have processed say N events and
> > >> now its going to to go off to work those, so let's wake up thread 2 now
> > >> to handle the next chunk.
> > >
> > > Not quite. Thread which calls ep_scan_ready_list() processes all the
> > > events, and while processing those, removes them one by one from the
> > > ready list.  But if event mask is !0 and event belongs to
> > > Level Triggered Mode descriptor (let's say default mode) it tails event
> > > again back to the list (because we are in level mode, so event should
> > > be there).  So at the end of this traversing loop ready list is likely
> > > not empty, and if so, wake up again is called for nested epoll fds.
> > > But, those nested epoll fds should get already all the notifications
> > > from the main event callback ep_poll_callback(), regardless any thread
> > > which traverses events.
> > >
> > > I suppose this logic exists for decades, when Davide (the author) was
> > > reshuffling the code here and there.
> > >
> > > But I do not feel confidence to state that this extra wakeup is bogus,
> > > I just have a gut feeling that it looks excessive.
> >
> > Note that I was talking about the wakeup done on ep->wq not ep->poll_wait.
> > The path that I'm concerned about is let's say that there are N events
> > queued on the ready list. A thread that was woken up in epoll_wait may
> > decide to only process say N/2 of then. Then it will call wakeup on ep->wq
> > and this will wakeup another thread to process the remaining N/2. Without
> > the wakeup, the original thread isn't going to process the events until
> > it finishes with the original N/2 and gets back to epoll_wait(). So I'm not
> > sure how important that path is but I wanted to at least note the change
> > here would impact that behavior.
> >
> > Thanks,
> >
> > -Jason
> >
> >
> > >
> > >> So I think removing all that even for the
> > >> depth 0 case is going to change some behavior here. So perhaps, it
> > >> should be removed for all depths except for 0? And if so, it may be
> > >> better to make 2 patches here to separate these changes.
> > >>
> > >> For the nested wakeups, I agree that the extra wakeups seem unnecessary
> > >> and it may make sense to remove them for all depths. I don't think the
> > >> nested epoll semantics are particularly well spelled out, and afaict,
> > >> nested epoll() has behaved this way for quite some time. And the current
> > >> behavior is not bad in the way that a missing wakeup or false negative
> > >> would be.
> > >
> > > That's 100% true! For edge mode extra wake up is not a bug, not optimal
> > > for userspace - yes, but that can't lead to any lost wakeups.
> > >
> > > --
> > > Roman
> > >
>
> I tried to remove the whole chunk of code that Roman said, and it
> seems that there
> are no obvious problems with the two test programs below:
>
> Test case 1:
>            t0
>             |
>            e0
>             |
>            e1 (et)
>             |
>            s0 (lt)
>
> When s0 is readable, the thread 0 can only read once event from e0.
>
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/epoll.h>
> #include <sys/socket.h>
>
> int main(int argc, char *argv[])
> {
>     int sfd[2];
>     int efd[2];
>     int nfds;
>     struct epoll_event e;
>
>     if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
>         goto out;
>
>     efd[0] = epoll_create(1);
>     if (efd[0] < 0)
>         goto out;
>
>     efd[1] = epoll_create(1);
>     if (efd[1] < 0)
>         goto out;
>
>     e.events = EPOLLIN;
>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
>         goto out;
>
>     e.events = EPOLLIN | EPOLLET;
>     if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
>         goto out;
>
>     if (write(sfd[1], "w", 1) != 1)
>         goto out;
>
>     nfds = epoll_wait(efd[0], &e, 1, 0);
>     if (nfds != 1)
>         goto out;
>
>     nfds = epoll_wait(efd[0], &e, 1, 0);
>     if (nfds != 0)
>         goto out;
>
>     nfds = epoll_wait(efd[1], &e, 1, 0);
>     if (nfds != 1)
>         goto out;
>
>     nfds = epoll_wait(efd[1], &e, 1, 0);
>     if (nfds != 1)
>         goto out;
>
>     close(efd[1]);
>     close(efd[0]);
>     close(sfd[0]);
>     close(sfd[1]);
>
>     printf("PASS\n");
>     return 0;
>
> out:
>     printf("FAIL\n");
>     return -1;
> }
>
> Test case 2:
>            t0  t1
>             \   /
>              e0
>             /   \
>     (et) e1   e2 (et)
>            |      |
>      (lt) s0    s2 (lt)
>
> When s0 and s2 are readable, both thread 0 and thread 1 can read an
> event from e0.
>
> #include <stdio.h>
> #include <unistd.h>
> #include <pthread.h>
> #include <sys/epoll.h>
> #include <sys/socket.h>
>
> static int efd[3];
> static int sfd[4];
> static int count;
>
> static void *
> thread_handler(void *data)
> {
>     struct epoll_event e;
>
>     if (epoll_wait(efd[0], &e, 1, -1) == 1)
>         count++;
>
>     return NULL;
> }
>
> static void *
> emit_handler(void *data)
> {
>     usleep (100000);
>
>     write(sfd[1], "w", 1);
>     write(sfd[3], "w", 1);
>
>     return NULL;
> }
>
> int main(int argc, char *argv[])
> {
>     struct epoll_event e;
>     pthread_t tw, te;
>
>     if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[0]) < 0)
>         goto out;
>
>     if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[2]) < 0)
>         goto out;
>
>     efd[0] = epoll_create(1);
>     if (efd[0] < 0)
>         goto out;
>
>     efd[1] = epoll_create(1);
>     if (efd[1] < 0)
>         goto out;
>
>     efd[2] = epoll_create(1);
>     if (efd[2] < 0)
>         goto out;
>
>     e.events = EPOLLIN;
>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
>         goto out;
>
>     e.events = EPOLLIN;
>     if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[2], &e) < 0)
>         goto out;
>
>     e.events = EPOLLIN | EPOLLET;
>     if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
>         goto out;
>
>     e.events = EPOLLIN | EPOLLET;
>     if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0)
>         goto out;
>
>     if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
>         goto out;
>
>     if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
>         goto out;
>
>     if (epoll_wait(efd[0], &e, 1, -1) == 1)
>         count++;
>
>     if (pthread_join(tw, NULL) < 0)
>         goto out;
>
>     if (count != 2)
>         goto out;
>
>     close(efd[0]);
>     close(efd[1]);
>     close(efd[2]);
>     close(sfd[0]);
>     close(sfd[1]);
>     close(sfd[2]);
>     close(sfd[3]);
>
>     printf ("PASS\n");
>     return 0;
>
> out:
>     printf("FAIL\n");
>     return -1;
> }
>
> t0: thread0
> t1: thread1
> e0: epoll0 (efd[0])
> e1: epoll1 (efd[1])
> e2: epoll2 (efd[2])
> s0: socket0 (sfd[0])
> s2: socket2 (sfd[2])
>
> Is it possible to prove that this modification is correct, or any
> other corner cases are missing?
>
> --
> Best regards!
> Hev
> https://hev.cc



-- 
Best regards!
Hev
https://hev.cc

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-05  2:53           ` Heiher
@ 2019-09-05  9:56             ` Heiher
  2019-09-05 17:27               ` Roman Penyaev
  0 siblings, 1 reply; 12+ messages in thread
From: Heiher @ 2019-09-05  9:56 UTC (permalink / raw)
  To: Jason Baron, linux-fsdevel, Roman Penyaev
  Cc: Eric Wong, Al Viro, Andrew Morton, Davide Libenzi,
	Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
	Sridhar Samudrala, linux-kernel

Hi,

On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote:
>
> Hi,
>
> I created an epoll wakeup test project, listed some possible cases,
> and any other corner cases needs to be added?
>
> https://github.com/heiher/epoll-wakeup/blob/master/README.md
>
> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote:
> >
> > Hi,
> >
> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
> > >
> > >
> > >
> > > On 9/4/19 5:57 AM, Roman Penyaev wrote:
> > > > On 2019-09-03 23:08, Jason Baron wrote:
> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote:
> > > >>> Hi,
> > > >>>
> > > >>> This is indeed a bug. (quick side note: could you please remove efd[1]
> > > >>> from your test, because it is not related to the reproduction of a
> > > >>> current bug).
> > > >>>
> > > >>> Your patch lacks a good description, what exactly you've fixed.  Let
> > > >>> me speak out loud and please correct me if I'm wrong, my understanding
> > > >>> of epoll internals has become a bit rusty: when epoll fds are nested
> > > >>> an attempt to harvest events (ep_scan_ready_list() call) produces a
> > > >>> second (repeated) event from an internal fd up to an external fd:
> > > >>>
> > > >>>      epoll_wait(efd[0], ...):
> > > >>>        ep_send_events():
> > > >>>           ep_scan_ready_list(depth=0):
> > > >>>             ep_send_events_proc():
> > > >>>                 ep_item_poll():
> > > >>>                   ep_scan_ready_list(depth=1):
> > > >>>                     ep_poll_safewake():
> > > >>>                       ep_poll_callback()
> > > >>>                         list_add_tail(&epi, &epi->rdllist);
> > > >>>                         ^^^^^^
> > > >>>                         repeated event
> > > >>>
> > > >>>
> > > >>> In your patch you forbid wakeup for the cases, where depth != 0, i.e.
> > > >>> for all nested cases. That seems clear.  But what if we can go further
> > > >>> and remove the whole chunk, which seems excessive:
> > > >>>
> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
> > > >>> eventpoll *ep,
> > > >>>
> > > >>> -
> > > >>> -       if (!list_empty(&ep->rdllist)) {
> > > >>> -               /*
> > > >>> -                * Wake up (if active) both the eventpoll wait list and
> > > >>> -                * the ->poll() wait list (delayed after we release the
> > > >>> lock).
> > > >>> -                */
> > > >>> -               if (waitqueue_active(&ep->wq))
> > > >>> -                       wake_up(&ep->wq);
> > > >>> -               if (waitqueue_active(&ep->poll_wait))
> > > >>> -                       pwake++;
> > > >>> -       }
> > > >>>         write_unlock_irq(&ep->lock);
> > > >>>
> > > >>>         if (!ep_locked)
> > > >>>                 mutex_unlock(&ep->mtx);
> > > >>>
> > > >>> -       /* We have to call this outside the lock */
> > > >>> -       if (pwake)
> > > >>> -               ep_poll_safewake(&ep->poll_wait);
> > > >>>
> > > >>>
> > > >>> I reason like that: by the time we've reached the point of scanning events
> > > >>> for readiness all wakeups from ep_poll_callback have been already fired and
> > > >>> new events have been already accounted in ready list (ep_poll_callback()
> > > >>> calls
> > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and probably
> > > >>> missing some corner cases.
> > > >>>
> > > >>> Thoughts?
> > > >>
> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up other
> > > >> threads that may be in waiting in epoll_wait(). For example, there may
> > > >> be multiple threads doing epoll_wait() on the same epoll fd, and the
> > > >> logic above seems to say thread 1 may have processed say N events and
> > > >> now its going to to go off to work those, so let's wake up thread 2 now
> > > >> to handle the next chunk.
> > > >
> > > > Not quite. Thread which calls ep_scan_ready_list() processes all the
> > > > events, and while processing those, removes them one by one from the
> > > > ready list.  But if event mask is !0 and event belongs to
> > > > Level Triggered Mode descriptor (let's say default mode) it tails event
> > > > again back to the list (because we are in level mode, so event should
> > > > be there).  So at the end of this traversing loop ready list is likely
> > > > not empty, and if so, wake up again is called for nested epoll fds.
> > > > But, those nested epoll fds should get already all the notifications
> > > > from the main event callback ep_poll_callback(), regardless any thread
> > > > which traverses events.
> > > >
> > > > I suppose this logic exists for decades, when Davide (the author) was
> > > > reshuffling the code here and there.
> > > >
> > > > But I do not feel confidence to state that this extra wakeup is bogus,
> > > > I just have a gut feeling that it looks excessive.
> > >
> > > Note that I was talking about the wakeup done on ep->wq not ep->poll_wait.
> > > The path that I'm concerned about is let's say that there are N events
> > > queued on the ready list. A thread that was woken up in epoll_wait may
> > > decide to only process say N/2 of then. Then it will call wakeup on ep->wq
> > > and this will wakeup another thread to process the remaining N/2. Without
> > > the wakeup, the original thread isn't going to process the events until
> > > it finishes with the original N/2 and gets back to epoll_wait(). So I'm not
> > > sure how important that path is but I wanted to at least note the change
> > > here would impact that behavior.
> > >
> > > Thanks,
> > >
> > > -Jason
> > >
> > >
> > > >
> > > >> So I think removing all that even for the
> > > >> depth 0 case is going to change some behavior here. So perhaps, it
> > > >> should be removed for all depths except for 0? And if so, it may be
> > > >> better to make 2 patches here to separate these changes.
> > > >>
> > > >> For the nested wakeups, I agree that the extra wakeups seem unnecessary
> > > >> and it may make sense to remove them for all depths. I don't think the
> > > >> nested epoll semantics are particularly well spelled out, and afaict,
> > > >> nested epoll() has behaved this way for quite some time. And the current
> > > >> behavior is not bad in the way that a missing wakeup or false negative
> > > >> would be.
> > > >
> > > > That's 100% true! For edge mode extra wake up is not a bug, not optimal
> > > > for userspace - yes, but that can't lead to any lost wakeups.
> > > >
> > > > --
> > > > Roman
> > > >
> >
> > I tried to remove the whole chunk of code that Roman said, and it
> > seems that there
> > are no obvious problems with the two test programs below:

I recall this message, the test case 9/25/26 of epoll-wakeup (on
github) are failed while
the whole chunk are removed.

Apply the original patch, all tests passed.

> >
> > Test case 1:
> >            t0
> >             |
> >            e0
> >             |
> >            e1 (et)
> >             |
> >            s0 (lt)
> >
> > When s0 is readable, the thread 0 can only read once event from e0.
> >
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <sys/epoll.h>
> > #include <sys/socket.h>
> >
> > int main(int argc, char *argv[])
> > {
> >     int sfd[2];
> >     int efd[2];
> >     int nfds;
> >     struct epoll_event e;
> >
> >     if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
> >         goto out;
> >
> >     efd[0] = epoll_create(1);
> >     if (efd[0] < 0)
> >         goto out;
> >
> >     efd[1] = epoll_create(1);
> >     if (efd[1] < 0)
> >         goto out;
> >
> >     e.events = EPOLLIN;
> >     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
> >         goto out;
> >
> >     e.events = EPOLLIN | EPOLLET;
> >     if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
> >         goto out;
> >
> >     if (write(sfd[1], "w", 1) != 1)
> >         goto out;
> >
> >     nfds = epoll_wait(efd[0], &e, 1, 0);
> >     if (nfds != 1)
> >         goto out;
> >
> >     nfds = epoll_wait(efd[0], &e, 1, 0);
> >     if (nfds != 0)
> >         goto out;
> >
> >     nfds = epoll_wait(efd[1], &e, 1, 0);
> >     if (nfds != 1)
> >         goto out;
> >
> >     nfds = epoll_wait(efd[1], &e, 1, 0);
> >     if (nfds != 1)
> >         goto out;
> >
> >     close(efd[1]);
> >     close(efd[0]);
> >     close(sfd[0]);
> >     close(sfd[1]);
> >
> >     printf("PASS\n");
> >     return 0;
> >
> > out:
> >     printf("FAIL\n");
> >     return -1;
> > }
> >
> > Test case 2:
> >            t0  t1
> >             \   /
> >              e0
> >             /   \
> >     (et) e1   e2 (et)
> >            |      |
> >      (lt) s0    s2 (lt)
> >
> > When s0 and s2 are readable, both thread 0 and thread 1 can read an
> > event from e0.
> >
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <pthread.h>
> > #include <sys/epoll.h>
> > #include <sys/socket.h>
> >
> > static int efd[3];
> > static int sfd[4];
> > static int count;
> >
> > static void *
> > thread_handler(void *data)
> > {
> >     struct epoll_event e;
> >
> >     if (epoll_wait(efd[0], &e, 1, -1) == 1)
> >         count++;
> >
> >     return NULL;
> > }
> >
> > static void *
> > emit_handler(void *data)
> > {
> >     usleep (100000);
> >
> >     write(sfd[1], "w", 1);
> >     write(sfd[3], "w", 1);
> >
> >     return NULL;
> > }
> >
> > int main(int argc, char *argv[])
> > {
> >     struct epoll_event e;
> >     pthread_t tw, te;
> >
> >     if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[0]) < 0)
> >         goto out;
> >
> >     if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[2]) < 0)
> >         goto out;
> >
> >     efd[0] = epoll_create(1);
> >     if (efd[0] < 0)
> >         goto out;
> >
> >     efd[1] = epoll_create(1);
> >     if (efd[1] < 0)
> >         goto out;
> >
> >     efd[2] = epoll_create(1);
> >     if (efd[2] < 0)
> >         goto out;
> >
> >     e.events = EPOLLIN;
> >     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
> >         goto out;
> >
> >     e.events = EPOLLIN;
> >     if (epoll_ctl(efd[2], EPOLL_CTL_ADD, sfd[2], &e) < 0)
> >         goto out;
> >
> >     e.events = EPOLLIN | EPOLLET;
> >     if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
> >         goto out;
> >
> >     e.events = EPOLLIN | EPOLLET;
> >     if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[2], &e) < 0)
> >         goto out;
> >
> >     if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
> >         goto out;
> >
> >     if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
> >         goto out;
> >
> >     if (epoll_wait(efd[0], &e, 1, -1) == 1)
> >         count++;
> >
> >     if (pthread_join(tw, NULL) < 0)
> >         goto out;
> >
> >     if (count != 2)
> >         goto out;
> >
> >     close(efd[0]);
> >     close(efd[1]);
> >     close(efd[2]);
> >     close(sfd[0]);
> >     close(sfd[1]);
> >     close(sfd[2]);
> >     close(sfd[3]);
> >
> >     printf ("PASS\n");
> >     return 0;
> >
> > out:
> >     printf("FAIL\n");
> >     return -1;
> > }
> >
> > t0: thread0
> > t1: thread1
> > e0: epoll0 (efd[0])
> > e1: epoll1 (efd[1])
> > e2: epoll2 (efd[2])
> > s0: socket0 (sfd[0])
> > s2: socket2 (sfd[2])
> >
> > Is it possible to prove that this modification is correct, or any
> > other corner cases are missing?
> >
> > --
> > Best regards!
> > Hev
> > https://hev.cc
>
>
>
> --
> Best regards!
> Hev
> https://hev.cc



-- 
Best regards!
Hev
https://hev.cc

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-05  9:56             ` Heiher
@ 2019-09-05 17:27               ` Roman Penyaev
  2019-09-05 17:44                 ` Jason Baron
  0 siblings, 1 reply; 12+ messages in thread
From: Roman Penyaev @ 2019-09-05 17:27 UTC (permalink / raw)
  To: Heiher
  Cc: Jason Baron, linux-fsdevel, Eric Wong, Al Viro, Andrew Morton,
	Davide Libenzi, Davidlohr Bueso, Dominik Brodowski,
	Linus Torvalds, Sridhar Samudrala, linux-kernel

On 2019-09-05 11:56, Heiher wrote:
> Hi,
> 
> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote:
>> 
>> Hi,
>> 
>> I created an epoll wakeup test project, listed some possible cases,
>> and any other corner cases needs to be added?
>> 
>> https://github.com/heiher/epoll-wakeup/blob/master/README.md
>> 
>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote:
>> >
>> > Hi,
>> >
>> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
>> > >
>> > >
>> > >
>> > > On 9/4/19 5:57 AM, Roman Penyaev wrote:
>> > > > On 2019-09-03 23:08, Jason Baron wrote:
>> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>> > > >>> Hi,
>> > > >>>
>> > > >>> This is indeed a bug. (quick side note: could you please remove efd[1]
>> > > >>> from your test, because it is not related to the reproduction of a
>> > > >>> current bug).
>> > > >>>
>> > > >>> Your patch lacks a good description, what exactly you've fixed.  Let
>> > > >>> me speak out loud and please correct me if I'm wrong, my understanding
>> > > >>> of epoll internals has become a bit rusty: when epoll fds are nested
>> > > >>> an attempt to harvest events (ep_scan_ready_list() call) produces a
>> > > >>> second (repeated) event from an internal fd up to an external fd:
>> > > >>>
>> > > >>>      epoll_wait(efd[0], ...):
>> > > >>>        ep_send_events():
>> > > >>>           ep_scan_ready_list(depth=0):
>> > > >>>             ep_send_events_proc():
>> > > >>>                 ep_item_poll():
>> > > >>>                   ep_scan_ready_list(depth=1):
>> > > >>>                     ep_poll_safewake():
>> > > >>>                       ep_poll_callback()
>> > > >>>                         list_add_tail(&epi, &epi->rdllist);
>> > > >>>                         ^^^^^^
>> > > >>>                         repeated event
>> > > >>>
>> > > >>>
>> > > >>> In your patch you forbid wakeup for the cases, where depth != 0, i.e.
>> > > >>> for all nested cases. That seems clear.  But what if we can go further
>> > > >>> and remove the whole chunk, which seems excessive:
>> > > >>>
>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
>> > > >>> eventpoll *ep,
>> > > >>>
>> > > >>> -
>> > > >>> -       if (!list_empty(&ep->rdllist)) {
>> > > >>> -               /*
>> > > >>> -                * Wake up (if active) both the eventpoll wait list and
>> > > >>> -                * the ->poll() wait list (delayed after we release the
>> > > >>> lock).
>> > > >>> -                */
>> > > >>> -               if (waitqueue_active(&ep->wq))
>> > > >>> -                       wake_up(&ep->wq);
>> > > >>> -               if (waitqueue_active(&ep->poll_wait))
>> > > >>> -                       pwake++;
>> > > >>> -       }
>> > > >>>         write_unlock_irq(&ep->lock);
>> > > >>>
>> > > >>>         if (!ep_locked)
>> > > >>>                 mutex_unlock(&ep->mtx);
>> > > >>>
>> > > >>> -       /* We have to call this outside the lock */
>> > > >>> -       if (pwake)
>> > > >>> -               ep_poll_safewake(&ep->poll_wait);
>> > > >>>
>> > > >>>
>> > > >>> I reason like that: by the time we've reached the point of scanning events
>> > > >>> for readiness all wakeups from ep_poll_callback have been already fired and
>> > > >>> new events have been already accounted in ready list (ep_poll_callback()
>> > > >>> calls
>> > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100% sure and probably
>> > > >>> missing some corner cases.
>> > > >>>
>> > > >>> Thoughts?
>> > > >>
>> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up other
>> > > >> threads that may be in waiting in epoll_wait(). For example, there may
>> > > >> be multiple threads doing epoll_wait() on the same epoll fd, and the
>> > > >> logic above seems to say thread 1 may have processed say N events and
>> > > >> now its going to to go off to work those, so let's wake up thread 2 now
>> > > >> to handle the next chunk.
>> > > >
>> > > > Not quite. Thread which calls ep_scan_ready_list() processes all the
>> > > > events, and while processing those, removes them one by one from the
>> > > > ready list.  But if event mask is !0 and event belongs to
>> > > > Level Triggered Mode descriptor (let's say default mode) it tails event
>> > > > again back to the list (because we are in level mode, so event should
>> > > > be there).  So at the end of this traversing loop ready list is likely
>> > > > not empty, and if so, wake up again is called for nested epoll fds.
>> > > > But, those nested epoll fds should get already all the notifications
>> > > > from the main event callback ep_poll_callback(), regardless any thread
>> > > > which traverses events.
>> > > >
>> > > > I suppose this logic exists for decades, when Davide (the author) was
>> > > > reshuffling the code here and there.
>> > > >
>> > > > But I do not feel confidence to state that this extra wakeup is bogus,
>> > > > I just have a gut feeling that it looks excessive.
>> > >
>> > > Note that I was talking about the wakeup done on ep->wq not ep->poll_wait.
>> > > The path that I'm concerned about is let's say that there are N events
>> > > queued on the ready list. A thread that was woken up in epoll_wait may
>> > > decide to only process say N/2 of then. Then it will call wakeup on ep->wq
>> > > and this will wakeup another thread to process the remaining N/2. Without
>> > > the wakeup, the original thread isn't going to process the events until
>> > > it finishes with the original N/2 and gets back to epoll_wait(). So I'm not
>> > > sure how important that path is but I wanted to at least note the change
>> > > here would impact that behavior.
>> > >
>> > > Thanks,
>> > >
>> > > -Jason
>> > >
>> > >
>> > > >
>> > > >> So I think removing all that even for the
>> > > >> depth 0 case is going to change some behavior here. So perhaps, it
>> > > >> should be removed for all depths except for 0? And if so, it may be
>> > > >> better to make 2 patches here to separate these changes.
>> > > >>
>> > > >> For the nested wakeups, I agree that the extra wakeups seem unnecessary
>> > > >> and it may make sense to remove them for all depths. I don't think the
>> > > >> nested epoll semantics are particularly well spelled out, and afaict,
>> > > >> nested epoll() has behaved this way for quite some time. And the current
>> > > >> behavior is not bad in the way that a missing wakeup or false negative
>> > > >> would be.
>> > > >
>> > > > That's 100% true! For edge mode extra wake up is not a bug, not optimal
>> > > > for userspace - yes, but that can't lead to any lost wakeups.
>> > > >
>> > > > --
>> > > > Roman
>> > > >
>> >
>> > I tried to remove the whole chunk of code that Roman said, and it
>> > seems that there
>> > are no obvious problems with the two test programs below:
> 
> I recall this message, the test case 9/25/26 of epoll-wakeup (on
> github) are failed while
> the whole chunk are removed.
> 
> Apply the original patch, all tests passed.


These are failing on my bare 5.2.0-rc2

TEST  bin/epoll31       FAIL
TEST  bin/epoll46       FAIL
TEST  bin/epoll50       FAIL
TEST  bin/epoll32       FAIL
TEST  bin/epoll19       FAIL
TEST  bin/epoll27       FAIL
TEST  bin/epoll42       FAIL
TEST  bin/epoll34       FAIL
TEST  bin/epoll48       FAIL
TEST  bin/epoll40       FAIL
TEST  bin/epoll20       FAIL
TEST  bin/epoll28       FAIL
TEST  bin/epoll38       FAIL
TEST  bin/epoll52       FAIL
TEST  bin/epoll24       FAIL
TEST  bin/epoll23       FAIL


These are failing if your patch is applied:
(my 5.2.0-rc2 is old? broken?)

TEST  bin/epoll46       FAIL
TEST  bin/epoll42       FAIL
TEST  bin/epoll34       FAIL
TEST  bin/epoll48       FAIL
TEST  bin/epoll40       FAIL
TEST  bin/epoll44       FAIL
TEST  bin/epoll38       FAIL

These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called,
but wakeup(&ep->wq); is still invoked:

TEST  bin/epoll46       FAIL
TEST  bin/epoll42       FAIL
TEST  bin/epoll34       FAIL
TEST  bin/epoll40       FAIL
TEST  bin/epoll44       FAIL
TEST  bin/epoll38       FAIL

So at least 48 has been "fixed".

These are failing if the whole chunk is removed, like your
said 9,25,26 are among which do not pass:

TEST  bin/epoll26       FAIL
TEST  bin/epoll42       FAIL
TEST  bin/epoll34       FAIL
TEST  bin/epoll9        FAIL
TEST  bin/epoll48       FAIL
TEST  bin/epoll40       FAIL
TEST  bin/epoll25       FAIL
TEST  bin/epoll44       FAIL
TEST  bin/epoll38       FAIL

This can be a good test suite, probably can be added to kselftests?

--
Roman


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-05 17:27               ` Roman Penyaev
@ 2019-09-05 17:44                 ` Jason Baron
  2019-09-11  8:19                   ` Heiher
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Baron @ 2019-09-05 17:44 UTC (permalink / raw)
  To: Roman Penyaev, Heiher
  Cc: linux-fsdevel, Eric Wong, Al Viro, Andrew Morton, Davide Libenzi,
	Davidlohr Bueso, Dominik Brodowski, Linus Torvalds,
	Sridhar Samudrala, linux-kernel



On 9/5/19 1:27 PM, Roman Penyaev wrote:
> On 2019-09-05 11:56, Heiher wrote:
>> Hi,
>>
>> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote:
>>>
>>> Hi,
>>>
>>> I created an epoll wakeup test project, listed some possible cases,
>>> and any other corner cases needs to be added?
>>>
>>> https://github.com/heiher/epoll-wakeup/blob/master/README.md
>>>
>>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote:
>>> >
>>> > Hi,
>>> >
>>> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
>>> > >
>>> > >
>>> > >
>>> > > On 9/4/19 5:57 AM, Roman Penyaev wrote:
>>> > > > On 2019-09-03 23:08, Jason Baron wrote:
>>> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>>> > > >>> Hi,
>>> > > >>>
>>> > > >>> This is indeed a bug. (quick side note: could you please
>>> remove efd[1]
>>> > > >>> from your test, because it is not related to the reproduction
>>> of a
>>> > > >>> current bug).
>>> > > >>>
>>> > > >>> Your patch lacks a good description, what exactly you've
>>> fixed.  Let
>>> > > >>> me speak out loud and please correct me if I'm wrong, my
>>> understanding
>>> > > >>> of epoll internals has become a bit rusty: when epoll fds are
>>> nested
>>> > > >>> an attempt to harvest events (ep_scan_ready_list() call)
>>> produces a
>>> > > >>> second (repeated) event from an internal fd up to an external
>>> fd:
>>> > > >>>
>>> > > >>>      epoll_wait(efd[0], ...):
>>> > > >>>        ep_send_events():
>>> > > >>>           ep_scan_ready_list(depth=0):
>>> > > >>>             ep_send_events_proc():
>>> > > >>>                 ep_item_poll():
>>> > > >>>                   ep_scan_ready_list(depth=1):
>>> > > >>>                     ep_poll_safewake():
>>> > > >>>                       ep_poll_callback()
>>> > > >>>                         list_add_tail(&epi, &epi->rdllist);
>>> > > >>>                         ^^^^^^
>>> > > >>>                         repeated event
>>> > > >>>
>>> > > >>>
>>> > > >>> In your patch you forbid wakeup for the cases, where depth !=
>>> 0, i.e.
>>> > > >>> for all nested cases. That seems clear.  But what if we can
>>> go further
>>> > > >>> and remove the whole chunk, which seems excessive:
>>> > > >>>
>>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
>>> > > >>> eventpoll *ep,
>>> > > >>>
>>> > > >>> -
>>> > > >>> -       if (!list_empty(&ep->rdllist)) {
>>> > > >>> -               /*
>>> > > >>> -                * Wake up (if active) both the eventpoll
>>> wait list and
>>> > > >>> -                * the ->poll() wait list (delayed after we
>>> release the
>>> > > >>> lock).
>>> > > >>> -                */
>>> > > >>> -               if (waitqueue_active(&ep->wq))
>>> > > >>> -                       wake_up(&ep->wq);
>>> > > >>> -               if (waitqueue_active(&ep->poll_wait))
>>> > > >>> -                       pwake++;
>>> > > >>> -       }
>>> > > >>>         write_unlock_irq(&ep->lock);
>>> > > >>>
>>> > > >>>         if (!ep_locked)
>>> > > >>>                 mutex_unlock(&ep->mtx);
>>> > > >>>
>>> > > >>> -       /* We have to call this outside the lock */
>>> > > >>> -       if (pwake)
>>> > > >>> -               ep_poll_safewake(&ep->poll_wait);
>>> > > >>>
>>> > > >>>
>>> > > >>> I reason like that: by the time we've reached the point of
>>> scanning events
>>> > > >>> for readiness all wakeups from ep_poll_callback have been
>>> already fired and
>>> > > >>> new events have been already accounted in ready list
>>> (ep_poll_callback()
>>> > > >>> calls
>>> > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100%
>>> sure and probably
>>> > > >>> missing some corner cases.
>>> > > >>>
>>> > > >>> Thoughts?
>>> > > >>
>>> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up
>>> other
>>> > > >> threads that may be in waiting in epoll_wait(). For example,
>>> there may
>>> > > >> be multiple threads doing epoll_wait() on the same epoll fd,
>>> and the
>>> > > >> logic above seems to say thread 1 may have processed say N
>>> events and
>>> > > >> now its going to to go off to work those, so let's wake up
>>> thread 2 now
>>> > > >> to handle the next chunk.
>>> > > >
>>> > > > Not quite. Thread which calls ep_scan_ready_list() processes
>>> all the
>>> > > > events, and while processing those, removes them one by one
>>> from the
>>> > > > ready list.  But if event mask is !0 and event belongs to
>>> > > > Level Triggered Mode descriptor (let's say default mode) it
>>> tails event
>>> > > > again back to the list (because we are in level mode, so event
>>> should
>>> > > > be there).  So at the end of this traversing loop ready list is
>>> likely
>>> > > > not empty, and if so, wake up again is called for nested epoll
>>> fds.
>>> > > > But, those nested epoll fds should get already all the
>>> notifications
>>> > > > from the main event callback ep_poll_callback(), regardless any
>>> thread
>>> > > > which traverses events.
>>> > > >
>>> > > > I suppose this logic exists for decades, when Davide (the
>>> author) was
>>> > > > reshuffling the code here and there.
>>> > > >
>>> > > > But I do not feel confidence to state that this extra wakeup is
>>> bogus,
>>> > > > I just have a gut feeling that it looks excessive.
>>> > >
>>> > > Note that I was talking about the wakeup done on ep->wq not
>>> ep->poll_wait.
>>> > > The path that I'm concerned about is let's say that there are N
>>> events
>>> > > queued on the ready list. A thread that was woken up in
>>> epoll_wait may
>>> > > decide to only process say N/2 of then. Then it will call wakeup
>>> on ep->wq
>>> > > and this will wakeup another thread to process the remaining N/2.
>>> Without
>>> > > the wakeup, the original thread isn't going to process the events
>>> until
>>> > > it finishes with the original N/2 and gets back to epoll_wait().
>>> So I'm not
>>> > > sure how important that path is but I wanted to at least note the
>>> change
>>> > > here would impact that behavior.
>>> > >
>>> > > Thanks,
>>> > >
>>> > > -Jason
>>> > >
>>> > >
>>> > > >
>>> > > >> So I think removing all that even for the
>>> > > >> depth 0 case is going to change some behavior here. So
>>> perhaps, it
>>> > > >> should be removed for all depths except for 0? And if so, it
>>> may be
>>> > > >> better to make 2 patches here to separate these changes.
>>> > > >>
>>> > > >> For the nested wakeups, I agree that the extra wakeups seem
>>> unnecessary
>>> > > >> and it may make sense to remove them for all depths. I don't
>>> think the
>>> > > >> nested epoll semantics are particularly well spelled out, and
>>> afaict,
>>> > > >> nested epoll() has behaved this way for quite some time. And
>>> the current
>>> > > >> behavior is not bad in the way that a missing wakeup or false
>>> negative
>>> > > >> would be.
>>> > > >
>>> > > > That's 100% true! For edge mode extra wake up is not a bug, not
>>> optimal
>>> > > > for userspace - yes, but that can't lead to any lost wakeups.
>>> > > >
>>> > > > --
>>> > > > Roman
>>> > > >
>>> >
>>> > I tried to remove the whole chunk of code that Roman said, and it
>>> > seems that there
>>> > are no obvious problems with the two test programs below:
>>
>> I recall this message, the test case 9/25/26 of epoll-wakeup (on
>> github) are failed while
>> the whole chunk are removed.
>>
>> Apply the original patch, all tests passed.
> 
> 
> These are failing on my bare 5.2.0-rc2
> 
> TEST  bin/epoll31       FAIL
> TEST  bin/epoll46       FAIL
> TEST  bin/epoll50       FAIL
> TEST  bin/epoll32       FAIL
> TEST  bin/epoll19       FAIL
> TEST  bin/epoll27       FAIL
> TEST  bin/epoll42       FAIL
> TEST  bin/epoll34       FAIL
> TEST  bin/epoll48       FAIL
> TEST  bin/epoll40       FAIL
> TEST  bin/epoll20       FAIL
> TEST  bin/epoll28       FAIL
> TEST  bin/epoll38       FAIL
> TEST  bin/epoll52       FAIL
> TEST  bin/epoll24       FAIL
> TEST  bin/epoll23       FAIL
> 
> 
> These are failing if your patch is applied:
> (my 5.2.0-rc2 is old? broken?)
> 
> TEST  bin/epoll46       FAIL
> TEST  bin/epoll42       FAIL
> TEST  bin/epoll34       FAIL
> TEST  bin/epoll48       FAIL
> TEST  bin/epoll40       FAIL
> TEST  bin/epoll44       FAIL
> TEST  bin/epoll38       FAIL
> 
> These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called,
> but wakeup(&ep->wq); is still invoked:
> 
> TEST  bin/epoll46       FAIL
> TEST  bin/epoll42       FAIL
> TEST  bin/epoll34       FAIL
> TEST  bin/epoll40       FAIL
> TEST  bin/epoll44       FAIL
> TEST  bin/epoll38       FAIL
> 
> So at least 48 has been "fixed".
> 
> These are failing if the whole chunk is removed, like your
> said 9,25,26 are among which do not pass:
> 
> TEST  bin/epoll26       FAIL
> TEST  bin/epoll42       FAIL
> TEST  bin/epoll34       FAIL
> TEST  bin/epoll9        FAIL
> TEST  bin/epoll48       FAIL
> TEST  bin/epoll40       FAIL
> TEST  bin/epoll25       FAIL
> TEST  bin/epoll44       FAIL
> TEST  bin/epoll38       FAIL
> 
> This can be a good test suite, probably can be added to kselftests?
> 
> -- 
> Roman
> 


Indeed, I just tried the same test suite and I am seeing similar
failures - it looks like its a bit timing dependent. It looks like all
the failures are caused by a similar issue. For example, take epoll34:

         t0   t1
     (ew) |    | (ew)
         e0    |
      (lt) \  /
             |
            e1
             | (et)
            s0


The test is trying to assert that an epoll_wait() on e1 and and
epoll_wait() on e0 both return 1 event for EPOLLIN. However, the
epoll_wait on e1 is done in a thread and it can happen before or after
the epoll_wait() is called against e0. If the epoll_wait() on e1 happens
first then because its attached as 'et', it consumes the event. So that
there is no longer an event reported at e0. I think that is reasonable
semantics here. However if the wait on e0 happens after the wait on e1
then the test will pass as both waits will see the event. Thus, a patch
like this will 'fix' this testcase:

--- a/src/epoll34.c
+++ b/src/epoll34.c
@@ -59,15 +59,15 @@ int main(int argc, char *argv[])
        if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
                goto out;

-       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
-               goto out;
-
        if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
                goto out;

        if (epoll_wait(efd[0], &e, 1, 500) == 1)
                count++;

+       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
+               goto out;
+
        if (pthread_join(tw, NULL) < 0)
                goto out;


I found all the other failures to be of similar origin. I suspect Heiher
didn't see failures due to the thread timings here.

I also found that all the testcases pass if we leave the wakeup(&ep->wq)
call for the depth 0 case (and remove the pwake part).

Thanks,

-Jason





^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-05 17:44                 ` Jason Baron
@ 2019-09-11  8:19                   ` Heiher
  2019-09-12 21:36                     ` Jason Baron
  0 siblings, 1 reply; 12+ messages in thread
From: Heiher @ 2019-09-11  8:19 UTC (permalink / raw)
  To: Jason Baron
  Cc: Roman Penyaev, linux-fsdevel, Eric Wong, Al Viro, Andrew Morton,
	Davide Libenzi, Davidlohr Bueso, Dominik Brodowski,
	Linus Torvalds, Sridhar Samudrala, linux-kernel

Hi,

On Fri, Sep 6, 2019 at 1:48 AM Jason Baron <jbaron@akamai.com> wrote:
>
>
>
> On 9/5/19 1:27 PM, Roman Penyaev wrote:
> > On 2019-09-05 11:56, Heiher wrote:
> >> Hi,
> >>
> >> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I created an epoll wakeup test project, listed some possible cases,
> >>> and any other corner cases needs to be added?
> >>>
> >>> https://github.com/heiher/epoll-wakeup/blob/master/README.md
> >>>
> >>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
> >>> > >
> >>> > >
> >>> > >
> >>> > > On 9/4/19 5:57 AM, Roman Penyaev wrote:
> >>> > > > On 2019-09-03 23:08, Jason Baron wrote:
> >>> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote:
> >>> > > >>> Hi,
> >>> > > >>>
> >>> > > >>> This is indeed a bug. (quick side note: could you please
> >>> remove efd[1]
> >>> > > >>> from your test, because it is not related to the reproduction
> >>> of a
> >>> > > >>> current bug).
> >>> > > >>>
> >>> > > >>> Your patch lacks a good description, what exactly you've
> >>> fixed.  Let
> >>> > > >>> me speak out loud and please correct me if I'm wrong, my
> >>> understanding
> >>> > > >>> of epoll internals has become a bit rusty: when epoll fds are
> >>> nested
> >>> > > >>> an attempt to harvest events (ep_scan_ready_list() call)
> >>> produces a
> >>> > > >>> second (repeated) event from an internal fd up to an external
> >>> fd:
> >>> > > >>>
> >>> > > >>>      epoll_wait(efd[0], ...):
> >>> > > >>>        ep_send_events():
> >>> > > >>>           ep_scan_ready_list(depth=0):
> >>> > > >>>             ep_send_events_proc():
> >>> > > >>>                 ep_item_poll():
> >>> > > >>>                   ep_scan_ready_list(depth=1):
> >>> > > >>>                     ep_poll_safewake():
> >>> > > >>>                       ep_poll_callback()
> >>> > > >>>                         list_add_tail(&epi, &epi->rdllist);
> >>> > > >>>                         ^^^^^^
> >>> > > >>>                         repeated event
> >>> > > >>>
> >>> > > >>>
> >>> > > >>> In your patch you forbid wakeup for the cases, where depth !=
> >>> 0, i.e.
> >>> > > >>> for all nested cases. That seems clear.  But what if we can
> >>> go further
> >>> > > >>> and remove the whole chunk, which seems excessive:
> >>> > > >>>
> >>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
> >>> > > >>> eventpoll *ep,
> >>> > > >>>
> >>> > > >>> -
> >>> > > >>> -       if (!list_empty(&ep->rdllist)) {
> >>> > > >>> -               /*
> >>> > > >>> -                * Wake up (if active) both the eventpoll
> >>> wait list and
> >>> > > >>> -                * the ->poll() wait list (delayed after we
> >>> release the
> >>> > > >>> lock).
> >>> > > >>> -                */
> >>> > > >>> -               if (waitqueue_active(&ep->wq))
> >>> > > >>> -                       wake_up(&ep->wq);
> >>> > > >>> -               if (waitqueue_active(&ep->poll_wait))
> >>> > > >>> -                       pwake++;
> >>> > > >>> -       }
> >>> > > >>>         write_unlock_irq(&ep->lock);
> >>> > > >>>
> >>> > > >>>         if (!ep_locked)
> >>> > > >>>                 mutex_unlock(&ep->mtx);
> >>> > > >>>
> >>> > > >>> -       /* We have to call this outside the lock */
> >>> > > >>> -       if (pwake)
> >>> > > >>> -               ep_poll_safewake(&ep->poll_wait);
> >>> > > >>>
> >>> > > >>>
> >>> > > >>> I reason like that: by the time we've reached the point of
> >>> scanning events
> >>> > > >>> for readiness all wakeups from ep_poll_callback have been
> >>> already fired and
> >>> > > >>> new events have been already accounted in ready list
> >>> (ep_poll_callback()
> >>> > > >>> calls
> >>> > > >>> the same ep_poll_safewake()). Here, frankly, I'm not 100%
> >>> sure and probably
> >>> > > >>> missing some corner cases.
> >>> > > >>>
> >>> > > >>> Thoughts?
> >>> > > >>
> >>> > > >> So the: 'wake_up(&ep->wq);' part, I think is about waking up
> >>> other
> >>> > > >> threads that may be in waiting in epoll_wait(). For example,
> >>> there may
> >>> > > >> be multiple threads doing epoll_wait() on the same epoll fd,
> >>> and the
> >>> > > >> logic above seems to say thread 1 may have processed say N
> >>> events and
> >>> > > >> now its going to to go off to work those, so let's wake up
> >>> thread 2 now
> >>> > > >> to handle the next chunk.
> >>> > > >
> >>> > > > Not quite. Thread which calls ep_scan_ready_list() processes
> >>> all the
> >>> > > > events, and while processing those, removes them one by one
> >>> from the
> >>> > > > ready list.  But if event mask is !0 and event belongs to
> >>> > > > Level Triggered Mode descriptor (let's say default mode) it
> >>> tails event
> >>> > > > again back to the list (because we are in level mode, so event
> >>> should
> >>> > > > be there).  So at the end of this traversing loop ready list is
> >>> likely
> >>> > > > not empty, and if so, wake up again is called for nested epoll
> >>> fds.
> >>> > > > But, those nested epoll fds should get already all the
> >>> notifications
> >>> > > > from the main event callback ep_poll_callback(), regardless any
> >>> thread
> >>> > > > which traverses events.
> >>> > > >
> >>> > > > I suppose this logic exists for decades, when Davide (the
> >>> author) was
> >>> > > > reshuffling the code here and there.
> >>> > > >
> >>> > > > But I do not feel confidence to state that this extra wakeup is
> >>> bogus,
> >>> > > > I just have a gut feeling that it looks excessive.
> >>> > >
> >>> > > Note that I was talking about the wakeup done on ep->wq not
> >>> ep->poll_wait.
> >>> > > The path that I'm concerned about is let's say that there are N
> >>> events
> >>> > > queued on the ready list. A thread that was woken up in
> >>> epoll_wait may
> >>> > > decide to only process say N/2 of then. Then it will call wakeup
> >>> on ep->wq
> >>> > > and this will wakeup another thread to process the remaining N/2.
> >>> Without
> >>> > > the wakeup, the original thread isn't going to process the events
> >>> until
> >>> > > it finishes with the original N/2 and gets back to epoll_wait().
> >>> So I'm not
> >>> > > sure how important that path is but I wanted to at least note the
> >>> change
> >>> > > here would impact that behavior.
> >>> > >
> >>> > > Thanks,
> >>> > >
> >>> > > -Jason
> >>> > >
> >>> > >
> >>> > > >
> >>> > > >> So I think removing all that even for the
> >>> > > >> depth 0 case is going to change some behavior here. So
> >>> perhaps, it
> >>> > > >> should be removed for all depths except for 0? And if so, it
> >>> may be
> >>> > > >> better to make 2 patches here to separate these changes.
> >>> > > >>
> >>> > > >> For the nested wakeups, I agree that the extra wakeups seem
> >>> unnecessary
> >>> > > >> and it may make sense to remove them for all depths. I don't
> >>> think the
> >>> > > >> nested epoll semantics are particularly well spelled out, and
> >>> afaict,
> >>> > > >> nested epoll() has behaved this way for quite some time. And
> >>> the current
> >>> > > >> behavior is not bad in the way that a missing wakeup or false
> >>> negative
> >>> > > >> would be.
> >>> > > >
> >>> > > > That's 100% true! For edge mode extra wake up is not a bug, not
> >>> optimal
> >>> > > > for userspace - yes, but that can't lead to any lost wakeups.
> >>> > > >
> >>> > > > --
> >>> > > > Roman
> >>> > > >
> >>> >
> >>> > I tried to remove the whole chunk of code that Roman said, and it
> >>> > seems that there
> >>> > are no obvious problems with the two test programs below:
> >>
> >> I recall this message, the test case 9/25/26 of epoll-wakeup (on
> >> github) are failed while
> >> the whole chunk are removed.
> >>
> >> Apply the original patch, all tests passed.
> >
> >
> > These are failing on my bare 5.2.0-rc2
> >
> > TEST  bin/epoll31       FAIL
> > TEST  bin/epoll46       FAIL
> > TEST  bin/epoll50       FAIL
> > TEST  bin/epoll32       FAIL
> > TEST  bin/epoll19       FAIL
> > TEST  bin/epoll27       FAIL
> > TEST  bin/epoll42       FAIL
> > TEST  bin/epoll34       FAIL
> > TEST  bin/epoll48       FAIL
> > TEST  bin/epoll40       FAIL
> > TEST  bin/epoll20       FAIL
> > TEST  bin/epoll28       FAIL
> > TEST  bin/epoll38       FAIL
> > TEST  bin/epoll52       FAIL
> > TEST  bin/epoll24       FAIL
> > TEST  bin/epoll23       FAIL
> >
> >
> > These are failing if your patch is applied:
> > (my 5.2.0-rc2 is old? broken?)
> >
> > TEST  bin/epoll46       FAIL
> > TEST  bin/epoll42       FAIL
> > TEST  bin/epoll34       FAIL
> > TEST  bin/epoll48       FAIL
> > TEST  bin/epoll40       FAIL
> > TEST  bin/epoll44       FAIL
> > TEST  bin/epoll38       FAIL
> >
> > These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called,
> > but wakeup(&ep->wq); is still invoked:
> >
> > TEST  bin/epoll46       FAIL
> > TEST  bin/epoll42       FAIL
> > TEST  bin/epoll34       FAIL
> > TEST  bin/epoll40       FAIL
> > TEST  bin/epoll44       FAIL
> > TEST  bin/epoll38       FAIL
> >
> > So at least 48 has been "fixed".
> >
> > These are failing if the whole chunk is removed, like your
> > said 9,25,26 are among which do not pass:
> >
> > TEST  bin/epoll26       FAIL
> > TEST  bin/epoll42       FAIL
> > TEST  bin/epoll34       FAIL
> > TEST  bin/epoll9        FAIL
> > TEST  bin/epoll48       FAIL
> > TEST  bin/epoll40       FAIL
> > TEST  bin/epoll25       FAIL
> > TEST  bin/epoll44       FAIL
> > TEST  bin/epoll38       FAIL
> >
> > This can be a good test suite, probably can be added to kselftests?

Thank you, I have updated epoll-tests to fix these issues. I think this is good
news if we can added to kselftests. ;)

> >
> > --
> > Roman
> >
>
>
> Indeed, I just tried the same test suite and I am seeing similar
> failures - it looks like its a bit timing dependent. It looks like all
> the failures are caused by a similar issue. For example, take epoll34:
>
>          t0   t1
>      (ew) |    | (ew)
>          e0    |
>       (lt) \  /
>              |
>             e1
>              | (et)
>             s0
>
>
> The test is trying to assert that an epoll_wait() on e1 and and
> epoll_wait() on e0 both return 1 event for EPOLLIN. However, the
> epoll_wait on e1 is done in a thread and it can happen before or after
> the epoll_wait() is called against e0. If the epoll_wait() on e1 happens
> first then because its attached as 'et', it consumes the event. So that
> there is no longer an event reported at e0. I think that is reasonable
> semantics here. However if the wait on e0 happens after the wait on e1
> then the test will pass as both waits will see the event. Thus, a patch
> like this will 'fix' this testcase:
>
> --- a/src/epoll34.c
> +++ b/src/epoll34.c
> @@ -59,15 +59,15 @@ int main(int argc, char *argv[])
>         if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
>                 goto out;
>
> -       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
> -               goto out;
> -
>         if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
>                 goto out;
>
>         if (epoll_wait(efd[0], &e, 1, 500) == 1)
>                 count++;
>
> +       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
> +               goto out;
> +
>         if (pthread_join(tw, NULL) < 0)
>                 goto out;
>
>
> I found all the other failures to be of similar origin. I suspect Heiher
> didn't see failures due to the thread timings here.

Thank you. I also found a multi-threaded concurrent accumulation problem,
and that has been changed to atomic operations. I think we should allow two
different behaviors to be passed because they are all correctly.

thread 2:
if (epoll_wait(efd[1], &e, 1, 500) == 1)
    __sync_fetch_and_or(&count, 1);

thread1:
if (epoll_wait(efd[0], &e, 1, 500) == 1)
    __sync_fetch_and_or(&count, 2);

check:
if ((count != 1) && (count != 3))
    goto out;

>
> I also found that all the testcases pass if we leave the wakeup(&ep->wq)
> call for the depth 0 case (and remove the pwake part).

So, We need to keep the wakeup(&ep-wq) for all depth, and only
wakeup(&ep->poll_wait)
for depth 0 and/or ep->rdlist from empty to be not empty?

>
> Thanks,
>
> -Jason
>
>
>
>


-- 
Best regards!
Hev
https://hev.cc

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
  2019-09-11  8:19                   ` Heiher
@ 2019-09-12 21:36                     ` Jason Baron
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Baron @ 2019-09-12 21:36 UTC (permalink / raw)
  To: Heiher
  Cc: Roman Penyaev, linux-fsdevel, Eric Wong, Al Viro, Andrew Morton,
	Davide Libenzi, Davidlohr Bueso, Dominik Brodowski,
	Linus Torvalds, Sridhar Samudrala, linux-kernel



On 9/11/19 4:19 AM, Heiher wrote:
> Hi,
> 
> On Fri, Sep 6, 2019 at 1:48 AM Jason Baron <jbaron@akamai.com> wrote:
>>
>>
>>
>> On 9/5/19 1:27 PM, Roman Penyaev wrote:
>>> On 2019-09-05 11:56, Heiher wrote:
>>>> Hi,
>>>>
>>>> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@hev.cc> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I created an epoll wakeup test project, listed some possible cases,
>>>>> and any other corner cases needs to be added?
>>>>>
>>>>> https://github.com/heiher/epoll-wakeup/blob/master/README.md
>>>>>
>>>>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@hev.cc> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@akamai.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 9/4/19 5:57 AM, Roman Penyaev wrote:
>>>>>>>> On 2019-09-03 23:08, Jason Baron wrote:
>>>>>>>>> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This is indeed a bug. (quick side note: could you please
>>>>> remove efd[1]
>>>>>>>>>> from your test, because it is not related to the reproduction
>>>>> of a
>>>>>>>>>> current bug).
>>>>>>>>>>
>>>>>>>>>> Your patch lacks a good description, what exactly you've
>>>>> fixed.  Let
>>>>>>>>>> me speak out loud and please correct me if I'm wrong, my
>>>>> understanding
>>>>>>>>>> of epoll internals has become a bit rusty: when epoll fds are
>>>>> nested
>>>>>>>>>> an attempt to harvest events (ep_scan_ready_list() call)
>>>>> produces a
>>>>>>>>>> second (repeated) event from an internal fd up to an external
>>>>> fd:
>>>>>>>>>>
>>>>>>>>>>      epoll_wait(efd[0], ...):
>>>>>>>>>>        ep_send_events():
>>>>>>>>>>           ep_scan_ready_list(depth=0):
>>>>>>>>>>             ep_send_events_proc():
>>>>>>>>>>                 ep_item_poll():
>>>>>>>>>>                   ep_scan_ready_list(depth=1):
>>>>>>>>>>                     ep_poll_safewake():
>>>>>>>>>>                       ep_poll_callback()
>>>>>>>>>>                         list_add_tail(&epi, &epi->rdllist);
>>>>>>>>>>                         ^^^^^^
>>>>>>>>>>                         repeated event
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In your patch you forbid wakeup for the cases, where depth !=
>>>>> 0, i.e.
>>>>>>>>>> for all nested cases. That seems clear.  But what if we can
>>>>> go further
>>>>>>>>>> and remove the whole chunk, which seems excessive:
>>>>>>>>>>
>>>>>>>>>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct
>>>>>>>>>> eventpoll *ep,
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> -       if (!list_empty(&ep->rdllist)) {
>>>>>>>>>> -               /*
>>>>>>>>>> -                * Wake up (if active) both the eventpoll
>>>>> wait list and
>>>>>>>>>> -                * the ->poll() wait list (delayed after we
>>>>> release the
>>>>>>>>>> lock).
>>>>>>>>>> -                */
>>>>>>>>>> -               if (waitqueue_active(&ep->wq))
>>>>>>>>>> -                       wake_up(&ep->wq);
>>>>>>>>>> -               if (waitqueue_active(&ep->poll_wait))
>>>>>>>>>> -                       pwake++;
>>>>>>>>>> -       }
>>>>>>>>>>         write_unlock_irq(&ep->lock);
>>>>>>>>>>
>>>>>>>>>>         if (!ep_locked)
>>>>>>>>>>                 mutex_unlock(&ep->mtx);
>>>>>>>>>>
>>>>>>>>>> -       /* We have to call this outside the lock */
>>>>>>>>>> -       if (pwake)
>>>>>>>>>> -               ep_poll_safewake(&ep->poll_wait);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I reason like that: by the time we've reached the point of
>>>>> scanning events
>>>>>>>>>> for readiness all wakeups from ep_poll_callback have been
>>>>> already fired and
>>>>>>>>>> new events have been already accounted in ready list
>>>>> (ep_poll_callback()
>>>>>>>>>> calls
>>>>>>>>>> the same ep_poll_safewake()). Here, frankly, I'm not 100%
>>>>> sure and probably
>>>>>>>>>> missing some corner cases.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> So the: 'wake_up(&ep->wq);' part, I think is about waking up
>>>>> other
>>>>>>>>> threads that may be in waiting in epoll_wait(). For example,
>>>>> there may
>>>>>>>>> be multiple threads doing epoll_wait() on the same epoll fd,
>>>>> and the
>>>>>>>>> logic above seems to say thread 1 may have processed say N
>>>>> events and
>>>>>>>>> now its going to to go off to work those, so let's wake up
>>>>> thread 2 now
>>>>>>>>> to handle the next chunk.
>>>>>>>>
>>>>>>>> Not quite. Thread which calls ep_scan_ready_list() processes
>>>>> all the
>>>>>>>> events, and while processing those, removes them one by one
>>>>> from the
>>>>>>>> ready list.  But if event mask is !0 and event belongs to
>>>>>>>> Level Triggered Mode descriptor (let's say default mode) it
>>>>> tails event
>>>>>>>> again back to the list (because we are in level mode, so event
>>>>> should
>>>>>>>> be there).  So at the end of this traversing loop ready list is
>>>>> likely
>>>>>>>> not empty, and if so, wake up again is called for nested epoll
>>>>> fds.
>>>>>>>> But, those nested epoll fds should get already all the
>>>>> notifications
>>>>>>>> from the main event callback ep_poll_callback(), regardless any
>>>>> thread
>>>>>>>> which traverses events.
>>>>>>>>
>>>>>>>> I suppose this logic exists for decades, when Davide (the
>>>>> author) was
>>>>>>>> reshuffling the code here and there.
>>>>>>>>
>>>>>>>> But I do not feel confidence to state that this extra wakeup is
>>>>> bogus,
>>>>>>>> I just have a gut feeling that it looks excessive.
>>>>>>>
>>>>>>> Note that I was talking about the wakeup done on ep->wq not
>>>>> ep->poll_wait.
>>>>>>> The path that I'm concerned about is let's say that there are N
>>>>> events
>>>>>>> queued on the ready list. A thread that was woken up in
>>>>> epoll_wait may
>>>>>>> decide to only process say N/2 of then. Then it will call wakeup
>>>>> on ep->wq
>>>>>>> and this will wakeup another thread to process the remaining N/2.
>>>>> Without
>>>>>>> the wakeup, the original thread isn't going to process the events
>>>>> until
>>>>>>> it finishes with the original N/2 and gets back to epoll_wait().
>>>>> So I'm not
>>>>>>> sure how important that path is but I wanted to at least note the
>>>>> change
>>>>>>> here would impact that behavior.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -Jason
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> So I think removing all that even for the
>>>>>>>>> depth 0 case is going to change some behavior here. So
>>>>> perhaps, it
>>>>>>>>> should be removed for all depths except for 0? And if so, it
>>>>> may be
>>>>>>>>> better to make 2 patches here to separate these changes.
>>>>>>>>>
>>>>>>>>> For the nested wakeups, I agree that the extra wakeups seem
>>>>> unnecessary
>>>>>>>>> and it may make sense to remove them for all depths. I don't
>>>>> think the
>>>>>>>>> nested epoll semantics are particularly well spelled out, and
>>>>> afaict,
>>>>>>>>> nested epoll() has behaved this way for quite some time. And
>>>>> the current
>>>>>>>>> behavior is not bad in the way that a missing wakeup or false
>>>>> negative
>>>>>>>>> would be.
>>>>>>>>
>>>>>>>> That's 100% true! For edge mode extra wake up is not a bug, not
>>>>> optimal
>>>>>>>> for userspace - yes, but that can't lead to any lost wakeups.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Roman
>>>>>>>>
>>>>>>
>>>>>> I tried to remove the whole chunk of code that Roman said, and it
>>>>>> seems that there
>>>>>> are no obvious problems with the two test programs below:
>>>>
>>>> I recall this message, the test case 9/25/26 of epoll-wakeup (on
>>>> github) are failed while
>>>> the whole chunk are removed.
>>>>
>>>> Apply the original patch, all tests passed.
>>>
>>>
>>> These are failing on my bare 5.2.0-rc2
>>>
>>> TEST  bin/epoll31       FAIL
>>> TEST  bin/epoll46       FAIL
>>> TEST  bin/epoll50       FAIL
>>> TEST  bin/epoll32       FAIL
>>> TEST  bin/epoll19       FAIL
>>> TEST  bin/epoll27       FAIL
>>> TEST  bin/epoll42       FAIL
>>> TEST  bin/epoll34       FAIL
>>> TEST  bin/epoll48       FAIL
>>> TEST  bin/epoll40       FAIL
>>> TEST  bin/epoll20       FAIL
>>> TEST  bin/epoll28       FAIL
>>> TEST  bin/epoll38       FAIL
>>> TEST  bin/epoll52       FAIL
>>> TEST  bin/epoll24       FAIL
>>> TEST  bin/epoll23       FAIL
>>>
>>>
>>> These are failing if your patch is applied:
>>> (my 5.2.0-rc2 is old? broken?)
>>>
>>> TEST  bin/epoll46       FAIL
>>> TEST  bin/epoll42       FAIL
>>> TEST  bin/epoll34       FAIL
>>> TEST  bin/epoll48       FAIL
>>> TEST  bin/epoll40       FAIL
>>> TEST  bin/epoll44       FAIL
>>> TEST  bin/epoll38       FAIL
>>>
>>> These are failing if "ep_poll_safewake(&ep->poll_wait)" is not called,
>>> but wakeup(&ep->wq); is still invoked:
>>>
>>> TEST  bin/epoll46       FAIL
>>> TEST  bin/epoll42       FAIL
>>> TEST  bin/epoll34       FAIL
>>> TEST  bin/epoll40       FAIL
>>> TEST  bin/epoll44       FAIL
>>> TEST  bin/epoll38       FAIL
>>>
>>> So at least 48 has been "fixed".
>>>
>>> These are failing if the whole chunk is removed, like your
>>> said 9,25,26 are among which do not pass:
>>>
>>> TEST  bin/epoll26       FAIL
>>> TEST  bin/epoll42       FAIL
>>> TEST  bin/epoll34       FAIL
>>> TEST  bin/epoll9        FAIL
>>> TEST  bin/epoll48       FAIL
>>> TEST  bin/epoll40       FAIL
>>> TEST  bin/epoll25       FAIL
>>> TEST  bin/epoll44       FAIL
>>> TEST  bin/epoll38       FAIL
>>>
>>> This can be a good test suite, probably can be added to kselftests?
> 
> Thank you, I have updated epoll-tests to fix these issues. I think this is good
> news if we can added to kselftests. ;)
> 
>>>
>>> --
>>> Roman
>>>
>>
>>
>> Indeed, I just tried the same test suite and I am seeing similar
>> failures - it looks like its a bit timing dependent. It looks like all
>> the failures are caused by a similar issue. For example, take epoll34:
>>
>>          t0   t1
>>      (ew) |    | (ew)
>>          e0    |
>>       (lt) \  /
>>              |
>>             e1
>>              | (et)
>>             s0
>>
>>
>> The test is trying to assert that an epoll_wait() on e1 and and
>> epoll_wait() on e0 both return 1 event for EPOLLIN. However, the
>> epoll_wait on e1 is done in a thread and it can happen before or after
>> the epoll_wait() is called against e0. If the epoll_wait() on e1 happens
>> first then because its attached as 'et', it consumes the event. So that
>> there is no longer an event reported at e0. I think that is reasonable
>> semantics here. However if the wait on e0 happens after the wait on e1
>> then the test will pass as both waits will see the event. Thus, a patch
>> like this will 'fix' this testcase:
>>
>> --- a/src/epoll34.c
>> +++ b/src/epoll34.c
>> @@ -59,15 +59,15 @@ int main(int argc, char *argv[])
>>         if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
>>                 goto out;
>>
>> -       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
>> -               goto out;
>> -
>>         if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
>>                 goto out;
>>
>>         if (epoll_wait(efd[0], &e, 1, 500) == 1)
>>                 count++;
>>
>> +       if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
>> +               goto out;
>> +
>>         if (pthread_join(tw, NULL) < 0)
>>                 goto out;
>>
>>
>> I found all the other failures to be of similar origin. I suspect Heiher
>> didn't see failures due to the thread timings here.
> 
> Thank you. I also found a multi-threaded concurrent accumulation problem,
> and that has been changed to atomic operations. I think we should allow two
> different behaviors to be passed because they are all correctly.
> 
> thread 2:
> if (epoll_wait(efd[1], &e, 1, 500) == 1)
>     __sync_fetch_and_or(&count, 1);
> 
> thread1:
> if (epoll_wait(efd[0], &e, 1, 500) == 1)
>     __sync_fetch_and_or(&count, 2);
> 
> check:
> if ((count != 1) && (count != 3))
>     goto out;
> 
>>
>> I also found that all the testcases pass if we leave the wakeup(&ep->wq)
>> call for the depth 0 case (and remove the pwake part).
> 
> So, We need to keep the wakeup(&ep-wq) for all depth, and only
> wakeup(&ep->poll_wait)
> for depth 0 and/or ep->rdlist from empty to be not empty?
> 

Yes, so my thinking is yes, leave the wakeup(&ep->wq) for all depths. It
may not strictly be necessary for depths > 0 but I do have some concerns
about it and doesn't affect this specific case. So I would leave it for
all depths and if you want to remove it for depths > 0, its a separate
patch.

I'm now not sure its really safe to remove the wakeup(&ep->poll_wait)
either. Take the case where we have:

       (et)  (lt)
t0---e0---e1----s0


Let's say there is thread t1 also doing epoll_wait() on e1. Now, let's
say s0 fires an event and wakes up t1. While t1 is reaping events from
t1 its calling ep_scan_ready_list() and while its processing the events
it drops the ep->lock and more events get queued to the overflow list.
Now, let's say t0 wakes up from epoll_wait() it will not see any of the
events queued to e1 yet (since they are not on the readylist yet). Thus,
t0 goes back to sleep and I think can miss an event here. So I think
that's what the wakeup(&ep->poll_wait) is about.

What I think we can do to fix the original case, is that perhaps in that
case where events get queued to the overflow list we do in fact do the
wakeup(&ep->poll_wait). So perhaps, we can just have variable that
checks if there is anything in the overflow list, and if so do the
wakeup(&ep->poll_wait) conditional on there being things on the overflow
list. Although I can't say I tried to see if this would work.

That said, I think ep_modify() might also subtlety be dependent on the
current behavior, in that ep_item_poll() uses ep_scan_ready_list() and
when an event is modified it may need to wakeup nested epoll fds to
inform them of events. Thus, perhaps, we would also need an explicit
call to wakeup(&ep->poll_wait) when ep_item_poll() returns events.

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-09-12 21:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-02  5:20 [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll hev
2019-09-02 15:36 ` Roman Penyaev
2019-09-03 21:08   ` Jason Baron
2019-09-04  9:57     ` Roman Penyaev
2019-09-04 12:02       ` Jason Baron
2019-09-04 14:02         ` Heiher
2019-09-05  2:53           ` Heiher
2019-09-05  9:56             ` Heiher
2019-09-05 17:27               ` Roman Penyaev
2019-09-05 17:44                 ` Jason Baron
2019-09-11  8:19                   ` Heiher
2019-09-12 21:36                     ` Jason Baron

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.