linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
@ 2020-10-12 18:39 Michael Kerrisk (man-pages)
  2020-10-12 19:25 ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-12 18:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Rasmus Villemoes, Greg Kroah-Hartman,
	Peter Zijlstra, Nicolas Dichtel, Ian Kent, Christian Brauner,
	keyrings, linux-fsdevel, Linux API, lkml, Michael Kerrisk

Hello Linus,

Between Linux 5.4 and 5.5 a regression was introduced in the operation
of the epoll EPOLLET flag. From some manual bisecting, the regression
appears to have been introduced in

         commit 1b6b26ae7053e4914181eedf70f2d92c12abda8a
         Author: Linus Torvalds <torvalds@linux-foundation.org>
         Date:   Sat Dec 7 12:14:28 2019 -0800

             pipe: fix and clarify pipe write wakeup logic

(I also built a kernel from the  immediate preceding commit, and did
not observe the regression.)

The aim of ET (edge-triggered) notification is that epoll_wait() will
tell us a file descriptor is ready only if there has been new activity
on the FD since we were last informed about the FD. So, in the
following scenario where the read end of a pipe is being monitored
with EPOLLET, we see:

[Write a byte to write end of pipe]
1. Call epoll_wait() ==> tells us pipe read end is ready
2. Call epoll_wait() [again] ==> does not tell us that the read end of
pipe is ready

    (By contrast, in step 2, level-triggered notification would tell
    us the read end of the pipe is read.)

If we go further:

[Write another byte to write end of pipe]
3. Call epoll_wait() ==> tells us pipe read end is ready

The above was true until the regression. Now, step 3 does not tell us
that the pipe read end is ready, even though there is NEW input
available on the pipe. (In the analogous situation for sockets and
terminals, step 3 does (still) correctly tell us that the FD is
ready.)

I've appended a test program below. The following are the results on
kernel 5.4.0:

        $ ./pipe_epollet_test
        Writing a byte to pipe()
            1: OK:   ret = 1, events = [ EPOLLIN ]
            2: OK:   ret = 0
        Writing a byte to pipe()
            3: OK:   ret = 1, events = [ EPOLLIN ]
        Closing write end of pipe()
            4: OK:   ret = 1, events = [ EPOLLIN EPOLLHUP ]

On current kernels, the results are as follows:

        $ ./pipe_epollet_test
        Writing a byte to pipe()
            1: OK:   ret = 1, events = [ EPOLLIN ]
            2: OK:   ret = 0
        Writing a byte to pipe()
            3: FAIL: ret = 0; EXPECTED: ret = 1, events = [ EPOLLIN ]
        Closing write end of pipe()
            4: OK:   ret = 1, events = [ EPOLLIN EPOLLHUP ]

Thanks,

Michael

=====

/* pipe_epollet_test.c

   Copyright (c) 2020, Michael Kerrisk <mtk.manpages@gmail.com>

   Licensed under GNU GPLv2 or later.
*/
#include <sys/epoll.h>
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static void
printMask(int events)
{
    printf(" [ %s%s]",
                (events & EPOLLIN)  ? "EPOLLIN "  : "",
                (events & EPOLLHUP) ? "EPOLLHUP " : "");
}

static void
doEpollWait(int epfd, int timeout, int expectedRetval, int expectedEvents)
{
    struct epoll_event ev;
    static int callNum = 0;

    int retval = epoll_wait(epfd, &ev, 1, timeout);
    if (retval == -1) {
        perror("epoll_wait");
        return;
    }

    /* The test succeeded if (1) we got the expected return value and
       (2) when the return value was 1, we got the expected events mask */

    bool succeeded = retval == expectedRetval &&
            (expectedRetval == 0 || expectedEvents == ev.events);

    callNum++;
    printf("    %d: ", callNum);

    if (succeeded)
        printf("OK:   ");
    else
        printf("FAIL: ");

    printf("ret = %d", retval);

    if (retval == 1) {
        printf(", events =");
        printMask(ev.events);
    }

    if (!succeeded) {
        printf("; EXPECTED: ret = %d", expectedRetval);
        if (expectedRetval == 1) {
            printf(", events =");
            printMask(expectedEvents);
        }
    }
    printf("\n");
}

int
main(int argc, char *argv[])
{
    int epfd;
    int pfd[2];

    epfd = epoll_create(1);
    if (epfd == -1)
        errExit("epoll_create");

    /* Create a pipe and add read end to epoll interest list */

    if (pipe(pfd) == -1)
        errExit("pipe");

    struct epoll_event ev;
    ev.data.fd = pfd[0];
    ev.events = EPOLLIN | EPOLLET;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        errExit("epoll_ctl");

    /* Run some tests */

    printf("Writing a byte to pipe()\n");
    write(pfd[1], "a", 1);

    doEpollWait(epfd, 0, 1, EPOLLIN);
    doEpollWait(epfd, 0, 0, 0);

    printf("Writing a byte to pipe()\n");
    write(pfd[1], "a", 1);

    doEpollWait(epfd, 0, 1, EPOLLIN);

    printf("Closing write end of pipe()\n");
    close(pfd[1]);

    doEpollWait(epfd, 0, 1, EPOLLIN | EPOLLHUP);

    exit(EXIT_SUCCESS);
}


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
  2020-10-12 18:39 Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs Michael Kerrisk (man-pages)
@ 2020-10-12 19:25 ` Linus Torvalds
  2020-10-12 19:28   ` Linus Torvalds
  2020-10-12 20:30   ` Michael Kerrisk (man-pages)
  0 siblings, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2020-10-12 19:25 UTC (permalink / raw)
  To: Michael Kerrisk-manpages, Alexander Viro
  Cc: David Howells, Rasmus Villemoes, Greg Kroah-Hartman,
	Peter Zijlstra, Nicolas Dichtel, Ian Kent, Christian Brauner,
	keyrings, linux-fsdevel, Linux API, lkml

On Mon, Oct 12, 2020 at 11:40 AM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Between Linux 5.4 and 5.5 a regression was introduced in the operation
> of the epoll EPOLLET flag. From some manual bisecting, the regression
> appears to have been introduced in
>
>          commit 1b6b26ae7053e4914181eedf70f2d92c12abda8a
>          Author: Linus Torvalds <torvalds@linux-foundation.org>
>          Date:   Sat Dec 7 12:14:28 2019 -0800
>
>              pipe: fix and clarify pipe write wakeup logic
>
> (I also built a kernel from the  immediate preceding commit, and did
> not observe the regression.)

So the difference from that commit is that now we only wake up a
reader of a pipe when we add data to it AND IT WAS EMPTY BEFORE.

> The aim of ET (edge-triggered) notification is that epoll_wait() will
> tell us a file descriptor is ready only if there has been new activity
> on the FD since we were last informed about the FD. So, in the
> following scenario where the read end of a pipe is being monitored
> with EPOLLET, we see:
>
> [Write a byte to write end of pipe]
> 1. Call epoll_wait() ==> tells us pipe read end is ready
> 2. Call epoll_wait() [again] ==> does not tell us that the read end of
> pipe is ready

Right.

> If we go further:
>
> [Write another byte to write end of pipe]
> 3. Call epoll_wait() ==> tells us pipe read end is ready

No.

The "read end" readiness has not changed. It was ready before, it's
ready now, there's no change in readiness.

Now, the old pipe behavior was that it would wake up writers whether
they needed it or not, so epoll got woken up even if the readiness
didn't actually change.

So we do have a change in behavior.

However, clearly your test is wrong, and there is no edge difference.

Now, if this is more than just a buggy test - and it actually breaks
some actual application and real behavior - we'll need to fix it. A
regression is a regression, and we'll need to be bug-for-bug
compatible for people who depended on bugs.

But if it's only a test, and no actual workflow that got broken, then
it's just a buggy test.

[ Adding Al to the participants, not because he has anything to do
with this pipe change, but because he's been working on epoll
cleanups, and I just want him to be aware of this thread. Al - Michael
has a test program for this thing that may or may not be worth keeping
in mind ]

                  Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
  2020-10-12 19:25 ` Linus Torvalds
@ 2020-10-12 19:28   ` Linus Torvalds
  2020-10-12 20:30   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2020-10-12 19:28 UTC (permalink / raw)
  To: Michael Kerrisk-manpages, Alexander Viro
  Cc: David Howells, Rasmus Villemoes, Greg Kroah-Hartman,
	Peter Zijlstra, Nicolas Dichtel, Ian Kent, Christian Brauner,
	keyrings, linux-fsdevel, Linux API, lkml

On Mon, Oct 12, 2020 at 12:25 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Now, the old pipe behavior was that it would wake up writers whether
> they needed it or not [..]

That "writers" should be "readers", of course.

Although yes, that commit changed it for both readers and writers: if
the pipe was readable from before, then a writer adding new data to it
doesn't make it "more readable". Similarly, if a pipe was writable
before, and a reader made even more room in it, the pipe didn't get
"more writable".

So that commit removes the pointless extra wakeup calls that don't
actually make any sense (and that gave incorrect edges to the some
EPOLL case that saw an edge that didn't actually exist).

              Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
  2020-10-12 19:25 ` Linus Torvalds
  2020-10-12 19:28   ` Linus Torvalds
@ 2020-10-12 20:30   ` Michael Kerrisk (man-pages)
  2020-10-12 20:52     ` Linus Torvalds
  2020-10-12 22:30     ` Linus Torvalds
  1 sibling, 2 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-12 20:30 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro
  Cc: mtk.manpages, David Howells, Rasmus Villemoes,
	Greg Kroah-Hartman, Peter Zijlstra, Nicolas Dichtel, Ian Kent,
	Christian Brauner, keyrings, linux-fsdevel, Linux API, lkml,
	Davide Libenzi

[CC += Davide]

Hello Linus,

Thanks for your quick reply.

On 10/12/20 9:25 PM, Linus Torvalds wrote:
> On Mon, Oct 12, 2020 at 11:40 AM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>>
>> Between Linux 5.4 and 5.5 a regression was introduced in the operation
>> of the epoll EPOLLET flag. From some manual bisecting, the regression
>> appears to have been introduced in
>>
>>          commit 1b6b26ae7053e4914181eedf70f2d92c12abda8a
>>          Author: Linus Torvalds <torvalds@linux-foundation.org>
>>          Date:   Sat Dec 7 12:14:28 2019 -0800
>>
>>              pipe: fix and clarify pipe write wakeup logic
>>
>> (I also built a kernel from the  immediate preceding commit, and did
>> not observe the regression.)
> 
> So the difference from that commit is that now we only wake up a
> reader of a pipe when we add data to it AND IT WAS EMPTY BEFORE.
> 
>> The aim of ET (edge-triggered) notification is that epoll_wait() will
>> tell us a file descriptor is ready only if there has been new activity
>> on the FD since we were last informed about the FD. So, in the
>> following scenario where the read end of a pipe is being monitored
>> with EPOLLET, we see:
>>
>> [Write a byte to write end of pipe]
>> 1. Call epoll_wait() ==> tells us pipe read end is ready
>> 2. Call epoll_wait() [again] ==> does not tell us that the read end of
>> pipe is ready
> 
> Right.
> 
>> If we go further:
>>
>> [Write another byte to write end of pipe]
>> 3. Call epoll_wait() ==> tells us pipe read end is ready
> 
> No.
> 
> The "read end" readiness has not changed. It was ready before, it's
> ready now, there's no change in readiness.
> 
> Now, the old pipe behavior was that it would wake up writers whether
> they needed it or not, so epoll got woken up even if the readiness
> didn't actually change.
> 
> So we do have a change in behavior.
> 
> However, clearly your test is wrong, and there is no edge difference.
> 
> Now, if this is more than just a buggy test - and it actually breaks
> some actual application and real behavior - we'll need to fix it. A
> regression is a regression, and we'll need to be bug-for-bug
> compatible for people who depended on bugs.

I don't think this is correct. The epoll(7) manual page
sill carries the text written long ago by Davide Libenzi,
the creator of epoll:

    Since  even with edge-triggered epoll, multiple events can be gen‐
    erated upon receipt of multiple chunks of data, the caller has the
    option  to specify the EPOLLONESHOT flag, to tell epoll to disable
    the associated file descriptor after the receipt of an event  with
    epoll_wait(2).

My reading of that text is that in the scenario that I describe a
readiness notification should be generated at step 3 (and indeed
should be generated whenever additional data bleeds into the channel).
Indeed, the very rationale for the existence of the EPOLLONESHOT flag
is to *prevent* notifications in such circumstances. And, as I noted,
sockets and terminals do (still) behave in the way that I expect in
this scenario.

So, I don't think this is a buggy test. It (still) appears to me
that this is a breakage of intended and documented behavior.
(Whether it breaks some actual application, I do not know. But
I have also seen that sometimes reports of such breakages take
a very time to come in.)

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
  2020-10-12 20:30   ` Michael Kerrisk (man-pages)
@ 2020-10-12 20:52     ` Linus Torvalds
  2020-10-12 21:43       ` Randy Dunlap
  2020-10-12 21:51       ` Michael Kerrisk (man-pages)
  2020-10-12 22:30     ` Linus Torvalds
  1 sibling, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2020-10-12 20:52 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alexander Viro, David Howells, Rasmus Villemoes,
	Greg Kroah-Hartman, Peter Zijlstra, Nicolas Dichtel, Ian Kent,
	Christian Brauner, keyrings, linux-fsdevel, Linux API, lkml,
	Davide Libenzi

[-- Attachment #1: Type: text/plain, Size: 1458 bytes --]

On Mon, Oct 12, 2020 at 1:30 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> [CC += Davide]

I'm not sure how active Davide is any more..

> I don't think this is correct. The epoll(7) manual page
> sill carries the text written long ago by Davide Libenzi,
> the creator of epoll:
>
>     Since  even with edge-triggered epoll, multiple events can be gen‐
>     erated upon receipt of multiple chunks of data, the caller has the
>     option  to specify the EPOLLONESHOT flag, to tell epoll to disable
>     the associated file descriptor after the receipt of an event  with
>     epoll_wait(2).
>
> My reading of that text is that in the scenario that I describe a
> readiness notification should be generated at step 3 (and indeed
> should be generated whenever additional data bleeds into the channel).

Hmm.

That is unfortunate, because it basically exposes an internal wait
queue implementation decision, not actual real semantics.

I suspect it's easy enough to "fix" the regression with the attached
patch. It's pretty nonsensical, but I guess there's not a lot of
downside - if the pipe wasn't empty, there normally shouldn't be any
non-epoll readers anyway.

I'm busy merging, mind testing this odd patch out? It is _entirely_
untested, but from the symptoms I think it's the obvious fix.

I did the same thing for the "reader starting out from a full pipe" case too.

               Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 1619 bytes --]

 fs/pipe.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 0ac197658a2d..e60565bb8125 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -228,14 +228,13 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	__pipe_lock(pipe);
 
 	/*
-	 * We only wake up writers if the pipe was full when we started
-	 * reading in order to avoid unnecessary wakeups.
+	 * Epoll wants us to wake things up whether it was full or not.
 	 *
 	 * But when we do wake up writers, we do so using a sync wakeup
 	 * (WF_SYNC), because we want them to get going and generate more
 	 * data for us.
 	 */
-	was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
+	was_full = true;
 	for (;;) {
 		unsigned int head = pipe->head;
 		unsigned int tail = pipe->tail;
@@ -429,8 +428,8 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 #endif
 
 	/*
-	 * Only wake up if the pipe started out empty, since
-	 * otherwise there should be no readers waiting.
+	 * Epoll nonsensically wants a wakeup wheher the pipe was
+	 * already empty or not.
 	 *
 	 * If it wasn't empty we try to merge new data into
 	 * the last buffer.
@@ -440,9 +439,9 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * spanning multiple pages.
 	 */
 	head = pipe->head;
-	was_empty = pipe_empty(head, pipe->tail);
+	was_empty = true;
 	chars = total_len & (PAGE_SIZE-1);
-	if (chars && !was_empty) {
+	if (chars && !pipe_empty(head, pipe->tail)) {
 		unsigned int mask = pipe->ring_size - 1;
 		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
 		int offset = buf->offset + buf->len;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
  2020-10-12 20:52     ` Linus Torvalds
@ 2020-10-12 21:43       ` Randy Dunlap
  2020-10-12 21:51       ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 9+ messages in thread
From: Randy Dunlap @ 2020-10-12 21:43 UTC (permalink / raw)
  To: Linus Torvalds, Michael Kerrisk (man-pages)
  Cc: Alexander Viro, David Howells, Rasmus Villemoes,
	Greg Kroah-Hartman, Peter Zijlstra, Nicolas Dichtel, Ian Kent,
	Christian Brauner, keyrings, linux-fsdevel, Linux API, lkml,
	Davide Libenzi

On 10/12/20 1:52 PM, Linus Torvalds wrote:
> On Mon, Oct 12, 2020 at 1:30 PM Michael Kerrisk (man-pages)
> 
> I'm busy merging, mind testing this odd patch out? It is _entirely_
> untested, but from the symptoms I think it's the obvious fix.
> 
> I did the same thing for the "reader starting out from a full pipe" case too.
> 
>                Linus
> 

and if that patch tests good, then fix this typo, please:

+	 * Epoll nonsensically wants a wakeup wheher the pipe was

	                                      whether

thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
  2020-10-12 20:52     ` Linus Torvalds
  2020-10-12 21:43       ` Randy Dunlap
@ 2020-10-12 21:51       ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-12 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mtk.manpages, Alexander Viro, David Howells, Rasmus Villemoes,
	Greg Kroah-Hartman, Peter Zijlstra, Nicolas Dichtel, Ian Kent,
	Christian Brauner, keyrings, linux-fsdevel, Linux API, lkml,
	Davide Libenzi

On 10/12/20 10:52 PM, Linus Torvalds wrote:
> On Mon, Oct 12, 2020 at 1:30 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>>
>> [CC += Davide]
> 
> I'm not sure how active Davide is any more..

Yep, I know. But just in case.

>> I don't think this is correct. The epoll(7) manual page
>> sill carries the text written long ago by Davide Libenzi,
>> the creator of epoll:
>>
>>     Since  even with edge-triggered epoll, multiple events can be gen‐
>>     erated upon receipt of multiple chunks of data, the caller has the
>>     option  to specify the EPOLLONESHOT flag, to tell epoll to disable
>>     the associated file descriptor after the receipt of an event  with
>>     epoll_wait(2).
>>
>> My reading of that text is that in the scenario that I describe a
>> readiness notification should be generated at step 3 (and indeed
>> should be generated whenever additional data bleeds into the channel).
> 
> Hmm.
> 
> That is unfortunate, because it basically exposes an internal wait
> queue implementation decision, not actual real semantics.

I don't disagree that the longstanding semantics are a little odd;
your comment explains perhaps why.

> I suspect it's easy enough to "fix" the regression with the attached
> patch. It's pretty nonsensical, but I guess there's not a lot of
> downside - if the pipe wasn't empty, there normally shouldn't be any
> non-epoll readers anyway.
> 
> I'm busy merging, mind testing this odd patch out? It is _entirely_
> untested, but from the symptoms I think it's the obvious fix.

Applied against current master (13cb73490f475). My test now
runs as I expected.

> I did the same thing for the "reader starting out from a full pipe" case too.

I haven't tested this, but thanks for thinking of it.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
  2020-10-12 20:30   ` Michael Kerrisk (man-pages)
  2020-10-12 20:52     ` Linus Torvalds
@ 2020-10-12 22:30     ` Linus Torvalds
  2020-10-13  9:47       ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2020-10-12 22:30 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alexander Viro, David Howells, Rasmus Villemoes,
	Greg Kroah-Hartman, Peter Zijlstra, Nicolas Dichtel, Ian Kent,
	Christian Brauner, keyrings, linux-fsdevel, Linux API, lkml,
	Davide Libenzi

On Mon, Oct 12, 2020 at 1:30 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> I don't think this is correct. The epoll(7) manual page
> sill carries the text written long ago by Davide Libenzi,
> the creator of epoll:
>
>     Since  even with edge-triggered epoll, multiple events can be gen‐
>     erated upon receipt of multiple chunks of data, the caller has the
>     option  to specify the EPOLLONESHOT flag, to tell epoll to disable
>     the associated file descriptor after the receipt of an event  with
>     epoll_wait(2).

Hmm.

The more I read that paragraph, the more I think the epoll man-page
really talks about something that _could_ happen due to internal
implementation details, but that isn't really something an epoll user
would _want_ to happen or depend on.

IOW, in that whole "even with edge-triggered epoll, multiple events
can be generated", I'd emphasize the *can* part (as in "might", not as
in "will"), and my reading is that the reason EPOLLONESHOT flag exists
is to avoid that whole "this is implementation-defined, and if you
absolutely _must_ get just a single event, you need to use
EPOLLONESHOT to make sure you remove yourself after you got the one
single event you waited for".

The corollary of that reading is that the new pipe behavior is
actually the _expected_ one, and the old pipe behavior where we would
generate multiple events is the unwanted implementation detail of
"this might still happen, and if you care, you will need to do extra
stuff".

Anyway, I don't absolutely hate that patch of mine, but it does seem
nonsensical and pointless, and I think I'll just hold off on applying
it until we hear of something actually breaking.

Which I suspect simply won't happen. Getting two epoll notifications
when the pipe state didn't really change in between is not something I
can see anybody really depending on.

You _will_ get the second notification if somebody actually emptied
the pipe in between, and you have a real new "edge".

But hey, I am continually surprised by what user space code then
occasionally does, despite my fairly low expectations.

              Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
  2020-10-12 22:30     ` Linus Torvalds
@ 2020-10-13  9:47       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-10-13  9:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mtk.manpages, Alexander Viro, David Howells, Rasmus Villemoes,
	Greg Kroah-Hartman, Peter Zijlstra, Nicolas Dichtel, Ian Kent,
	Christian Brauner, keyrings, linux-fsdevel, Linux API, lkml,
	Davide Libenzi

Hello Linus,

On 10/13/20 12:30 AM, Linus Torvalds wrote:
> On Mon, Oct 12, 2020 at 1:30 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>>
>> I don't think this is correct. The epoll(7) manual page
>> sill carries the text written long ago by Davide Libenzi,
>> the creator of epoll:
>>
>>     Since  even with edge-triggered epoll, multiple events can be gen‐
>>     erated upon receipt of multiple chunks of data, the caller has the
>>     option  to specify the EPOLLONESHOT flag, to tell epoll to disable
>>     the associated file descriptor after the receipt of an event  with
>>     epoll_wait(2).
> 
> Hmm.
> 
> The more I read that paragraph, the more I think the epoll man-page
> really talks about something that _could_ happen due to internal
> implementation details, but that isn't really something an epoll user
> would _want_ to happen or depend on.
> 
> IOW, in that whole "even with edge-triggered epoll, multiple events
> can be generated", I'd emphasize the *can* part (as in "might", not as
> in "will"), and my reading is that the reason EPOLLONESHOT flag exists
> is to avoid that whole "this is implementation-defined, and if you
> absolutely _must_ get just a single event, you need to use
> EPOLLONESHOT to make sure you remove yourself after you got the one
> single event you waited for".

I agree that that is also a valid alternate reading of the text, 
in particular, "can" could be read as "might" rather than "will".

I also agree that the semantics before the change were odd
(but see [3]).

But...

> The corollary of that reading is that the new pipe behavior is
> actually the _expected_ one, and the old pipe behavior where we would
> generate multiple events is the unwanted implementation detail of
> "this might still happen, and if you care, you will need to do extra
> stuff".

"expected" by who? I mean, there were established semantics
for pipes/FIFOs in this scenario. Those semantics changed in
Linux 5.5.

However, those established EPOLLET semantics are still (I tested
each of these) followed by:

* Sockets (tested in Internet domain)
* Terminals
* POSIX message queues
* Hierarchical epoll instances; for example:
  - epoll FD X monitors epoll FD Y with EPOLLET
  - epoll FD Y monitors two FDs, A and B, for EPOLLIN
  - input arrives on FD A
  - epoll_wait on X returns EPOLLIN for FD Y
  - next epoll_wait on X doesn't inform us that Y is ready
  - input arrives on B
  - epoll_wait on X returns EPOLLIN for FD Y

I would say that users *expect* at least the following:

* That semantics don't change unexpectedly.
* That semantics are consistent.

In Linux 5.5, the pipe EPOLLET semantics changed unexpectedly.
And now, pipes have EPOLLET semantics that are inconsistent with
every other type of FD (that I tested).

> Anyway, I don't absolutely hate that patch of mine, but it does seem
> nonsensical and pointless, and I think I'll just hold off on applying
> it until we hear of something actually breaking.

The problem is that sometimes it takes a very long time to hear
of something breaking. For example, a Linux 3.5 regression in
the POSIX message queue API was only fixed in 3.14 [1], and only
after the breakage was reported as a man-pages bug(!) a year
after the breakage.

And sometimes, if things don't get fixed soon enough, then
any fix will break new users. Thus we now have F_SETOWN_EX
(2.6.32) to do what F_SETOWN used to do before a regression
that occurred about 4 years earlier (2.6.12) (see [2]), because
reverting the F_SETOWN semantics to what they originally 
were might have broken some new apps that had appeared in
those four years.

> Which I suspect simply won't happen. Getting two epoll notifications
> when the pipe state didn't really change in between is not something I
> can see anybody really depending on.
> 
> You _will_ get the second notification if somebody actually emptied
> the pipe in between, and you have a real new "edge".
> 
> But hey, I am continually surprised by what user space code then
> occasionally does, despite my fairly low expectations.

Yes, user space code does surprising things. But, give people
enough time and every detail of API behavior will come
to be depended upon by someone. We don't know if anyone
depends on the old pipe EPOLLET behavior. I also imagine the
chances are small, but if users do depend on it, they are
in for an unpleasant surprise (missed notifications).

We can all agree that the existing EPOLLET are perhaps strange.
However, why change these semantics just for pipes? In other
words, given my notes above about consistency, what is the
argument for not applying the patch? IOW, I think "consistency"
is a rather stronger argument than "but it seems nonsensical
and pointless"; YMMV.

Cheers,

Michael

[1] See the discussion of HARD_QUEUESMAX in
https://www.man7.org/linux/man-pages/man7/mq_overview.7.html#BUGS

[2]
https://www.man7.org/linux/man-pages/man2/fcntl.2.html

[3]
Leakage of implementation details into the API is hardly
unprecedented; thus, for example, POSIX permits spurious
wake-ups on condition variable waiters to allow for
efficient CV implementation on multiprocessor systems.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-10-13  9:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-12 18:39 Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs Michael Kerrisk (man-pages)
2020-10-12 19:25 ` Linus Torvalds
2020-10-12 19:28   ` Linus Torvalds
2020-10-12 20:30   ` Michael Kerrisk (man-pages)
2020-10-12 20:52     ` Linus Torvalds
2020-10-12 21:43       ` Randy Dunlap
2020-10-12 21:51       ` Michael Kerrisk (man-pages)
2020-10-12 22:30     ` Linus Torvalds
2020-10-13  9:47       ` Michael Kerrisk (man-pages)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).