linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Alexander Viro <aviro@redhat.com>,
	David Howells <dhowells@redhat.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Nicolas Dichtel <nicolas.dichtel@6wind.com>,
	Ian Kent <raven@themaw.net>,
	Christian Brauner <christian@brauner.io>,
	keyrings@vger.kernel.org,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Davide Libenzi <davidel@xmailserver.org>
Subject: Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
Date: Mon, 12 Oct 2020 13:52:15 -0700	[thread overview]
Message-ID: <CAHk-=wgjR7Nd4CyDoi3SH9kPJp_Td9S-hhFJZMqvp6GS1Ww8eg@mail.gmail.com> (raw)
In-Reply-To: <81229415-fb97-51f7-332c-d5e468bcbf2a@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1458 bytes --]

On Mon, Oct 12, 2020 at 1:30 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> [CC += Davide]

I'm not sure how active Davide is any more..

> I don't think this is correct. The epoll(7) manual page
> sill carries the text written long ago by Davide Libenzi,
> the creator of epoll:
>
>     Since  even with edge-triggered epoll, multiple events can be gen‐
>     erated upon receipt of multiple chunks of data, the caller has the
>     option  to specify the EPOLLONESHOT flag, to tell epoll to disable
>     the associated file descriptor after the receipt of an event  with
>     epoll_wait(2).
>
> My reading of that text is that in the scenario that I describe a
> readiness notification should be generated at step 3 (and indeed
> should be generated whenever additional data bleeds into the channel).

Hmm.

That is unfortunate, because it basically exposes an internal wait
queue implementation decision, not actual real semantics.

I suspect it's easy enough to "fix" the regression with the attached
patch. It's pretty nonsensical, but I guess there's not a lot of
downside - if the pipe wasn't empty, there normally shouldn't be any
non-epoll readers anyway.

I'm busy merging, mind testing this odd patch out? It is _entirely_
untested, but from the symptoms I think it's the obvious fix.

I did the same thing for the "reader starting out from a full pipe" case too.

               Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 1619 bytes --]

 fs/pipe.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 0ac197658a2d..e60565bb8125 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -228,14 +228,13 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	__pipe_lock(pipe);
 
 	/*
-	 * We only wake up writers if the pipe was full when we started
-	 * reading in order to avoid unnecessary wakeups.
+	 * Epoll wants us to wake things up whether it was full or not.
 	 *
 	 * But when we do wake up writers, we do so using a sync wakeup
 	 * (WF_SYNC), because we want them to get going and generate more
 	 * data for us.
 	 */
-	was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
+	was_full = true;
 	for (;;) {
 		unsigned int head = pipe->head;
 		unsigned int tail = pipe->tail;
@@ -429,8 +428,8 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 #endif
 
 	/*
-	 * Only wake up if the pipe started out empty, since
-	 * otherwise there should be no readers waiting.
+	 * Epoll nonsensically wants a wakeup wheher the pipe was
+	 * already empty or not.
 	 *
 	 * If it wasn't empty we try to merge new data into
 	 * the last buffer.
@@ -440,9 +439,9 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * spanning multiple pages.
 	 */
 	head = pipe->head;
-	was_empty = pipe_empty(head, pipe->tail);
+	was_empty = true;
 	chars = total_len & (PAGE_SIZE-1);
-	if (chars && !was_empty) {
+	if (chars && !pipe_empty(head, pipe->tail)) {
 		unsigned int mask = pipe->ring_size - 1;
 		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
 		int offset = buf->offset + buf->len;

  reply	other threads:[~2020-10-12 20:52 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-12 18:39 Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs Michael Kerrisk (man-pages)
2020-10-12 19:25 ` Linus Torvalds
2020-10-12 19:28   ` Linus Torvalds
2020-10-12 20:30   ` Michael Kerrisk (man-pages)
2020-10-12 20:52     ` Linus Torvalds [this message]
2020-10-12 21:43       ` Randy Dunlap
2020-10-12 21:51       ` Michael Kerrisk (man-pages)
2020-10-12 22:30     ` Linus Torvalds
2020-10-13  9:47       ` Michael Kerrisk (man-pages)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wgjR7Nd4CyDoi3SH9kPJp_Td9S-hhFJZMqvp6GS1Ww8eg@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=aviro@redhat.com \
    --cc=christian@brauner.io \
    --cc=davidel@xmailserver.org \
    --cc=dhowells@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=keyrings@vger.kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=mtk.manpages@gmail.com \
    --cc=nicolas.dichtel@6wind.com \
    --cc=peterz@infradead.org \
    --cc=raven@themaw.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).