linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mtk.manpages@gmail.com, Alexander Viro <aviro@redhat.com>,
	David Howells <dhowells@redhat.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Nicolas Dichtel <nicolas.dichtel@6wind.com>,
	Ian Kent <raven@themaw.net>,
	Christian Brauner <christian@brauner.io>,
	keyrings@vger.kernel.org,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Davide Libenzi <davidel@xmailserver.org>
Subject: Re: Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs
Date: Tue, 13 Oct 2020 11:47:09 +0200	[thread overview]
Message-ID: <300cb158-5ab1-ed55-404f-8abc9cbdcae0@gmail.com> (raw)
In-Reply-To: <CAHk-=wjYN_80i=9ALMwxZ77_TS_TMjkVyZ261xtuiMUaZsM4ng@mail.gmail.com>

Hello Linus,

On 10/13/20 12:30 AM, Linus Torvalds wrote:
> On Mon, Oct 12, 2020 at 1:30 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>>
>> I don't think this is correct. The epoll(7) manual page
>> sill carries the text written long ago by Davide Libenzi,
>> the creator of epoll:
>>
>>     Since  even with edge-triggered epoll, multiple events can be gen‐
>>     erated upon receipt of multiple chunks of data, the caller has the
>>     option  to specify the EPOLLONESHOT flag, to tell epoll to disable
>>     the associated file descriptor after the receipt of an event  with
>>     epoll_wait(2).
> 
> Hmm.
> 
> The more I read that paragraph, the more I think the epoll man-page
> really talks about something that _could_ happen due to internal
> implementation details, but that isn't really something an epoll user
> would _want_ to happen or depend on.
> 
> IOW, in that whole "even with edge-triggered epoll, multiple events
> can be generated", I'd emphasize the *can* part (as in "might", not as
> in "will"), and my reading is that the reason EPOLLONESHOT flag exists
> is to avoid that whole "this is implementation-defined, and if you
> absolutely _must_ get just a single event, you need to use
> EPOLLONESHOT to make sure you remove yourself after you got the one
> single event you waited for".

I agree that that is also a valid alternate reading of the text, 
in particular, "can" could be read as "might" rather than "will".

I also agree that the semantics before the change were odd
(but see [3]).

But...

> The corollary of that reading is that the new pipe behavior is
> actually the _expected_ one, and the old pipe behavior where we would
> generate multiple events is the unwanted implementation detail of
> "this might still happen, and if you care, you will need to do extra
> stuff".

"expected" by who? I mean, there were established semantics
for pipes/FIFOs in this scenario. Those semantics changed in
Linux 5.5.

However, those established EPOLLET semantics are still (I tested
each of these) followed by:

* Sockets (tested in Internet domain)
* Terminals
* POSIX message queues
* Hierarchical epoll instances; for example:
  - epoll FD X monitors epoll FD Y with EPOLLET
  - epoll FD Y monitors two FDs, A and B, for EPOLLIN
  - input arrives on FD A
  - epoll_wait on X returns EPOLLIN for FD Y
  - next epoll_wait on X doesn't inform us that Y is ready
  - input arrives on B
  - epoll_wait on X returns EPOLLIN for FD Y

I would say that users *expect* at least the following:

* That semantics don't change unexpectedly.
* That semantics are consistent.

In Linux 5.5, the pipe EPOLLET semantics changed unexpectedly.
And now, pipes have EPOLLET semantics that are inconsistent with
every other type of FD (that I tested).

> Anyway, I don't absolutely hate that patch of mine, but it does seem
> nonsensical and pointless, and I think I'll just hold off on applying
> it until we hear of something actually breaking.

The problem is that sometimes it takes a very long time to hear
of something breaking. For example, a Linux 3.5 regression in
the POSIX message queue API was only fixed in 3.14 [1], and only
after the breakage was reported as a man-pages bug(!) a year
after the breakage.

And sometimes, if things don't get fixed soon enough, then
any fix will break new users. Thus we now have F_SETOWN_EX
(2.6.32) to do what F_SETOWN used to do before a regression
that occurred about 4 years earlier (2.6.12) (see [2]), because
reverting the F_SETOWN semantics to what they originally 
were might have broken some new apps that had appeared in
those four years.

> Which I suspect simply won't happen. Getting two epoll notifications
> when the pipe state didn't really change in between is not something I
> can see anybody really depending on.
> 
> You _will_ get the second notification if somebody actually emptied
> the pipe in between, and you have a real new "edge".
> 
> But hey, I am continually surprised by what user space code then
> occasionally does, despite my fairly low expectations.

Yes, user space code does surprising things. But, give people
enough time and every detail of API behavior will come
to be depended upon by someone. We don't know if anyone
depends on the old pipe EPOLLET behavior. I also imagine the
chances are small, but if users do depend on it, they are
in for an unpleasant surprise (missed notifications).

We can all agree that the existing EPOLLET are perhaps strange.
However, why change these semantics just for pipes? In other
words, given my notes above about consistency, what is the
argument for not applying the patch? IOW, I think "consistency"
is a rather stronger argument than "but it seems nonsensical
and pointless"; YMMV.

Cheers,

Michael

[1] See the discussion of HARD_QUEUESMAX in
https://www.man7.org/linux/man-pages/man7/mq_overview.7.html#BUGS

[2]
https://www.man7.org/linux/man-pages/man2/fcntl.2.html

[3]
Leakage of implementation details into the API is hardly
unprecedented; thus, for example, POSIX permits spurious
wake-ups on condition variable waiters to allow for
efficient CV implementation on multiprocessor systems.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

      reply	other threads:[~2020-10-13  9:47 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-12 18:39 Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs Michael Kerrisk (man-pages)
2020-10-12 19:25 ` Linus Torvalds
2020-10-12 19:28   ` Linus Torvalds
2020-10-12 20:30   ` Michael Kerrisk (man-pages)
2020-10-12 20:52     ` Linus Torvalds
2020-10-12 21:43       ` Randy Dunlap
2020-10-12 21:51       ` Michael Kerrisk (man-pages)
2020-10-12 22:30     ` Linus Torvalds
2020-10-13  9:47       ` Michael Kerrisk (man-pages) [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=300cb158-5ab1-ed55-404f-8abc9cbdcae0@gmail.com \
    --to=mtk.manpages@gmail.com \
    --cc=aviro@redhat.com \
    --cc=christian@brauner.io \
    --cc=davidel@xmailserver.org \
    --cc=dhowells@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=keyrings@vger.kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=nicolas.dichtel@6wind.com \
    --cc=peterz@infradead.org \
    --cc=raven@themaw.net \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).