linux-kernel.vger.kernel.org archive mirror
From: Jamie Lokier <lk@tantalophile.demon.co.uk>
To: "Matthew D. Hall" <mhall@free-market.net>
Cc: Davide Libenzi <davidel@xmailserver.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-aio@kvack.org, lse-tech@lists.sourceforge.net,
	Linus Torvalds <torvalds@transmeta.com>,
	Andrew Morton <akpm@digeo.com>,
	Alan Cox <alan@lxorguk.ukuu.org.uk>
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)
Date: Fri, 1 Nov 2002 02:56:14 +0000
Message-ID: <20021101025614.GD30865@bjl1.asuk.net>
In-Reply-To: <3DC1DEFB.6070206@free-market.net>

Matthew D. Hall wrote:
> The API should present the notion of a general edge-triggered event 
> (e.g. I/O upon sockets, pipes, files, timers, etc.), and it should do so 
> simply.

Agreed.  Btw, earlier today I had misgivings about epoll, but since
I've had such a positive response from Davide I think epoll has
the potential to become that ideal interface...

> *  Unless every conceivable event is to be represented as a file (rather 
> unintuitive IMHO), his proposed interface fails to accommodate non-I/O 
> events (e.g. timers, signals, directory updates, etc.).

...apart from this one point!

> As much as I appreciate the UNIX Way, making everything a file is a
> massive oversimplification.  We can often stretch the definition far
> enough to make this work, but I'd be impressed to see how one
> intends to call, e.g., a timer a type of file.

If it has an fd, that is, if it has an index into file_table, then
it's a "file".  No other semantics are required for event purposes.

This seems quite weird and pointless at first, but actually fds are
quite useful: you can dup them and pass them between processes, and
they have a security model (you can't use someone else's fd unless
they've passed it to you).  Think of an fd as a handle to an arbitrary
object.

OTOH look at rt-signals: a complete mess, no kernel allocation
mechanism, libraries fight it out and don't always work.  Look at aio:
it has an aio_context_t - IMHO that should be an fd, not an opaque
number that cannot be transferred to another process or polled on.

However, despite all the goodness of fds, you're right.  Event queues
really need to deliver more info than which event and
read/write/hangup.  dnotify events should include what happened and
maybe the inode number.  futex events should include the address.
(rt-signals get close to this but fail due to pseudo-compatibility
with a braindead POSIX standard).
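
As a rough illustration of "more info than which event and
read/write/hangup", here is a sketch of an event record with a common
header and per-type details.  Every name in it (struct event, ev_kind,
event_describe) is hypothetical; it is not the layout of any existing
kernel interface.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical event record: a common header plus per-type details. */
enum ev_kind { EV_IO, EV_DNOTIFY, EV_FUTEX };

struct event {
    enum ev_kind kind;
    int          fd;                /* which registration fired */
    union {
        struct { uint32_t ready_mask; } io;          /* read/write/hangup */
        struct { uint32_t what; uint64_t inode; } dnotify;
        struct { uintptr_t addr; } futex;            /* address involved */
    } u;
};

/* Format an event for logging.  Returns what snprintf() returns, or -1
 * for an unknown kind. */
int event_describe(const struct event *ev, char *buf, size_t len)
{
    assert(ev != NULL && buf != NULL);

    switch (ev->kind) {
    case EV_IO:
        return snprintf(buf, len, "fd %d: io mask 0x%" PRIx32,
                        ev->fd, ev->u.io.ready_mask);
    case EV_DNOTIFY:
        return snprintf(buf, len,
                        "fd %d: dnotify what 0x%" PRIx32 " inode %" PRIu64,
                        ev->fd, ev->u.dnotify.what, ev->u.dnotify.inode);
    case EV_FUTEX:
        return snprintf(buf, len, "fd %d: futex at 0x%" PRIxPTR,
                        ev->fd, ev->u.futex.addr);
    }
    return -1;
}
```

The fd still identifies the source, so the fd-based security model is
kept; the union only carries the extra detail that a bare
read/write/hangup mask cannot.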

> *  There is a seemingly significant overhead in performing exactly one 
> callback per event.  Doesn't this prevent any kind of event coalescence? 

As you go on to say, this should be a matter for userspace.  My
concern is that kernel space should provide a flexible mechanism for a
variety of possible userspaces.

> *  The interface should allow the implementation to be highly extensible 
> without horrible code contortions within the kernel.  It is important to 
> be able to add new types of events as they become necessary.  The 
> interface should be general and simple enough to accommodate these 
> extensions.  Linux (really, UNIX) has failed to exercise this foresight 
> in the past, and that's why we have the current mess of varied 
> polling/triggering methods.

Agreed, agreed, agreed, agreed.  Fwiw, I now think these can all be
satisfied with some evolution of epoll, if that is permitted.

> *  I might be getting a bit utopian here, but IMHO the kernel should 
> move toward a completely edge-triggered event system.  The old 
> level-triggered interfaces should only wrap this paradigm.

Take a close look at how wait queues and notifier chains are used.
Some of the kernel is edge triggered already.  Admittedly, it's about
as clear as clay at times - some wait queues are used in an
edge-triggered way, others level-triggered.

> *  Might as well reiterate: simplicity.  FreeBSD's kevent solves nearly 
> all of the traditional problems, but it is gross.  And complicated. 

Could you explain what you find complicated and/or gross about kevent?
We should learn from their mistakes.

> There were clearly too many computer scientists and not enough 
> engineers on that team.

I am both ;)

> *  Only one queue per process or kernel thread.  Multiple queues per 
> flow of execution is just ugly and ultimately pointless.

Disagree.  You're right that it's technically not necessary to have
multiple queues, but in practice you can't always force an entire
program to funnel all its I/O through one library - that just doesn't
happen in reality.  And there is basically no cost to having multiple
queues.  Keyed off fds :)
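
A small sketch of what "multiple queues, keyed off fds" looks like in
practice, assuming the epoll calls as found in current Linux
(epoll_create1, epoll_ctl, epoll_wait).  The struct lib and the lib_*
functions are hypothetical stand-ins for two libraries that know
nothing about each other; each owns a private queue.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical "library": owns a private epoll queue watching one pipe. */
struct lib {
    int epfd;
    int pipe_rd, pipe_wr;
};

/* Returns 0 on success, -1 on failure (with nothing left open). */
int lib_init(struct lib *l)
{
    int p[2];
    struct epoll_event ev = { .events = EPOLLIN };

    assert(l != NULL);
    if (pipe(p) < 0)
        return -1;

    int epfd = epoll_create1(0);
    ev.data.fd = p[0];
    if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0) {
        if (epfd >= 0)
            close(epfd);
        close(p[0]);
        close(p[1]);
        return -1;
    }
    l->epfd = epfd;
    l->pipe_rd = p[0];
    l->pipe_wr = p[1];
    return 0;
}

/* Non-blocking check: number of events pending on this library's queue. */
int lib_pending(const struct lib *l)
{
    struct epoll_event out[4];

    return epoll_wait(l->epfd, out, 4, 0);
}

void lib_close(struct lib *l)
{
    close(l->epfd);
    close(l->pipe_rd);
    close(l->pipe_wr);
}
```

Activity on one library's pipe shows up only on that library's queue;
neither has to agree with the other on a shared signal number or a
process-wide dispatcher.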

That was a mistake made by rt-signals: assuming all the different bits
of code that use rt-signals will cooperate.  Theoretically solvable in
user space.  In reality, they don't know about each other.  Although
my code at least searches for a free rt-signal, that's not guaranteed
to work if another library assumes a fixed value.

The same problem occurs with the LDT.  Glibc wants to use it and so do I.
Conflict.

> Since the event notification is edge-triggered, I cannot see any
> reason for a significant performance degradation resulting from this
> policy.  I am not altogether convinced that this must be a strict
> policy, however; the issue of different userspace threads having
> different event queues inside the same task is interesting.

User space threads are often but not always built on top of a simple
scheduler which converts blocking system calls to queued non-blocking
system calls.  If done well, this is a form of virtualisation which
may even be nestable.

You'd expect the event queue mechanism to be included in the set of
blocking system calls which are converted, so multiple userspace
threads would "see" individual queues even though they are multiplexed
by the userspace scheduler.

This works great, until those threads expect mmap() to provide them
with their own separate zero-copy event queues :) So another reason to
need multiple queues from the kernel.

> *  No re-arming events.  They must be manually killed.

I would provide both options, like dnotify does: one-shot and
continuous.  There are occasions when one-shot is better for resource
usage - it depends what the event is monitoring.
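
For comparison, epoll as found in current Linux offers both modes
through the EPOLLONESHOT flag.  The sketch below assumes that flag and
a Linux system; the function name events_on_second_edge is made up.  It
delivers two separate edges on a pipe and reports whether the second
one is seen.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register a pipe's read end edge-triggered, plus extra_flags (0 for
 * continuous, EPOLLONESHOT for one-shot).  Deliver two separate edges
 * and return how many events the second one produced, or -1 if setup
 * failed. */
int events_on_second_edge(uint32_t extra_flags)
{
    int p[2], n = -1;
    char c;
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET | extra_flags };
    struct epoll_event out;

    assert(extra_flags == 0 || extra_flags == EPOLLONESHOT);
    if (pipe(p) < 0)
        return -1;

    int epfd = epoll_create1(0);
    ev.data.fd = p[0];
    if (epfd >= 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "a", 1) == 1 &&
        epoll_wait(epfd, &out, 1, 0) == 1 &&  /* first edge: always seen */
        read(p[0], &c, 1) == 1 &&             /* drain: next write is a new edge */
        write(p[1], "b", 1) == 1)
        n = epoll_wait(epfd, &out, 1, 0);     /* second edge: depends on mode */

    if (epfd >= 0)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return n;
}
```

With extra_flags == 0 the second edge is reported; with EPOLLONESHOT
the registration stays disabled after the first event until it is
re-armed with EPOLL_CTL_MOD, so the second call sees nothing.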

> *  I'm sure everyone would agree that passing an opaque "user context" 
> pointer is necessary with each event.

It is not the end of the world to use an fd number and a table, but it
may improve thread scalability to use a pointer instead.

> *  Zero-copy event delivery (of course).

I think this is not critical for performance, but desirable anyway.  I
would take this further:

    1. zero-copy delivery
    2. zero system calls as long as the queue is non-empty
       (like the packet socket mmap interface)
    3. no fixed limit on the size of the queue at creation time

Neither epoll nor aio satisfies (3).  Luckily I have a nice design which
satisfies all these and is extensible in the ways we agree on.

> Some question marks:
> -  Should the kernel attempt to prune the queue of "cancelled" events 
> (hints later deemed irrelevant, untrue, or obsolete by newer events)?

Something is needed in the case of aio cancellations, but I think
that's different to what you mean.  Backtracking hints is quite
difficult to synchronise with userspace if done through mmap and no
system calls.  It's best not to bother.  Coalescing events, which can
have the effect of superseding events in some cases, is a possibility.
It's tricky but more worthwhile than backtracking.

For some kinds of event, such as round robin futex wakeups, it's
critically important that the _number_ of events seen by userspace is
the same as the number sent from the kernel.  In these cases, they are
not just hints, they are synchronisation tokens.  That adds some
excitement to coalescing in a shared memory buffer.
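
A toy model of the distinction, assuming a single-threaded queue and
hypothetical names (struct coalescer, coalesce_post, coalesce_total).
Hints for the same key collapse into one entry; tokens also collapse
into one entry but keep a count, so the number of wakeups is preserved.
The hard part mentioned above, doing this safely in a buffer shared
with userspace, is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

struct pending {
    int      key;      /* e.g. the fd or futex the event refers to */
    uint32_t count;    /* how many events this entry stands for */
};

struct coalescer {
    struct pending q[MAX_PENDING];
    size_t         n;
};

/* Queue an event for `key`.  A hint (is_token == false) merges into an
 * existing entry without changing its count; a token merges too, but is
 * still counted.  Returns 0 on success, -1 if the table is full. */
int coalesce_post(struct coalescer *c, int key, bool is_token)
{
    assert(c != NULL);

    for (size_t i = 0; i < c->n; i++) {
        if (c->q[i].key == key) {
            if (is_token)
                c->q[i].count++;   /* merged, but still counted */
            return 0;              /* hint: already pending */
        }
    }
    if (c->n == MAX_PENDING)
        return -1;
    c->q[c->n++] = (struct pending){ .key = key, .count = 1 };
    return 0;
}

/* Total number of events userspace must act on across all entries. */
uint32_t coalesce_total(const struct coalescer *c)
{
    uint32_t total = 0;

    for (size_t i = 0; i < c->n; i++)
        total += c->q[i].count;
    return total;
}
```

Three token posts for one futex leave a single entry with a count of
three; three hint posts for one fd leave a single entry with a count
of one.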

-- Jamie
