From: Jamie Lokier <jamie@shareable.org>
To: Eric Varsanyi <e0206@foo21.com>
Cc: linux-kernel@vger.kernel.org, davidel@xmailserver.org
Subject: Re: [Patch][RFC] epoll and half closed TCP connections
Date: Sun, 13 Jul 2003 14:12:10 +0100
Message-ID: <20030713131210.GA19132@mail.jlokier.co.uk>
In-Reply-To: <20030712205114.GC15643@srv.foo21.com>

Eric Varsanyi wrote:
> > Well then, use epoll's level-triggered mode.  It's quite easy - it's
> > the default now. :)
> 
> The problem with all the level triggered schemes (poll, select, epoll w/o
> EPOLLET) is that they call every driver and poll status for every call into
> the kernel. This appeared to be killing my app's performance and I verified
> by writing some simple micro benchmarks.

OH! :-O

Level-triggered epoll_wait() time _should_ be scalable - proportional
to the number of ready events, not to the number of fds you are
watching.  If that is not the case, it's a bug in epoll.

In principle, you will see a large delay only if you don't handle
the ready events (e.g. by calling read() on each ready fd), so that
they are still ready on the next call.
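
For concreteness, a minimal level-triggered loop looks something like
this (a sketch, untested; handle_data() is a made-up application
callback):

    #include <sys/epoll.h>
    #include <unistd.h>

    extern void handle_data(int fd, char *buf, ssize_t len);

    void event_loop(int epfd)
    {
        struct epoll_event evs[64];
        char buf[4096];
        ssize_t len;
        int i, n;

        for (;;) {
            n = epoll_wait(epfd, evs, 64, -1);
            for (i = 0; i < n; i++) {
                len = read(evs[i].data.fd, buf, sizeof buf);
                if (len > 0)
                    handle_data(evs[i].data.fd, buf, len);
                /* Level-triggered: any unread data just makes the
                 * fd ready again on the next epoll_wait(), so one
                 * read() per notification is enough. */
            }
        }
    }

As long as every ready fd is serviced like this, only a handful of
events should come back from each epoll_wait(), and the call should
stay cheap.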

Reading the code in eventpoll.c et al., I think some time will be
taken for fds that are transitioning on events you're not interested
in.  Notably, each time a TCP segment is sent and acknowledged by the
other end, poll-waiters are woken: your task wakes and does some work
inside epoll_wait(), but no events are returned to you if you are
only listening for read availability.

I'm not 100% sure of this, but tracing through

    skb->destructor
    -> sock_wfree()
    -> tcp_write_space()
    -> wake_up_interruptible()
    -> ep_poll_callback()

it looks as though _every_ TCP ACK you receive will cause epoll to wake up
a task which is interested in _any_ socket events, but then in

    <context switch>
    ep_poll()
    -> ep_events_transfer()
    -> ep_send_events()

no events are transferred, so ep_poll() will loop and try again.  This
is quite unfortunate if true, as many of the apps which need to scale
write a lot of segments without receiving very much.
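
If I'm reading it right, the losing scenario looks like this from
userspace (a sketch; assume epfd is an epoll fd and sock a connected
TCP socket, with buf/buflen ready to send):

    struct epoll_event ev, evs[64];
    int n;

    ev.events = EPOLLIN;                /* read interest only */
    ev.data.fd = sock;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);

    write(sock, buf, buflen);           /* many outgoing segments */

    /* While we sleep here, each incoming ACK runs
     * tcp_write_space() -> ep_poll_callback() and wakes this
     * task; ep_send_events() then finds nothing matching
     * EPOLLIN, so ep_poll() goes back to sleep - wasted work. */
    n = epoll_wait(epfd, evs, 64, -1);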

> As we start to scale up to production sized fd sets it gets crazy: around
> 8000 completely idle fd's the cost is 4ms per syscall. At this point
> even a high real load (which gathers lots of I/O per call) doesn't cover the
> now very high latency for each trip into the kernel to gather more work.

It should only take 4ms per syscall if it's actually returning ~8000
ready events.  If you're watching 8000 fds but only, say, 10 are
ready, it should be fast.
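
That would be easy to check with a micro-benchmark along these lines
(a sketch, untested; you'll need to raise the fd limit with
ulimit -n before trying thousands of fds):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int nidle = atoi(argv[1]);      /* number of idle fds */
        int epfd = epoll_create(nidle + 1);
        struct epoll_event ev, out[16];
        struct timeval t0, t1;
        int i, n, sv[2];

        ev.events = EPOLLIN;
        for (i = 0; i <= nidle; i++) {
            socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
            ev.data.fd = sv[0];
            epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);
        }
        write(sv[1], "x", 1);           /* only the last fd is ready */

        gettimeofday(&t0, NULL);
        n = epoll_wait(epfd, out, 16, 0);
        gettimeofday(&t1, NULL);

        printf("%d ready, %ld us\n", n,
               (t1.tv_sec - t0.tv_sec) * 1000000L
               + (t1.tv_usec - t0.tv_usec));
        return 0;
    }

If epoll_wait() really is proportional to ready events, the time
should stay flat as nidle grows; if it tracks nidle the way your
numbers suggest, that points at a bug.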

> What was interesting is the response time was non-linear up to around 400-500
> fd's, then went steep and linear after that, so you pay much more (maybe due
> to some cache effects, I didn't pursue) for each connecting client in a light
> load environment.

> This is not web traffic, the clients typically connect and sit mostly idle.

Can you post your code?

(Btw, I don't disagree with POLLRDHUP - I think it's a fine idea, and
I'd use it.  It would be unfortunate if it were only set by some
socket types and not by others, though.  Global search and replace of
POLLHUP with "POLLHUP | POLLRDHUP" in most of the places that set it?
Following that a bit further, we might as well admit that POLLHUP
should really be called POLLWRHUP.)
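
For the record, here's how I'd expect to use it - hypothetical code,
since POLLRDHUP doesn't exist yet, and finish_sending_and_close() is
a made-up helper; assume epfd and a connected fd as before:

    struct epoll_event ev, evs[64];
    int i, n;

    ev.events = EPOLLIN | POLLRDHUP;    /* proposed flag */
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

    n = epoll_wait(epfd, evs, 64, -1);
    for (i = 0; i < n; i++) {
        if (evs[i].events & POLLRDHUP) {
            /* Peer sent FIN: no more data will arrive, but the
             * write side may still work, so flush any pending
             * output before closing. */
            finish_sending_and_close(evs[i].data.fd);
        }
    }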

-- Jamie
