linux-kernel.vger.kernel.org archive mirror
* [Patch][RFC] epoll and half closed TCP connections
@ 2003-07-12 18:16 Eric Varsanyi
  2003-07-12 19:44 ` Jamie Lokier
  2003-07-12 20:01 ` Davide Libenzi
  0 siblings, 2 replies; 58+ messages in thread
From: Eric Varsanyi @ 2003-07-12 18:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: davidel

I'm proposing adding a new POLL event type (POLLRDHUP) as a way to solve
a new race introduced by having an edge-triggered event mechanism
(epoll). The problem occurs when a client writes data and then does a
write-side shutdown(). The server (using epoll) sees only one event
covering both the read-data-ready and the read-EOF conditions, and has
no way to tell that an EOF occurred.

-Eric Varsanyi

Details
-----------
	- remote sends data and does a shutdown
	   [ we see a data bearing packet and FIN from client on the wire ]

	- user mode server gets around to doing accept() and registers
	  for EPOLLIN events (along with HUP and ERR which are forced on)

	- epoll_wait() returns a single EPOLLIN event on the FD representing
	  both the 1/2 shutdown state and data available

At this point there is no way the app can tell that the connection is
half closed, so it may issue a close() back to the client after writing
its results. Normally the server would distinguish these events by
assuming EOF if it got a read-ready indication and the first read
returned 0 bytes, or by issuing read calls until less data was returned
than was asked for.

In a level triggered world this all just works because the read ready
indication is driven back to the app as long as the socket state is half
closed. The event driven epoll mechanism folds these two indications
together and thus loses one 'edge'.

One would be tempted to issue an extra read() after getting back less than
expected, but this is an extra system call on every read event and you get
back the same '0' bytes that you get if the buffer is just empty. The only
sure bet seems to be CTL_MODding the FD to force a re-poll (which would
cost a syscall and hash-lookup in eventpoll for every read event).

The POLLHUP indication is specifically not used in this half-closed
state since it is (by POSIX) not allowed to be masked and, if it were
set, would interfere with EPOLLOUT back to the client.

I considered 2 possible ideas:

	1) have epoll return a count of events represented by a single
	   appearance in the list after an epoll_wait(); this could be
	   used as a hint to make some other (tbd) syscall to find out
	   if the socket was in half closed state and normally not have
	   to pay an extra syscall on every read; this seems like a lot of
	   tap dancing for a workaround

	2) add a new 1/2 closed event type that a poll routine can return

The implementation is trivial, a patch is included below. If this idea sees
favor I'll fix the other architectures, ipv6, epoll.h, and make a 'real'
patch. I do not believe any drivers deserve to be modified to return this
new event.

This should not break existing programs: the read-ready indication is
still returned as before, and an app that does not register for the new
event will not be woken up an extra time or see the extra event
indicated. I dislike extending such an established API, but co-opting
any of the other event types just seems wrong.

Other ideas and comments would be appreciated,
-Eric Varsanyi

diff -Naur linux-2.4.20/include/asm-i386/poll.h linux-2.4.20_ev/include/asm-i386/poll.h
--- linux-2.4.20/include/asm-i386/poll.h	Thu Jan 23 13:01:28 1997
+++ linux-2.4.20_ev/include/asm-i386/poll.h	Sat Jul 12 12:29:11 2003
@@ -15,6 +15,7 @@
 #define POLLWRNORM	0x0100
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
+#define POLLRDHUP	0x0800
 
 struct pollfd {
 	int fd;
diff -Naur linux-2.4.20/net/ipv4/tcp.c linux-2.4.20_ev/net/ipv4/tcp.c
--- linux-2.4.20/net/ipv4/tcp.c	Tue Jul  8 09:40:42 2003
+++ linux-2.4.20_ev/net/ipv4/tcp.c	Sat Jul 12 12:29:56 2003
@@ -424,7 +424,7 @@
 	if (sk->shutdown == SHUTDOWN_MASK || sk->state == TCP_CLOSE)
 		mask |= POLLHUP;
 	if (sk->shutdown & RCV_SHUTDOWN)
-		mask |= POLLIN | POLLRDNORM;
+		mask |= POLLIN | POLLRDNORM | POLLRDHUP;
 
 	/* Connected? */
 	if ((1 << sk->state) & ~(TCPF_SYN_SENT|TCPF_SYN_RECV)) {

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 18:16 [Patch][RFC] epoll and half closed TCP connections Eric Varsanyi
@ 2003-07-12 19:44 ` Jamie Lokier
  2003-07-12 20:51   ` Eric Varsanyi
  2003-07-12 20:01 ` Davide Libenzi
  1 sibling, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-12 19:44 UTC (permalink / raw)
  To: Eric Varsanyi; +Cc: linux-kernel, davidel

Eric Varsanyi wrote:
> 	- epoll_wait() returns a single EPOLLIN event on the FD representing
> 	  both the 1/2 shutdown state and data available

Correct.

> At this point there is no way the app can tell if there is a half closed
> connection so it may issue a close() back to the client after writing
> results. Normally the server would distinguish these events by assuming
> EOF if it got a read ready indication and the first read returned 0 bytes,
> or would issue read calls until less data was returned than was asked for.
> 
> In a level triggered world this all just works because the read ready
> indication is driven back to the app as long as the socket state is half
> closed. The event driven epoll mechanism folds these two indications
> together and thus loses one 'edge'.

Well then, use epoll's level-triggered mode.  It's quite easy - it's
the default now. :)

If there's an EOF condition pending after you called read(), and then
you call epoll_wait(), you _should_ see another EPOLLIN condition
immediately.

If you aren't seeing epoll_wait() return with EPOLLIN when there's an
EOF pending, *and* you haven't set EPOLLET in the event flags, that's
a bug in epoll.  Is that what you're seeing?

-- Jamie


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 18:16 [Patch][RFC] epoll and half closed TCP connections Eric Varsanyi
  2003-07-12 19:44 ` Jamie Lokier
@ 2003-07-12 20:01 ` Davide Libenzi
  2003-07-13  5:24   ` David S. Miller
  1 sibling, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-12 20:01 UTC (permalink / raw)
  To: Eric Varsanyi; +Cc: David Miller, Linux Kernel Mailing List


[Cc:ing DaveM ]


On Sat, 12 Jul 2003, Eric Varsanyi wrote:

> I'm proposing adding a new POLL event type (POLLRDHUP) as way to solve
> a new race introduced by having an edge triggered event mechanism
> (epoll). The problem occurs when a client writes data and then does a
> write side shutdown(). The server (using epoll) sees only one event for
> the read data ready and the read EOF condition and has no way to tell
> that an EOF occurred.
>
> -Eric Varsanyi
>
> Details
> -----------
> 	- remote sends data and does a shutdown
> 	   [ we see a data bearing packet and FIN from client on the wire ]
>
> 	- user mode server gets around to doing accept() and registers
> 	  for EPOLLIN events (along with HUP and ERR which are forced on)
>
> 	- epoll_wait() returns a single EPOLLIN event on the FD representing
> 	  both the 1/2 shutdown state and data available
>
> At this point there is no way the app can tell if there is a half closed
> connection so it may issue a close() back to the client after writing
> results. Normally the server would distinguish these events by assuming
> EOF if it got a read ready indication and the first read returned 0 bytes,
> or would issue read calls until less data was returned than was asked for.
>
> In a level triggered world this all just works because the read ready
> indication is driven back to the app as long as the socket state is half
> closed. The event driven epoll mechanism folds these two indications
> together and thus loses one 'edge'.
>
> One would be tempted to issue an extra read() after getting back less than
> expected, but this is an extra system call on every read event and you get
> back the same '0' bytes that you get if the buffer is just empty. The only
> sure bet seems to be CTL_MODding the FD to force a re-poll (which would
> cost a syscall and hash-lookup in eventpoll for every read event).
>

Yes, this is overhead that should be avoided. It is true that you could
use level-triggered events, but if you structured your app around edges
you should be able to solve this without the overhead.



> 	2) add a new 1/2 closed event type that a poll routine can return
>
> The implementation is trivial, a patch is included below. If this idea sees
> favor I'll fix the other architectures, ipv6, epoll.h, and make a 'real'
> patch. I do not believe any drivers deserve to be modified to return this
> new event.

This looks good to me. David, what do you think?



> diff -Naur linux-2.4.20/include/asm-i386/poll.h linux-2.4.20_ev/include/asm-i386/poll.h
> --- linux-2.4.20/include/asm-i386/poll.h	Thu Jan 23 13:01:28 1997
> +++ linux-2.4.20_ev/include/asm-i386/poll.h	Sat Jul 12 12:29:11 2003
> @@ -15,6 +15,7 @@
>  #define POLLWRNORM	0x0100
>  #define POLLWRBAND	0x0200
>  #define POLLMSG		0x0400
> +#define POLLRDHUP	0x0800
>
>  struct pollfd {
>  	int fd;
> diff -Naur linux-2.4.20/net/ipv4/tcp.c linux-2.4.20_ev/net/ipv4/tcp.c
> --- linux-2.4.20/net/ipv4/tcp.c	Tue Jul  8 09:40:42 2003
> +++ linux-2.4.20_ev/net/ipv4/tcp.c	Sat Jul 12 12:29:56 2003
> @@ -424,7 +424,7 @@
>  	if (sk->shutdown == SHUTDOWN_MASK || sk->state == TCP_CLOSE)
>  		mask |= POLLHUP;
>  	if (sk->shutdown & RCV_SHUTDOWN)
> -		mask |= POLLIN | POLLRDNORM;
> +		mask |= POLLIN | POLLRDNORM | POLLRDHUP;
>
>  	/* Connected? */
>  	if ((1 << sk->state) & ~(TCPF_SYN_SENT|TCPF_SYN_RECV)) {
>



- Davide



* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 20:51   ` Eric Varsanyi
@ 2003-07-12 20:48     ` Davide Libenzi
  2003-07-12 21:19       ` Eric Varsanyi
  2003-07-13 20:32       ` David Schwartz
  2003-07-13 13:12     ` Jamie Lokier
  1 sibling, 2 replies; 58+ messages in thread
From: Davide Libenzi @ 2003-07-12 20:48 UTC (permalink / raw)
  To: Eric Varsanyi; +Cc: Linux Kernel Mailing List, Davide Libenzi

On Sat, 12 Jul 2003, Eric Varsanyi wrote:

> > > In a level triggered world this all just works because the read ready
> > > indication is driven back to the app as long as the socket state is half
> > > closed. The event driven epoll mechanism folds these two indications
> > > together and thus loses one 'edge'.
> >
> > Well then, use epoll's level-triggered mode.  It's quite easy - it's
> > the default now. :)
>
> The problem with all the level triggered schemes (poll, select, epoll w/o
> EPOLLET) is that they call every driver and poll status for every call into
> the kernel. This appeared to be killing my app's performance and I verified
> by writing some simple micro benchmarks.

Look, this is false for epoll. Given N fds inside the set and M
hot/ready fds, epoll scales O(M) and not O(N) (like poll/select).
There's a huge difference, especially with real loads.



- Davide



* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 19:44 ` Jamie Lokier
@ 2003-07-12 20:51   ` Eric Varsanyi
  2003-07-12 20:48     ` Davide Libenzi
  2003-07-13 13:12     ` Jamie Lokier
  0 siblings, 2 replies; 58+ messages in thread
From: Eric Varsanyi @ 2003-07-12 20:51 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Eric Varsanyi, linux-kernel, davidel

> > In a level triggered world this all just works because the read ready
> > indication is driven back to the app as long as the socket state is half
> > closed. The event driven epoll mechanism folds these two indications
> > together and thus loses one 'edge'.
> 
> Well then, use epoll's level-triggered mode.  It's quite easy - it's
> the default now. :)

The problem with all the level-triggered schemes (poll, select, epoll
w/o EPOLLET) is that they call every driver and poll its status on
every call into the kernel. This appeared to be killing my app's
performance, and I verified it by writing some simple micro-benchmarks.

I can post the details and code if anyone is interested. The summary is
that on 1 idle FD, poll, select, and epoll all take about 900ns on a
2.8GHz P4 (around the overhead of any syscall). With 10 idle FDs (pipes
or sockets) the overhead is around 2.5usec; at 400 FDs (a light load
for this app) we're up to 80usec per call, and the app is spending
almost 100% of one CPU in system mode with even a light tickling of I/O
activity on a few of the fds.

As we start to scale up to production-sized fd sets it gets crazy: at
around 8000 completely idle fds the cost is 4ms per syscall. At this
point even a high real load (which gathers lots of I/O per call)
doesn't cover the now very high latency of each trip into the kernel to
gather more work.

What was interesting is that the response time was non-linear up to
around 400-500 fds, then went steep and linear after that, so you pay
much more (maybe due to some cache effects; I didn't pursue it) for
each connecting client in a light-load environment.

This is not web traffic, the clients typically connect and sit mostly idle.

> If there's an EOF condition pending after you called read(), and then
> you call epoll_wait(), you _should_ see another EPOLLIN condition
> immediately.
> 
> If you aren't seeing epoll_wait() return with EPOLLIN when there's an
> EOF pending, *and* you haven't set EPOLLET in the event flags, that's
> a bug in epoll.  Is that what you're seeing?

No, I am not asserting there is a problem with epoll in level triggered
mode (I've only used poll and select in level triggered mode, so I can't
say for sure it works either).

-Eric Varsanyi


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 20:48     ` Davide Libenzi
@ 2003-07-12 21:19       ` Eric Varsanyi
  2003-07-12 21:20         ` Davide Libenzi
  2003-07-12 21:41         ` Davide Libenzi
  2003-07-13 20:32       ` David Schwartz
  1 sibling, 2 replies; 58+ messages in thread
From: Eric Varsanyi @ 2003-07-12 21:19 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Eric Varsanyi, Linux Kernel Mailing List

> > The problem with all the level triggered schemes (poll, select, epoll w/o
> > EPOLLET) is that they call every driver and poll status for every call into
> > the kernel. This appeared to be killing my app's performance and I verified
> > by writing some simple micro benchmarks.
> 
> Look this is false for epoll. Given N fds inside the set and M hot/ready
> fds, epoll scale O(M) and not O(N) (like poll/select). There's a huge
> difference, expecially with real loads.

Apologies, I did not benchmark epoll level-triggered, just select. The
man page claimed epoll in level-triggered mode was just a better
interface, so I assumed it had to call each driver to check status.

Reading through it, I see (I think) the clever trick of just re-polling
things that have already triggered (basically polling just for the
trailing edge after having seen a leading edge asynchronously), cool!

If it seems unpopular/unwise to add the extra poll event to show read
EOF, using this level-triggered mode would likely do the job for my app
(the extra polls every time for unconsumed events will be nothing
compared to calling every fd's poll every time).

I guess my only argument would be that edge-triggered mode isn't really
workable with TCP connections if there's no way to resolve the
ambiguity between EOF and no data in the buffer (at least w/o an extra
syscall). I just realized that the race you mention in the man page
(reading data from the 'next' event that hasn't been polled into user
mode yet) leads to the same issue: how do you know whether you got this
event because you consumed the data on the previous interrupt, or
because it is an EOF condition?

-Eric Varsanyi


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 21:19       ` Eric Varsanyi
@ 2003-07-12 21:20         ` Davide Libenzi
  2003-07-12 21:41         ` Davide Libenzi
  1 sibling, 0 replies; 58+ messages in thread
From: Davide Libenzi @ 2003-07-12 21:20 UTC (permalink / raw)
  To: Eric Varsanyi; +Cc: Linux Kernel Mailing List

On Sat, 12 Jul 2003, Eric Varsanyi wrote:

> > > The problem with all the level triggered schemes (poll, select, epoll w/o
> > > EPOLLET) is that they call every driver and poll status for every call into
> > > the kernel. This appeared to be killing my app's performance and I verified
> > > by writing some simple micro benchmarks.
> >
> > Look this is false for epoll. Given N fds inside the set and M hot/ready
> > fds, epoll scale O(M) and not O(N) (like poll/select). There's a huge
> > difference, expecially with real loads.
>
> Apologies, I did not benchmark epoll level triggered, just select.
> The man page claimed epoll in level triggered mode was just a better
> interface so I assumed it had to call each driver to check status.

It is neither better nor worse. For sure it is closer to what existing
apps do, and it is easier to understand.



> Reading thru it I see (I think) the clever trick of just repolling things
> that have already triggered (basically polling just for the trailing
> edge after having seen a leading edge async), cool!
>
> If it seems unpopular/unwise to add the extra poll event to show read EOF
> using this level triggered mode would likely do the job for my app (the
> extra polls every time for un-consumed events will be nothing compared to
> calling every fd's poll every time).

Even if it is true that you can drop an extra read(), having the RDHUP
event will cost exactly zero extra CPU cycles inside the kernel, and
one changed line of code (plus arch definitions in asm/poll.h). To me,
it looks acceptable. Let's see what DaveM thinks about it.



- Davide



* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 21:19       ` Eric Varsanyi
  2003-07-12 21:20         ` Davide Libenzi
@ 2003-07-12 21:41         ` Davide Libenzi
  2003-07-12 23:11           ` Eric Varsanyi
  1 sibling, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-12 21:41 UTC (permalink / raw)
  To: Eric Varsanyi; +Cc: Linux Kernel Mailing List

On Sat, 12 Jul 2003, Eric Varsanyi wrote:

> I guess my only argument would be that edge triggered mode isn't really
> workable with TCP connections if there's no way to solve the ambiguity
> between EOF and no data in buffer (at least w/o an extra syscall). I just
> realized that the race you mention in the man page (reading data from
> the 'next' event that hasn't been polled into user mode yet) will lead to
> the same issue: how do you know if you got this event because you consumed
> the data on the previous interrupt or if this is an EOF condition.

(Sorry, I missed this)
You can work that out very easily. When your read/write returns fewer
bytes than requested, it means it is time to stop processing this fd.
If events happened meanwhile, you will get them at the next
epoll_wait(); if not, you will get them the next time they happen.
There's no blind spot if you follow this simple rule, and you do not
even pay the extra syscall that returns EAGAIN.



- Davide



* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 21:41         ` Davide Libenzi
@ 2003-07-12 23:11           ` Eric Varsanyi
  2003-07-12 23:55             ` Davide Libenzi
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Varsanyi @ 2003-07-12 23:11 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Eric Varsanyi, Linux Kernel Mailing List

> > I guess my only argument would be that edge triggered mode isn't really
> > workable with TCP connections if there's no way to solve the ambiguity
> > between EOF and no data in buffer (at least w/o an extra syscall). I just
> > realized that the race you mention in the man page (reading data from
> > the 'next' event that hasn't been polled into user mode yet) will lead to
> > the same issue: how do you know if you got this event because you consumed
> > the data on the previous interrupt or if this is an EOF condition.
> 
> (Sorry, I missed this)
> You can work that out very easily. When your read/write returns a lower
> number of bytes, it means that it is time to stop processing this fd. If
> events happened meanwhile, you will get them at the next epoll_wait(). If
> not, the next time they'll happen. There's no blind spot if you follow
> this simple rule, and you do not even have the extra syscall with EAGAIN.

The scenario that I think is still uncovered (edge trigger only):

User					Kernel
--------				----------
					Read data added to socket

					Socket posts read event to epfd

epoll_wait()				Event cleared from epfd, EPOLLIN
					  returned to user

					more read data added to socket

					Socket posts a new read event to epfd

read() until short read with EAGAIN 	all data read from socket

epoll_wait()				returns another EPOLLIN for socket and
					  clears it from epfd

read(), returns 0 right away		socket buffer is empty

This is your 'false positive' case in the epoll(4) man page.

How does the app tell the 0 read here from a read EOF coming from the remote?

If it assumes this is a false positive and there was also an EOF
indication, the EOF will be lost; if it assumes it an EOF the connection
will be prematurely terminated.

-Eric


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 23:11           ` Eric Varsanyi
@ 2003-07-12 23:55             ` Davide Libenzi
  2003-07-13  1:05               ` Eric Varsanyi
  0 siblings, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-12 23:55 UTC (permalink / raw)
  To: Eric Varsanyi; +Cc: Linux Kernel Mailing List

On Sat, 12 Jul 2003, Eric Varsanyi wrote:

> > (Sorry, I missed this)
> > You can work that out very easily. When your read/write returns a lower
> > number of bytes, it means that it is time to stop processing this fd. If
> > events happened meanwhile, you will get them at the next epoll_wait(). If
> > not, the next time they'll happen. There's no blind spot if you follow
> > this simple rule, and you do not even have the extra syscall with EAGAIN.
>
> The scenario that I think is still uncovered (edge trigger only):
>
> User					Kernel
> --------				----------
> 					Read data added to socket
>
> 					Socket posts read event to epfd
>
> epoll_wait()				Event cleared from epfd, EPOLLIN
> 					  returned to user
>
> 					more read data added to socket
>
> 					Socket posts a new read event to epfd
>
> read() until short read with EAGAIN 	all data read from socket
>
> epoll_wait()				returns another EPOLLIN for socket and
> 					  clears it from epfd
>
> read(), returns 0 right away		socket buffer is empty

read will return -1 with errno=EAGAIN in that case, not zero.



- Davide



* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 23:55             ` Davide Libenzi
@ 2003-07-13  1:05               ` Eric Varsanyi
  0 siblings, 0 replies; 58+ messages in thread
From: Eric Varsanyi @ 2003-07-13  1:05 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Eric Varsanyi, Linux Kernel Mailing List

> > read(), returns 0 right away		socket buffer is empty
> 
> read will return -1 with errno=EAGAIN in that case, not zero.

Yes, my mistake. So the real issue (of the patch) is just the original
thing I posted about: you can't tell, without another read() syscall,
whether an EOF has piggybacked in on an EPOLLIN event.

Thanks for being patient.

-Eric Varsanyi


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 20:01 ` Davide Libenzi
@ 2003-07-13  5:24   ` David S. Miller
  2003-07-13 14:07     ` Jamie Lokier
  2003-07-14 17:09     ` [Patch][RFC] epoll and half closed TCP connections kuznet
  0 siblings, 2 replies; 58+ messages in thread
From: David S. Miller @ 2003-07-13  5:24 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: e0206, linux-kernel, kuznet

On Sat, 12 Jul 2003 13:01:21 -0700 (PDT)
Davide Libenzi <davidel@xmailserver.org> wrote:

> 
> [Cc:ing DaveM ]

[Cc:ing Alexey :-) ]

Alexey, they seem to want to add some kind of POLLRDHUP thing,
comments wrt. TCP and elsewhere in the networking?  See below...

> On Sat, 12 Jul 2003, Eric Varsanyi wrote:
> 
> > I'm proposing adding a new POLL event type (POLLRDHUP) as way to solve
> > a new race introduced by having an edge triggered event mechanism
> > (epoll). The problem occurs when a client writes data and then does a
> > write side shutdown(). The server (using epoll) sees only one event for
> > the read data ready and the read EOF condition and has no way to tell
> > that an EOF occurred.
> >
> > -Eric Varsanyi
> >
> > Details
> > -----------
> > 	- remote sends data and does a shutdown
> > 	   [ we see a data bearing packet and FIN from client on the wire ]
> >
> > 	- user mode server gets around to doing accept() and registers
> > 	  for EPOLLIN events (along with HUP and ERR which are forced on)
> >
> > 	- epoll_wait() returns a single EPOLLIN event on the FD representing
> > 	  both the 1/2 shutdown state and data available
> >
> > At this point there is no way the app can tell if there is a half closed
> > connection so it may issue a close() back to the client after writing
> > results. Normally the server would distinguish these events by assuming
> > EOF if it got a read ready indication and the first read returned 0 bytes,
> > or would issue read calls until less data was returned than was asked for.
> >
> > In a level triggered world this all just works because the read ready
> > indication is driven back to the app as long as the socket state is half
> > closed. The event driven epoll mechanism folds these two indications
> > together and thus loses one 'edge'.
> >
> > One would be tempted to issue an extra read() after getting back less than
> > expected, but this is an extra system call on every read event and you get
> > back the same '0' bytes that you get if the buffer is just empty. The only
> > sure bet seems to be CTL_MODding the FD to force a re-poll (which would
> > cost a syscall and hash-lookup in eventpoll for every read event).
> >
> 
> Yes, this is overhead that should be avoided. It is true that you could
> use Level Triggered events, but if you structured your app on edge you
> should be able to solve this w/out overhead.
> 
> 
> 
> > 	2) add a new 1/2 closed event type that a poll routine can return
> >
> > The implementation is trivial, a patch is included below. If this idea sees
> > favor I'll fix the other architectures, ipv6, epoll.h, and make a 'real'
> > patch. I do not believe any drivers deserve to be modified to return this
> > new event.
> 
> This looks good to me. David what do you think ?
> 
> 
> 
> > diff -Naur linux-2.4.20/include/asm-i386/poll.h linux-2.4.20_ev/include/asm-i386/poll.h
> > --- linux-2.4.20/include/asm-i386/poll.h	Thu Jan 23 13:01:28 1997
> > +++ linux-2.4.20_ev/include/asm-i386/poll.h	Sat Jul 12 12:29:11 2003
> > @@ -15,6 +15,7 @@
> >  #define POLLWRNORM	0x0100
> >  #define POLLWRBAND	0x0200
> >  #define POLLMSG		0x0400
> > +#define POLLRDHUP	0x0800
> >
> >  struct pollfd {
> >  	int fd;
> > diff -Naur linux-2.4.20/net/ipv4/tcp.c linux-2.4.20_ev/net/ipv4/tcp.c
> > --- linux-2.4.20/net/ipv4/tcp.c	Tue Jul  8 09:40:42 2003
> > +++ linux-2.4.20_ev/net/ipv4/tcp.c	Sat Jul 12 12:29:56 2003
> > @@ -424,7 +424,7 @@
> >  	if (sk->shutdown == SHUTDOWN_MASK || sk->state == TCP_CLOSE)
> >  		mask |= POLLHUP;
> >  	if (sk->shutdown & RCV_SHUTDOWN)
> > -		mask |= POLLIN | POLLRDNORM;
> > +		mask |= POLLIN | POLLRDNORM | POLLRDHUP;
> >
> >  	/* Connected? */
> >  	if ((1 << sk->state) & ~(TCPF_SYN_SENT|TCPF_SYN_RECV)) {
> >
> 
> 
> 
> - Davide


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 20:51   ` Eric Varsanyi
  2003-07-12 20:48     ` Davide Libenzi
@ 2003-07-13 13:12     ` Jamie Lokier
  2003-07-13 16:55       ` Davide Libenzi
  1 sibling, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-13 13:12 UTC (permalink / raw)
  To: Eric Varsanyi; +Cc: linux-kernel, davidel

Eric Varsanyi wrote:
> > Well then, use epoll's level-triggered mode.  It's quite easy - it's
> > the default now. :)
> 
> The problem with all the level triggered schemes (poll, select, epoll w/o
> EPOLLET) is that they call every driver and poll status for every call into
> the kernel. This appeared to be killing my app's performance and I verified
> by writing some simple micro benchmarks.

OH! :-O

Level-triggered epoll_wait() time _should_ be scalable - proportional
to the number of ready events, not the number of listening events.  If
this is not the case then it's a bug in epoll.

In principle, you will see a large delay only if you don't handle
those events (e.g. by calling read() on each ready fd), so that they
are still ready.

Reading the code in eventpoll.c et al, I think that some time will
be taken for fds that are transitioning on events which you're not
interested in.  Notably, each time a TCP segment is sent and
acknowledged by the other end, poll-waiters are woken, your task will
be woken and do some work in epoll_wait(), but no events are returned
if you are only listening for read availability.

I'm not 100% sure of this, but tracing through

    skb->destructor
    -> sock_wfree()
    -> tcp_write_space()
    -> wake_up_interruptible()
    -> ep_poll_callback()

it looks as though _every_ TCP ACK you receive will cause epoll to wake up
a task which is interested in _any_ socket events, but then in

    <context switch>
    ep_poll()
    -> ep_events_transfer()
    -> ep_send_events()

no events are transferred, so ep_poll() will loop and try again.  This
is quite unfortunate if true, as many of the apps which need to scale
write a lot of segments without receiving very much.

> As we start to scale up to production sized fd sets it gets crazy: around
> 8000 completely idle fd's the cost is 4ms per syscall. At this point
> even a high real load (which gathers lots of I/O per call) doesn't cover the
> now very high latency for each trip into the kernel to gather more work.

It should only be 4ms per syscall if it's actually returning ~8000
ready events.  If you're listening to 8000 but only, say, 10 are
ready, it should be fast.

> What was interesting is the response time was non-linear up to around 400-500
> fd's, then went steep and linear after that, so you pay much more (maybe due
> to some cache effects, I didn't pursue) for each connecting client in a light
> load environment.

> This is not web traffic, the clients typically connect and sit mostly idle.

Can you post your code?

(Btw, I don't disagree with POLLRDHUP - I think it's a fine idea.  I'd
use it.  It'd be unfortunate if it only worked with some socket types
and was not set by others, though.  Global search and replace POLLHUP
with "POLLHUP | POLLRDHUP" in most setters?  Following that a bit
further, we might as well admit that POLLHUP should be called
POLLWRHUP.)

-- Jamie


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13  5:24   ` David S. Miller
@ 2003-07-13 14:07     ` Jamie Lokier
  2003-07-13 17:00       ` Davide Libenzi
  2003-07-14 17:09     ` [Patch][RFC] epoll and half closed TCP connections kuznet
  1 sibling, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-13 14:07 UTC (permalink / raw)
  To: David S. Miller; +Cc: Davide Libenzi, e0206, linux-kernel, kuznet

David S. Miller wrote:
> Alexey, they seem to want to add some kind of POLLRDHUP thing,
> comments wrt. TCP and elsewhere in the networking?  See below...

POLLHUP is a mess.  It means different things according to the type of
fd, precisely because it is considered an unmaskable event for the
poll() API so the standard meaning isn't useful for sockets.  (See the
comments in tcp_poll()).

POLLRDHUP makes sense because it could actually have a well-defined
meaning: set iff reading the fd would return EOF.

However, if a program is waiting on POLLRDHUP, you don't want the
program to have to say "if this fd is a TCP socket then listen for
POLLRDHUP else if this fd is another kind of socket call read to
detect EOF else listen for POLLHUP".  Programs have enough
version-specific special cases as it is.

So I suggest:

  - Everywhere that POLLHUP is currently set in a driver, socket etc.
    it should set POLLRDHUP|POLLHUP - unless it specifically knows
    about POLLRDHUP as in TCP (and presumably UDP, SCTP etc).

-- Jamie

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 13:12     ` Jamie Lokier
@ 2003-07-13 16:55       ` Davide Libenzi
  0 siblings, 0 replies; 58+ messages in thread
From: Davide Libenzi @ 2003-07-13 16:55 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Eric Varsanyi, Linux Kernel Mailing List

On Sun, 13 Jul 2003, Jamie Lokier wrote:

> Eric Varsanyi wrote:
> > > Well then, use epoll's level-triggered mode.  It's quite easy - it's
> > > the default now. :)
> >
> > The problem with all the level triggered schemes (poll, select, epoll w/o
> > EPOLLET) is that they call every driver and poll status for every call into
> > the kernel. This appeared to be killing my app's performance and I verified
> > by writing some simple micro benchmarks.
>
> OH! :-O
>
> Level-triggered epoll_wait() time _should_ be scalable - proportional
> to the number of ready events, not the number of listening events.  If
> this is not the case then it's a bug in epoll.

Jamie, he is talking about select here.


> Reading the code in eventpoll.c et al, I think that some time will
> be taken for fds that are transitioning on events which you're not
> interested in.  Notably, each time a TCP segment is sent and
> acknowledged by the other end, poll-waiters are woken, your task will
> be woken and do some work in epoll_wait(), but no events are returned
> if you are only listening for read availability.
>
> I'm not 100% sure of this, but tracing through
>
>     skb->destructor
>     -> sock_wfree()
>     -> tcp_write_space()
>     -> wake_up_interruptible()
>     -> ep_poll_callback()
>
> it looks as though _every_ TCP ACK you receive will cause epoll to wake up
> a task which is interested in _any_ socket events, but then in
>
>     <context switch>
>     ep_poll()
>     -> ep_events_transfer()
>     -> ep_send_events()
>
> no events are transferred, so ep_poll() will loop and try again.  This
> is quite unfortunate if true, as many of the apps which need to scale
> write a lot of segments without receiving very much.

That's true, it is the beauty of the poll hook ;) I said this a long time
ago. It is addressable by a wake_up_mask() and some code all around. I did
not see (nor did others) any performance impact because of this,
with throughput that remained steadily flat under any ratio of hot/cold fds.
Since it is easily addressable and will not require an API change, I'd
rather wait for someone to report a real (or even unreal) load that makes
epoll not flat-scale.



- Davide


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 14:07     ` Jamie Lokier
@ 2003-07-13 17:00       ` Davide Libenzi
  2003-07-13 19:15         ` Jamie Lokier
  0 siblings, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-13 17:00 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: David S. Miller, e0206, Linux Kernel Mailing List, kuznet

On Sun, 13 Jul 2003, Jamie Lokier wrote:

> David S. Miller wrote:
> > Alexey, they seem to want to add some kind of POLLRDHUP thing,
> > comments wrt. TCP and elsewhere in the networking?  See below...
>
> POLLHUP is a mess.  It means different things according to the type of
> fd, precisely because it is considered an unmaskable event for the
> poll() API so the standard meaning isn't useful for sockets.  (See the
> comments in tcp_poll()).
>
> POLLRDHUP makes sense because it could actually have a well-defined
> meaning: set iff reading the fd would return EOF.
>
> However, if a program is waiting on POLLRDHUP, you don't want the
> program to have to say "if this fd is a TCP socket then listen for
> POLLRDHUP else if this fd is another kind of socket call read to
> detect EOF else listen for POLLHUP".  Programs have enough
> version-specific special cases as it is.
>
> So I suggest:
>
>   - Everywhere that POLLHUP is currently set in a driver, socket etc.
>     it should set POLLRDHUP|POLLHUP - unless it specifically knows
>     about POLLRDHUP as in TCP (and presumably UDP, SCTP etc).

Returning POLLHUP to a caller waiting for POLLIN might break existing code
IMHO. After people reporting the O_RDONLY|O_TRUNC case I'm inclined to expect
everything from existing apps ;) POLLHUP should be returned to apps
waiting for POLLOUT, while POLLRDHUP should go to ones waiting for POLLIN.



- Davide


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 17:00       ` Davide Libenzi
@ 2003-07-13 19:15         ` Jamie Lokier
  2003-07-13 23:03           ` Davide Libenzi
  0 siblings, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-13 19:15 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: David S. Miller, e0206, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> > However, if a program is waiting on POLLRDHUP, you don't want the
> > program to have to say "if this fd is a TCP socket then listen for
> > POLLRDHUP else if this fd is another kind of socket call read to
> > detect EOF else listen for POLLHUP".  Programs have enough
> > version-specific special cases as it is.
> >
> >   - Everywhere that POLLHUP is currently set in a driver, socket etc.
> >     it should set POLLRDHUP|POLLHUP - unless it specifically knows
> >     about POLLRDHUP as in TCP (and presumably UDP, SCTP etc).
> 
> Returning POLLHUP to a caller waiting for POLLIN might break existing code
> IMHO.

Oh, agreed.  I was(*) suggesting to add POLL_RDHUP to the set of
events reported by e.g. AF_UNIX sockets which are half-closed.  Not to
add POLLHUP to anything!  (Particularly as POLLHUP can't be ignored).

> After people reporting the O_RDONLY|O_TRUNC case I'm inclined to expect
> everything from existing apps ;) POLLHUP should be returned to apps
> waiting for POLLOUT, while POLLRDHUP should go to ones waiting for POLLIN.

Not sure exactly what you mean by that last sentence.

At present, it's impossible for socket code to return POLLHUP only to
apps which are waiting on POLLOUT - because POLLHUP is not maskable in
sys_poll's API.  Therefore sockets return POLLHUP only if they are
closed in both directions.

There is no way for a socket to return a HUP condition for someone who
is waiting only to write, but fortunately that doesn't matter :)

Back to the (*), (see above):

(*) There aren't that many places which set POLLHUP; they divide into:
sockets, ttys, SCSI-generic and PPP.  The latter two are not important
as they don't do half-close.

   __The critical thing with POLL_RDHUP is that it is set if read() would
   return EOF after returning data.__

   If this condition isn't met, then an app which is using POLL_RDHUP to
   optimise performance with epoll will hang occasionally.

Sockets are important: TCP is not the only thing to support
half-closing.  If an app is waiting for POLLRDHUP, and it doesn't know
what kind of socket it has been given (e.g. AF_UNIX), the network
stack had better return POLL_RDHUP when there's an EOF pending.

So we'd better add POLLRDHUP to all the socket types which do
half-closing.  For the rest, no change is required as POLLHUP is
non-maskable :)  (So apps should always say "if (events &
(POLLHUP|POLLRDHUP)) check_for_eof()").

And ttys?  They are problematic, because ttys can return EOF _after_
returning data without closing (and without being hung-up).  An epoll
loop which is reading a tty (and isn't programmed specially for a tty)
_must_ receive POLLRDHUP when the EOF is pending, else it may hang.

In other words, POLLRDHUP is the wrong name: the correct name is
POLLRDEOF.

-- Jamie

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-12 20:48     ` Davide Libenzi
  2003-07-12 21:19       ` Eric Varsanyi
@ 2003-07-13 20:32       ` David Schwartz
  2003-07-13 21:10         ` Jamie Lokier
  2003-07-13 21:14         ` Davide Libenzi
  1 sibling, 2 replies; 58+ messages in thread
From: David Schwartz @ 2003-07-13 20:32 UTC (permalink / raw)
  To: Davide Libenzi, Eric Varsanyi; +Cc: Linux Kernel Mailing List


> Look this is false for epoll. Given N fds inside the set and M hot/ready
> fds, epoll scales O(M) and not O(N) (like poll/select). There's a huge
> difference, especially with real loads.
>
> - Davide

	For most real-world loads, M is some fraction of N. The fraction
asymptotically approaches 1 as load increases because under load it takes
you longer to get back to polling, so a higher fraction of the descriptors
will be ready when you do.

	Even if you argue that most real-world loads consist of a few very busy
file descriptors and a lot of idle file descriptors, why would you think
that this ratio changes as the number of connections increases? Say a group
of two servers is handling a bunch of connections. Some of those connections
will be very active and some will be very idle. But surely the *percentage*
of active connections won't change just because the connections are split
over the servers 50/50 rather than 10/90.

	If a particular protocol and usage sees 10 idle connections for every
active one, then N will be ten times M, and O(M) will be the same as O(N).
It's only if a higher percentage of connections are idle when there are more
connections (which seems an extreme rarity to me) that O(M) is better than
O(N).

	Is there any actual evidence to suggest that epoll scales better than poll
for "real loads"? Tests with increasing numbers of idle file descriptors as
the active count stays constant are not real loads.

	By the way, I'm not arguing against epoll. I believe it will use less
resources than poll in pretty much every conceivable situation. I simply
take issue with the argument that it has better ultimate scalability or
scales at a different order.

	DS



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 20:32       ` David Schwartz
@ 2003-07-13 21:10         ` Jamie Lokier
  2003-07-13 23:05           ` David Schwartz
  2003-07-13 21:14         ` Davide Libenzi
  1 sibling, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-13 21:10 UTC (permalink / raw)
  To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List

David Schwartz wrote:
> 	For most real-world loads, M is some fraction of N. The fraction
> asymptotically approaches 1 as load increases because under load it takes
> you longer to get back to polling, so a higher fraction of the descriptors
> will be ready when you do.

Ah, but as the fraction approaches 1, you'll find that you are
asymptotically approaching the point where you can't handle the load
_regardless_ of epoll overhead.

> 	By the way, I'm not arguing against epoll. I believe it will use less
> resources than poll in pretty much every conceivable situation. I simply
> take issue with the argument that it has better ultimate scalability or
> scales at a different order.

It scales according to the amount of work pending, which means that it
doesn't take any _more_ time than actually doing the pending work.
(This assumes you use epoll appropriately; there are many ways to use
epoll which don't have this property).

That was always the complaint about select() and poll(): they dominate
the run time for large numbers of connections.  epoll, on the other
hand, will always be in the noise relative to other work.

If you want a formula for slides :), time_polling/time_working is O(1)
with epoll, but O(N) with poll() & select().

-- Jamie

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 20:32       ` David Schwartz
  2003-07-13 21:10         ` Jamie Lokier
@ 2003-07-13 21:14         ` Davide Libenzi
  2003-07-13 23:05           ` David Schwartz
  1 sibling, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-13 21:14 UTC (permalink / raw)
  To: David Schwartz; +Cc: Eric Varsanyi, Linux Kernel Mailing List

On Sun, 13 Jul 2003, David Schwartz wrote:

> 	For most real-world loads, M is some fraction of N. The fraction
> asymptotically approaches 1 as load increases because under load it takes
> you longer to get back to polling, so a higher fraction of the descriptors
> will be ready when you do.
>
> 	Even if you argue that most real-world loads consist of a few very busy
> file descriptors and a lot of idle file descriptors, why would you think
> that this ratio changes as the number of connections increases? Say a group
> of two servers is handling a bunch of connections. Some of those connections
> will be very active and some will be very idle. But surely the *percentage*
> of active connections won't change just because the connections are split
> over the servers 50/50 rather than 10/90.
>
> 	If a particular protocol and usage sees 10 idle connections for every
> active one, then N will be ten times M, and O(M) will be the same as O(N).
> It's only if a higher percentage of connections are idle when there are more
> connections (which seems an extreme rarity to me) that O(M) is better than
> O(N).

Apologies, I abused O(*) (hopefully none of my math profs are on lkml :).
Yes, M/N has little/no fluctuation in the N domain. So, using O(*)
correctly, they both scale O(N). But we can trivially say that if we call
CP the cost of poll() in CPU cycles, and CE the cost of epoll :

CP(N, M) = KP * N
CE(N, M) = KE * M

Where KP and KE are constants that depend on the code architecture of the
two systems. If we fix KA (the active coefficient) :

KA = M / N

we can write the scalability coefficient as :

         KP * N          KP
KS = ------------- = ---------
      KE * KA * N     KE * KA

The scalability coefficient is clearly inversely proportional to KA. Let's
look at what the poll code does :

1) It has to allocate the kernel buffer for events

2) It has to copy it from userspace

3) It has to allocate wait queue buffer calling get_free_page (possibly
	multiple times when we talk about decent fds numbers)

4) It has to loop calling N times f_op->poll() that in turn will add into
	the wait queue getting/releasing IRQ locks

5) Loop another M times to copy events to userspace

6) Call kfree() for all blocks allocated

7) Call poll_freewait() that will go with another N loop to unregister
	poll waits, that in turn will do another N IRQ locks

The epoll code does remember/cache things so that KE is largely lower than
KP, and this together with a pretty low KA explains the results of poll
scalability measured against epoll.



> 	Is there any actual evidence to suggest that epoll scales better than poll
> for "real loads"? Tests with increasing numbers of idle file descriptors as
> the active count stays constant are not real loads.

Yes, of course. The time spent inside poll/select becomes a PITA when you
start dealing with huge numbers of fds. And this is kernel time. This does
not obviously mean that if epoll is 10 times faster than poll under load,
and you switch your app to epoll, it'll be ten times faster. It means that
the kernel time spent inside poll will be 1/10. And many of the operations
done by poll require IRQ locks, and this increases the time the kernel
spends with IRQs disabled, which is never a good thing.



- Davide


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 19:15         ` Jamie Lokier
@ 2003-07-13 23:03           ` Davide Libenzi
  2003-07-14  1:41             ` Jamie Lokier
  0 siblings, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-13 23:03 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: David S. Miller, e0206, Linux Kernel Mailing List, kuznet

On Sun, 13 Jul 2003, Jamie Lokier wrote:

> > After people reporting the O_RDONLY|O_TRUNC case I'm inclined to expect
> > everything from existing apps ;) POLLHUP should be returned to apps
> > waiting for POLLOUT, while POLLRDHUP should go to ones waiting for POLLIN.
>
> Not sure exactly what you mean by that last sentence.

Brain farting, delete it ;) This is a nice page about POLLHUP treatment :

http://www.greenend.org.uk/rjk/2001/06/poll.html



> At present, it's impossible for socket code to return POLLHUP only to
> apps which are waiting on POLLOUT - because POLLHUP is not maskable in
> sys_poll's API.  Therefore sockets return POLLHUP only if they are
> closed in both directions.
>
> There is no way for a socket to return a HUP condition for someone who
> is waiting only to write, but fortunately that doesn't matter :)

Yes, this could be improved though. If we could only pass our event
interest mask to f_op->poll(), the function would be able to register it
inside the wait queue structure and release only waiters that match the
available condition.



> (*) There aren't that many places which set POLLHUP; they divide into:
> sockets, ttys, SCSI-generic and PPP.  The latter two are not important
> as they don't do half-close.
>
>    __The critical thing with POLL_RDHUP is that it is set if read() would
>    return EOF after returning data.__
>
>    If this condition isn't met, then an app which is using POLL_RDHUP to
>    optimise performance with epoll will hang occasionally.
>
> Sockets are important: TCP is not the only thing to support
> half-closing.  If an app is waiting for POLLRDHUP, and it doesn't know
> what kind of socket it has been given (e.g. AF_UNIX), the network
> stack had better return POLL_RDHUP when there's an EOF pending.
>
> So we'd better add POLLRDHUP to all the socket types which do
> half-closing.  For the rest, no change is required as POLLHUP is
> non-maskable :)  (So apps should always say "if (events &
> (POLLHUP|POLLRDHUP)) check_for_eof()").
>
> And ttys?  They are problematic, because ttys can return EOF _after_
> returning data without closing (and without being hung-up).  An epoll
> loop which is reading a tty (and isn't programmed specially for a tty)
> _must_ receive POLLRDHUP when the EOF is pending, else it may hang.
>
> In other words, POLLRDHUP is the wrong name: the correct name is
> POLLRDEOF.

Please replace 'it may hang' with 'it may hang if it is using the read()
return-bytes check trick' ;)



- Davide


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 21:14         ` Davide Libenzi
@ 2003-07-13 23:05           ` David Schwartz
  2003-07-13 23:11             ` Davide Libenzi
                               ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: David Schwartz @ 2003-07-13 23:05 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Eric Varsanyi, Linux Kernel Mailing List


> Let's look at what the poll code does :
>
> 1) It has to allocate the kernel buffer for events
>
> 2) It has to copy it from userspace
>
> 3) It has to allocate wait queue buffer calling get_free_page (possibly
> 	multiple times when we talk about decent fds numbers)
>
> 4) It has to loop calling N times f_op->poll() that in turn will add into
> 	the wait queue getting/releasing IRQ locks
>
> 5) Loop another M times to copy events to userspace
>
> 6) Call kfree() for all blocks allocated
>
> 7) Call poll_freewait() that will go with another N loop to unregister
> 	poll waits, that in turn will do another N IRQ locks

	This is really just due to bad coding in 'poll', or more precisely very bad
for this case. For example, why is it allocating a wait queue buffer if the
odds that it will need to wait are basically zero? Why is it adding file
descriptors to the wait queue before it has determined that it needs to
wait?

	As load increases, more and more calls to 'poll' require no waiting. Yet
'poll' is heavily optimized for the 'no or low load' case. That's why 'poll'
doesn't scale on Linux.

> Yes, of course. The time spent inside poll/select becomes a PITA when you
> start dealing with huge numbers of fds. And this is kernel time. This does
> not obviously mean that if epoll is 10 times faster than poll under load,
> and you switch your app to epoll, it'll be ten times faster. It means that
> the kernel time spent inside poll will be 1/10. And many of the operations
> done by poll require IRQ locks, and this increases the time the kernel
> spends with IRQs disabled, which is never a good thing.

	My experience has been that this is a huge problem with Linux but not with
any other OS. It can be solved in user-space with some other penalties by
an adaptive sleep before each call to 'poll' and polling with a zero timeout
(thus avoiding the wait queue pain). But all the deficiencies in the 'poll'
implementation in the world won't show anything except that 'poll' is badly
implemented.

> - Davide

	DS



^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 21:10         ` Jamie Lokier
@ 2003-07-13 23:05           ` David Schwartz
  2003-07-13 23:09             ` Davide Libenzi
  2003-07-14  1:27             ` Jamie Lokier
  0 siblings, 2 replies; 58+ messages in thread
From: David Schwartz @ 2003-07-13 23:05 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List




> David Schwartz wrote:

> > 	For most real-world loads, M is some fraction of N. The fraction
> > asymptotically approaches 1 as load increases because under
> > load it takes
> > you longer to get back to polling, so a higher fraction of the
> > descriptors
> > will be ready when you do.

> Ah, but as the fraction approaches 1, you'll find that you are
> asymptotically approaching the point where you can't handle the load
> _regardless_ of epoll overhead.

	This has not been my experience. On pretty much every OS except Linux, my
experience has been that as you are spending more time doing work, each call
to 'poll' discovers more file descriptors ready. Further, the number of
bytes you can send/receive is greater (because it took you longer to get
back to the same connection), so again, the amount of work you do, per call
to 'poll' goes way up. I think most of the problem is just that Linux's
'poll' is extremely expensive and not due to any inherent API benefit of
'epoll'.

> > 	By the way, I'm not arguing against epoll. I believe it
> > will use less
> > resources than poll in pretty much every conceivable situation. I simply
> > take issue with the argument that it has better ultimate scalability or
> > scales at a different order.

> It scales according to the amount of work pending, which means that it
> doesn't take any _more_ time than actually doing the pending work.
> (This assumes you use epoll appropriately; there are many ways to use
> epoll which don't have this property).

	But so does 'poll'. If you double the number of active and inactive
connections, 'poll' takes twice as long. But you do twice as much per call
to 'poll'. You will both discover more connections ready to do work on and
move more bytes per connection as the load increases.

> That was always the complaint about select() and poll(): they dominate
> the run time for large numbers of connections.  epoll, on the other
> hand, will always be in the noise relative to other work.

	I think this is largely true for Linux because of bad implementation of
'poll' and therefore 'select'.

> If you want a formula for slides :), time_polling/time_working is O(1)
> with epoll, but O(N) with poll() & select().

	It's not O(N) with 'poll' and 'select'. Twice as many file descriptors
means twice as many active file descriptors which means twice as many
discovered per call to 'poll'. If the calls to 'poll' are further apart
(because of the additional real work done in-between calls) it means more
than twice as many discovered per call to 'poll'. Add to this that you will
find more bytes ready to read or more space in the send queue per call to
'poll' as the load goes up.

	DS



^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 23:05           ` David Schwartz
@ 2003-07-13 23:09             ` Davide Libenzi
  2003-07-14  8:14               ` Alan Cox
  2003-07-14  1:27             ` Jamie Lokier
  1 sibling, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-13 23:09 UTC (permalink / raw)
  To: David Schwartz; +Cc: Jamie Lokier, Eric Varsanyi, Linux Kernel Mailing List

On Sun, 13 Jul 2003, David Schwartz wrote:

> 	It's not O(N) with 'poll' and 'select'. Twice as many file descriptors
> means twice as many active file descriptors which means twice as many
> discovered per call to 'poll'. If the calls to 'poll' are further apart

It is O(N), if N is the number of fds queried. The poll code does "at least"
2 * N loops over the set (plus other stuff), and hence it is O(N). Even
if you do N "nop"s in your implementation, it is still O(N) from a
mathematical point of view.



- Davide


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 23:05           ` David Schwartz
@ 2003-07-13 23:11             ` Davide Libenzi
  2003-07-13 23:52             ` Entrope
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 58+ messages in thread
From: Davide Libenzi @ 2003-07-13 23:11 UTC (permalink / raw)
  To: David Schwartz; +Cc: Eric Varsanyi, Linux Kernel Mailing List

On Sun, 13 Jul 2003, David Schwartz wrote:

>
> > Let's look at what the poll code does :
> >
> > 1) It has to allocate the kernel buffer for events
> >
> > 2) It has to copy it from userspace
> >
> > 3) It has to allocate wait queue buffer calling get_free_page (possibly
> > 	multiple times when we talk about decent fds numbers)
> >
> > 4) It has to loop calling N times f_op->poll() that in turn will add into
> > 	the wait queue getting/releasing IRQ locks
> >
> > 5) Loop another M times to copy events to userspace
> >
> > 6) Call kfree() for all blocks allocated
> >
> > 7) Call poll_freewait() that will go with another N loop to unregister
> > 	poll waits, that in turn will do another N IRQ locks
>
> 	This is really just due to bad coding in 'poll', or more precisely very bad
> for this case. For example, why is it allocating a wait queue buffer if the
> odds that it will need to wait are basically zero? Why is it adding file
> descriptors to the wait queue before it has determined that it needs to
> wait?
>
> 	As load increases, more and more calls to 'poll' require no waiting. Yet
> 'poll' is heavily optimized for the 'no or low load' case. That's why 'poll'
> doesn't scale on Linux.

However you implement poll(2), you have "at least" to do one iteration
over the interest set, and hence your implementation will be O(N).



- Davide


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 23:05           ` David Schwartz
  2003-07-13 23:11             ` Davide Libenzi
@ 2003-07-13 23:52             ` Entrope
  2003-07-14  6:14               ` David Schwartz
  2003-07-14  1:51             ` Jamie Lokier
  2003-07-15 20:27             ` James Antill
  3 siblings, 1 reply; 58+ messages in thread
From: Entrope @ 2003-07-13 23:52 UTC (permalink / raw)
  To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List

"David Schwartz" <davids@webmaster.com> writes:

>	This is really just due to bad coding in 'poll', or more precisely very bad
> for this case. For example, why is it allocating a wait queue buffer if the
> odds that it will need to wait are basically zero? Why is it adding file
> descriptors to the wait queue before it has determined that it needs to
> wait?
>
> 	As load increases, more and more calls to 'poll' require no waiting. Yet
> 'poll' is heavily optimized for the 'no or low load' case. That's why 'poll'
> doesn't scale on Linux.

Your argument is bogus.  My first-hand experience is with IRC servers,
which customarily have thousands of connections at once, with a very
few percent active in a given check.  The scaling problem is not with
the length of waiting or how poll() is optimized -- it is with the
overhead *inherent* to processing poll().  Common IRC servers spend
100% of CPU when using poll() for only a few thousand clients.  Those
same servers, using FreeBSD's kqueue()/kevent() API, use well under
10% of the CPU.

Yes, the amount of time spent doing useful work increases as the
poll() load increases -- but the time wasted setting up and checking
activity for poll() is something you can never reclaim, and which only
goes up as your CPU gets faster.  epoll() makes you pay the cost of
updating the interest list only when the list changes; poll() makes
you pay the cost every time you call it.

Empirically, four of the five biggest IRC networks run server software
that prefers kqueue() on FreeBSD.  kqueue() did not cause them to be
large, but using kqueue() addresses specific concerns.  On the network
I can speak for, we look forward to having epoll() on Linux for the
same reason.

>> Yes, of course. The time spent inside poll/select becomes a PITA when you
>> start dealing with huge numbers of fds. And this is kernel time. This does
>> not obviously mean that if epoll is 10 times faster than poll under load,
>> and you switch your app to epoll, it'll be ten times faster. It means that
>> the kernel time spent inside poll will be 1/10. And many of the operations
>> done by poll require IRQ locks, and this increases the time the kernel
>> spends with IRQs disabled, which is never a good thing.
>
> 	My experience has been that this is a huge problem with Linux but not with
> any other OS. It can be solved in user-space with some other penalties by
> an adaptive sleep before each call to 'poll' and polling with a zero timeout
> (thus avoiding the wait queue pain). But all the deficiencies in the 'poll'
> implementation in the world won't show anything except that 'poll' is badly
> implemented.

Your experience must be unique, because many people have seen poll()'s
inactive-client overhead cause CPU wastage problems on non-Linux OSes
(for me, FreeBSD and Solaris).

poll() may be badly implemented on Linux or not, but it shares a
design flaw with select(): that the application must specify the list
of FDs for each system call, no matter how few change per call.  That
is the design flaw that epoll() addresses.  If you truly believe that
poll()'s implementation is so flawed, please provide an improved
implementation.

To put it another way, all the optimizations in the world in a 'poll'
implementation cannot overcome the flaw in its specification.  The
specification requires inefficient use of CPU for very common situations.

Michael Poole

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 23:05           ` David Schwartz
  2003-07-13 23:09             ` Davide Libenzi
@ 2003-07-14  1:27             ` Jamie Lokier
  1 sibling, 0 replies; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  1:27 UTC (permalink / raw)
  To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List

David Schwartz wrote:
> 	But so does 'poll'. If you double the number of active and inactive
> connections, 'poll' takes twice as long. But you do twice as much per call
> to 'poll'. You will both discover more connections ready to do work on and
> move more bytes per connection as the load increases.

Well, with that assumption sure there is nothing _to_ scale - poll and
select are perfect.

> > That was always the complaint about select() and poll(): they dominate
> > the run time for large numbers of connections.  epoll, on the other
> > hand, will always be in the noise relative to other work.
> 
> 	I think this is largely true for Linux because of bad implementation of
> 'poll' and therefore 'select'.

It's true those implementations could use clever methods to reduce the
amount of time they take by caching poll results, and probably
approach epoll speeds.

However their APIs have a built-in problem - at minimum, the kernel
has to read & write O(N) memory per call.

With your assumption of a fixed ratio of active/idle sockets, where
that ratio is not very small (10% is not very small), of course
the API is not a problem.

> > If you want a formula for slides :), time_polling/time_working is O(1)
> > with epoll, but O(N) with poll() & select().
> 
> 	It's not O(N) with 'poll' and 'select'. Twice as many file descriptors
> means twice as many active file descriptors which means twice as many
> discovered per call to 'poll'.

It is O(N).  When the load pattern follows your example, O(N) == O(1) :)

-- Jamie


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 23:03           ` Davide Libenzi
@ 2003-07-14  1:41             ` Jamie Lokier
  2003-07-14  2:24               ` POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections) Jamie Lokier
  0 siblings, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  1:41 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: David S. Miller, e0206, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> Yes, this could be improved though. If we could only pass our event
> interest mask to f_op->poll(), the function would be able to register it
> inside the wait queue structure and release only waiters that match the
> available condition.

It's not a bad idea.

> > And ttys?  They are problematic, because ttys can return EOF _after_
> > returning data without closing (and without being hung-up).  An epoll
> > loop which is reading a tty (and isn't programmed specially for a tty)
> > _must_ receive POLLRDHUP when the EOF is pending, else it may hang.
> 
> Please replace 'it may hang' with 'it may hang if it is using the read()
> return bytes check trick' ;)

Sure - but take an app that is normally using TCP sockets and give it
an AF_UNIX socket.  Something as general as the event loop
_shouldn't_ have to depend on that subtlety.

Ok that's avoidable, but it's a trap.  It would be nice to get a flag
that doesn't have a caveat in the manual saying "this flag only works
(at present) on TCP sockets in kernels >= 2.5.76.  Take care not to
use the optimisation for any other kind of fd including other sockets,
as it will break your app...".  That's not the sort of thing I want to
see in the epoll manual page :)

Anyway, there is a correct answer and I have made the patch so wait
for next mail... :)

-- Jamie


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 23:05           ` David Schwartz
  2003-07-13 23:11             ` Davide Libenzi
  2003-07-13 23:52             ` Entrope
@ 2003-07-14  1:51             ` Jamie Lokier
  2003-07-14  6:14               ` David Schwartz
  2003-07-15 20:27             ` James Antill
  3 siblings, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  1:51 UTC (permalink / raw)
  To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List

David Schwartz wrote:
> 	This is really just due to bad coding in 'poll', or more precisely very bad
> for this case. For example, why is it allocating a wait queue buffer if the
> odds that it will need to wait are basically zero? Why is it adding file
> descriptors to the wait queue before it has determined that it needs to
> wait?

Pfeh!  That's just tweaking.

If you really want to optimise 'poll', maintain a per-task event
interest set like epoll does (you could use the epoll infrastructure),
and on each call to 'poll' just apply the differences between the
interest set and whatever is passed to poll.

That would actually reduce the number of calls to per-fd f_op->poll()
methods to O(active), making the internal overhead of 'poll' and
'select' comparable with epoll.

I'd not be surprised if someone has done it already.  I heard of a
"scalable poll" patch some years ago.

That leaves the overhead of the API, which is O(interested) but it is
a much lower overhead factor than the calls to f_op->poll().

Enjoy,
-- Jamie


* POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  1:41             ` Jamie Lokier
@ 2003-07-14  2:24               ` Jamie Lokier
  2003-07-14  2:37                 ` Davide Libenzi
  0 siblings, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  2:24 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Jamie Lokier wrote:
> Anyway, there is a correct answer and I have made the patch so wait
> for next mail... :)

The patch is included below.  Note, it compiles and has been carefully
scrutinised by a team of master coders, but never actually run.  Eric,
you may want to try it out :)

The difference between this and POLLRDHUP is polarity, generality and
correctness :)

Eric, change the polarity of your test from

	if (events & POLLRDHUP)
		goto read_again;

to

	if (!(events & POLLRDONCE))
		goto read_again;

and it should (fingers crossed) work well.

Explanation
-----------

I wrote:

>    The critical thing with POLL_RDHUP is that it is set if read() would
>    return EOF after returning data.

I was mistaken.  That condition isn't strong enough for an
edge-triggered event loop.  Some kinds of fd return a short read when
there is more data pending, due to boundaries - not a HUP condition,
but read-until-EAGAIN is required nonetheless.  Ttys do it, some
devices do it, and TCP does it if there's urgent data (though POLLPRI
should also be set) and under some other conditions.

I strongly dislike the idea that an application should know anything
about the type of fd it is reading from, if all it's doing is reading.
Also having to check kernel version would be ugly - an app which
depended on POLLRDHUP would have to check the kernel version.

After much typing (my first kernel patch in aeons :), the answer follows...

              ----------------------------------

PATCH: POLLRDONCE optimisation for epoll users

The enclosed patch adds a POLLRDONCE condition.  This condition means
that it's _safe_ to read once before the next edge-triggered POLLIN event.

Otherwise, you must call read() repeatedly until you see EAGAIN, if
you are using edge-triggered epoll.  Simply calling read() once is not
guaranteed to re-arm the event.  (If you are using level-triggered
mode there is no problem, but edge-triggered mode is faster :)

This distinguishes the case where one read() is enough on a TCP
socket, from the case where a second read() is needed to fetch the EOF
condition.  Without this bit, applications must always call read()
twice in the very common case that the second will return -EAGAIN.

It is always safe for the application to ignore this bit, and safe for
the kernel never to set it.  So an application which looks for the bit
will behave correctly, unmodified, on older kernels or with other kinds
of fd.

This patch provides the optimisation for TCP sockets, pipes, fifos and
SOCK_STREAM AF_UNIX sockets, which are the common fd types for high
performance streaming applications.

Enjoy,
-- Jamie

diff -ur orig-2.5.75/fs/pipe.c pollrdonce-2.5.75/fs/pipe.c
--- orig-2.5.75/fs/pipe.c	2003-07-08 21:41:28.000000000 +0100
+++ pollrdonce-2.5.75/fs/pipe.c	2003-07-14 02:19:27.000000000 +0100
@@ -29,6 +29,9 @@
  *
  * pipe_read & write cleanup
  * -- Manfred Spraul <manfred@colorfullife.com> 2002-05-09
+ *
+ * Added POLLRDONCE.
+ * -- Jamie Lokier 2003-07-14
  */
 
 /* Drop the inode semaphore and wait for a pipe event, atomically */
@@ -250,6 +253,8 @@
 
 	/* Reading only -- no need for acquiring the semaphore.  */
 	mask = POLLIN | POLLRDNORM;
+	if (PIPE_WRITERS(*inode))
+		mask |= POLLRDONCE;
 	if (PIPE_EMPTY(*inode))
 		mask = POLLOUT | POLLWRNORM;
 	if (!PIPE_WRITERS(*inode) && filp->f_version != PIPE_WCOUNTER(*inode))
diff -ur orig-2.5.75/include/asm-alpha/poll.h pollrdonce-2.5.75/include/asm-alpha/poll.h
--- orig-2.5.75/include/asm-alpha/poll.h	2003-07-13 20:07:42.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-alpha/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -13,6 +13,7 @@
 #define POLLWRBAND	(1 << 9)
 #define POLLMSG		(1 << 10)
 #define POLLREMOVE	(1 << 11)
+#define POLLRDONCE	(1 << 12)
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-arm/poll.h pollrdonce-2.5.75/include/asm-arm/poll.h
--- orig-2.5.75/include/asm-arm/poll.h	2003-07-08 21:42:07.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-arm/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -16,6 +16,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-arm26/poll.h pollrdonce-2.5.75/include/asm-arm26/poll.h
--- orig-2.5.75/include/asm-arm26/poll.h	2003-07-08 21:52:49.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-arm26/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -15,6 +15,7 @@
 #define POLLWRNORM	0x0100
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-cris/poll.h pollrdonce-2.5.75/include/asm-cris/poll.h
--- orig-2.5.75/include/asm-cris/poll.h	2003-07-12 17:57:34.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-cris/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -15,6 +15,7 @@
 #define POLLWRBAND	512
 #define POLLMSG		1024
 #define POLLREMOVE	4096
+#define POLLRDONCE	8192
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-h8300/poll.h pollrdonce-2.5.75/include/asm-h8300/poll.h
--- orig-2.5.75/include/asm-h8300/poll.h	2003-07-08 21:42:18.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-h8300/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -12,6 +12,7 @@
 #define POLLRDBAND	128
 #define POLLWRBAND	256
 #define POLLMSG		0x0400
+#define POLLRDONCE	2048
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-i386/poll.h pollrdonce-2.5.75/include/asm-i386/poll.h
--- orig-2.5.75/include/asm-i386/poll.h	2003-07-08 21:42:26.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-i386/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -16,6 +16,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-ia64/poll.h pollrdonce-2.5.75/include/asm-ia64/poll.h
--- orig-2.5.75/include/asm-ia64/poll.h	2003-07-08 21:42:30.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-ia64/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -21,6 +21,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-m68k/poll.h pollrdonce-2.5.75/include/asm-m68k/poll.h
--- orig-2.5.75/include/asm-m68k/poll.h	2003-07-08 21:42:45.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-m68k/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -13,6 +13,7 @@
 #define POLLWRBAND	256
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-mips/poll.h pollrdonce-2.5.75/include/asm-mips/poll.h
--- orig-2.5.75/include/asm-mips/poll.h	2003-07-08 21:55:30.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-mips/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -17,6 +17,7 @@
 /* These seem to be more or less nonstandard ...  */
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-mips64/poll.h pollrdonce-2.5.75/include/asm-mips64/poll.h
--- orig-2.5.75/include/asm-mips64/poll.h	2003-07-08 21:55:33.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-mips64/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -17,6 +17,7 @@
 /* These seem to be more or less nonstandard ...  */
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-parisc/poll.h pollrdonce-2.5.75/include/asm-parisc/poll.h
--- orig-2.5.75/include/asm-parisc/poll.h	2003-07-08 21:43:06.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-parisc/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -16,6 +16,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-ppc/poll.h pollrdonce-2.5.75/include/asm-ppc/poll.h
--- orig-2.5.75/include/asm-ppc/poll.h	2003-07-08 21:43:15.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-ppc/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -13,6 +13,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-ppc64/poll.h pollrdonce-2.5.75/include/asm-ppc64/poll.h
--- orig-2.5.75/include/asm-ppc64/poll.h	2003-07-08 21:43:15.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-ppc64/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -22,6 +22,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-s390/poll.h pollrdonce-2.5.75/include/asm-s390/poll.h
--- orig-2.5.75/include/asm-s390/poll.h	2003-07-08 21:43:19.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-s390/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -24,6 +24,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-sh/poll.h pollrdonce-2.5.75/include/asm-sh/poll.h
--- orig-2.5.75/include/asm-sh/poll.h	2003-07-08 21:55:36.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-sh/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -16,6 +16,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-sparc/poll.h pollrdonce-2.5.75/include/asm-sparc/poll.h
--- orig-2.5.75/include/asm-sparc/poll.h	2003-07-08 21:43:28.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-sparc/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -13,6 +13,7 @@
 #define POLLWRBAND	256
 #define POLLMSG		512
 #define POLLREMOVE	1024
+#define POLLRDONCE	2048
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-sparc64/poll.h pollrdonce-2.5.75/include/asm-sparc64/poll.h
--- orig-2.5.75/include/asm-sparc64/poll.h	2003-07-08 21:43:33.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-sparc64/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -13,6 +13,7 @@
 #define POLLWRBAND	256
 #define POLLMSG		512
 #define POLLREMOVE	1024
+#define POLLRDONCE	2048
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-v850/poll.h pollrdonce-2.5.75/include/asm-v850/poll.h
--- orig-2.5.75/include/asm-v850/poll.h	2003-07-08 21:43:40.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-v850/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -13,6 +13,7 @@
 #define POLLWRBAND	0x0100
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-x86_64/poll.h pollrdonce-2.5.75/include/asm-x86_64/poll.h
--- orig-2.5.75/include/asm-x86_64/poll.h	2003-07-08 21:43:45.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-x86_64/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -16,6 +16,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000
 
 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/net/ipv4/tcp.c pollrdonce-2.5.75/net/ipv4/tcp.c
--- orig-2.5.75/net/ipv4/tcp.c	2003-07-08 21:54:33.000000000 +0100
+++ pollrdonce-2.5.75/net/ipv4/tcp.c	2003-07-14 02:19:16.000000000 +0100
@@ -206,6 +206,7 @@
  *					lingertime == 0 (RFC 793 ABORT Call)
  *	Hirokazu Takahashi	:	Use copy_from_user() instead of
  *					csum_and_copy_from_user() if possible.
+ *		Jamie Lokier	:	Added POLLRDONCE.
  *
  *		This program is free software; you can redistribute it and/or
  *		modify it under the terms of the GNU General Public License
@@ -426,22 +427,39 @@
 	 *
 	 * NOTE. Check for TCP_CLOSE is added. The goal is to prevent
 	 * blocking on fresh not-connected or disconnected socket. --ANK
+	 *
+	 * NOTE. POLLRDONCE will be set _only_ if multiple read/recvmsg
+	 * calls are not required before the next edge-triggered epoll
+	 * wakeup.  Typically multiple calls are needed when data+EOF is
+	 * pending; then the first read() is not enough to re-arm the
+	 * POLLIN event.     -- Jamie Lokier
 	 */
 	if (sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE)
 		mask |= POLLHUP;
+	/*
+	 * RCV_SHUTDOWN is always set when an EOF condition becomes pending.
+	 */
 	if (sk->sk_shutdown & RCV_SHUTDOWN)
 		mask |= POLLIN | POLLRDNORM;
 
 	/* Connected? */
 	if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+
+		if (tp->urg_data & TCP_URG_VALID)
+			mask |= POLLPRI;
+
 		/* Potential race condition. If read of tp below will
 		 * escape above sk->sk_state, we can be illegally awaken
 		 * in SYN_* states. */
 		if ((tp->rcv_nxt != tp->copied_seq) &&
 		    (tp->urg_seq != tp->copied_seq ||
 		     tp->rcv_nxt != tp->copied_seq + 1 ||
-		     sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
-			mask |= POLLIN | POLLRDNORM;
+		     sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data)) {
+			if (mask == 0 && sk->sk_rcvlowat == INT_MAX)
+				mask = POLLIN | POLLRDNORM | POLLRDONCE;
+			else
+				mask |= POLLIN | POLLRDNORM;
+		}
 
 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
 			if (tcp_wspace(sk) >= tcp_min_write_space(sk)) {
@@ -459,9 +477,6 @@
 					mask |= POLLOUT | POLLWRNORM;
 			}
 		}
-
-		if (tp->urg_data & TCP_URG_VALID)
-			mask |= POLLPRI;
 	}
 	return mask;
 }
diff -ur orig-2.5.75/net/unix/af_unix.c pollrdonce-2.5.75/net/unix/af_unix.c
--- orig-2.5.75/net/unix/af_unix.c	2003-07-12 17:57:39.000000000 +0100
+++ pollrdonce-2.5.75/net/unix/af_unix.c	2003-07-14 02:33:25.000000000 +0100
@@ -50,6 +50,7 @@
  *	     Arnaldo C. Melo	:	Remove MOD_{INC,DEC}_USE_COUNT,
  *	     				the core infrastructure is doing that
  *	     				for all net proto families now (2.5.69+)
+ *		Jamie Lokier	:	Added POLLRDONCE.
  *
  *
  * Known differences from reference BSD that was tested:
@@ -1784,15 +1785,21 @@
 	if (sk->sk_shutdown == SHUTDOWN_MASK)
 		mask |= POLLHUP;
 
-	/* readable? */
-	if (!skb_queue_empty(&sk->sk_receive_queue) ||
-	    (sk->sk_shutdown & RCV_SHUTDOWN))
-		mask |= POLLIN | POLLRDNORM;
-
 	/* Connection-based need to check for termination and startup */
 	if (sk->sk_type == SOCK_STREAM && sk->sk_state == TCP_CLOSE)
 		mask |= POLLHUP;
 
+	/* readable? */
+	if ((sk->sk_shutdown & RCV_SHUTDOWN))
+		mask |= POLLIN | POLLRDNORM;
+	else if (!skb_queue_empty(&sk->sk_receive_queue)) {
+		if (mask == 0 && sk->sk_type == SOCK_STREAM
+		    && sk->sk_rcvlowat == INT_MAX)
+			mask = POLLIN | POLLRDNORM | POLLRDONCE;
+		else
+			mask |= POLLIN | POLLRDNORM;
+	}
+
 	/*
 	 * we set writable also when the other side has shut down the
 	 * connection. This prevents stuck sockets.


* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  2:24               ` POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections) Jamie Lokier
@ 2003-07-14  2:37                 ` Davide Libenzi
  2003-07-14  2:43                   ` Davide Libenzi
                                     ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14  2:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Jamie Lokier wrote:
> > Anyway, there is a correct answer and I have made the patch so wait
> > for next mail... :)
>
> The patch is included below.  Note, it compiles and has been carefully
> scrutinised by a team of master coders, but never actually run.  Eric,
> you may want to try it out :)
>
> The difference between this and POLLRDHUP is polarity, generality and
> correctness :)
>
> Eric, change the polarity of your test from
>
> 	if (events & POLLRDHUP)
> 		goto read_again;
>
> to
>
> 	if (!(events & POLLRDONCE))
> 		goto read_again;
>
> and it should (fingers crossed) work well.

Ouch, I definitely liked the POLLHUP thing more. It is not linked to epoll
at all. Ok, suppose that our super-fast app chooses edge-triggered epoll
and also, aiming at top speed, uses the smart read(2) trick:


void my_process_read(my_data *d, unsigned int events) {
	int n, s;

	do {
		s = d->buffer_size - d->in_buffer;
		if ((n = read(d->fd, d->buffer + d->in_buffer, s)) > 0) {
			process_partial_buffer(d, s);
			d->in_buffer += s;
		}
	} while (n == s);
	if (s == -1 && errno != EAGAIN) {
		handle_read_error(d);
		return;
	}
	if (events & EPOLLRDHUP) {
		d->flags |= HANGUP;
		schedule_removal(d);
	}
}

Where will this break by using POLLRDHUP?



- Davide



* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  2:37                 ` Davide Libenzi
@ 2003-07-14  2:43                   ` Davide Libenzi
  2003-07-14  2:56                   ` Jamie Lokier
  2003-07-14  3:04                   ` Jamie Lokier
  2 siblings, 0 replies; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14  2:43 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Sun, 13 Jul 2003, Davide Libenzi wrote:

> void my_process_read(my_data *d, unsigned int events) {
> 	int n, s;
>
> 	do {
> 		s = d->buffer_size - d->in_buffer;
> 		if ((n = read(d->fd, d->buffer + d->in_buffer, s)) > 0) {
> 			process_partial_buffer(d, s);
> 			d->in_buffer += s;
> 		}
> 	} while (n == s);
> 	if (s == -1 && errno != EAGAIN) {
> 		handle_read_error(d);
> 		return;
> 	}
> 	if (events & EPOLLRDHUP) {
> 		d->flags |= HANGUP;
> 		schedule_removal(d);
> 	}
> }

Ouch, this should obviously be:

void my_process_read(my_data *d, unsigned int events) {
      int n, s;

      do {
		s = d->buffer_size - d->in_buffer;
		if ((n = read(d->fd, d->buffer + d->in_buffer, s)) > 0) {
			process_partial_buffer(d, n);
			d->in_buffer += n;
		}
	} while (n == s);
	if (n == -1 && errno != EAGAIN) {
		handle_read_error(d);
		return;
	}
	if (events & EPOLLRDHUP) {
		d->flags |= HANGUP;
		schedule_removal(d);
	}
}



- Davide



* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  2:37                 ` Davide Libenzi
  2003-07-14  2:43                   ` Davide Libenzi
@ 2003-07-14  2:56                   ` Jamie Lokier
  2003-07-14  3:02                     ` Davide Libenzi
  2003-07-14  3:12                     ` Jamie Lokier
  2003-07-14  3:04                   ` Jamie Lokier
  2 siblings, 2 replies; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  2:56 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> Where will this break by using POLLRDHUP?

It will break if

	(a) fd isn't a socket
	(b) fd isn't a TCP socket
	(c) kernel version <= 2.5.75
	(d) SO_RCVLOWAT < s
	(e) there is urgent data with OOBINLINE (I think)

-- Jamie


* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  2:56                   ` Jamie Lokier
@ 2003-07-14  3:02                     ` Davide Libenzi
  2003-07-14  3:16                       ` Jamie Lokier
  2003-07-14  3:12                     ` Jamie Lokier
  1 sibling, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14  3:02 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > Where will this break by using POLLRDHUP?
>
> It will break if
>
> 	(a) fd isn't a socket
> 	(b) fd isn't a TCP socket
> 	(c) kernel version <= 2.5.75
> 	(d) SO_RCVLOWAT < s
> 	(e) there is urgent data with OOBINLINE (I think)

Jamie, did you smoke that stuff again? :)
With Eric's patch in the proper places it is just fine. You just make
f_op->poll() report the extra flag other than POLLIN. What's the problem?



- Davide



* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  2:37                 ` Davide Libenzi
  2003-07-14  2:43                   ` Davide Libenzi
  2003-07-14  2:56                   ` Jamie Lokier
@ 2003-07-14  3:04                   ` Jamie Lokier
  2003-07-14  3:12                     ` Davide Libenzi
  2 siblings, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  3:04 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> Ouch, I definitely liked the POLLHUP thing more. It is not linked to epoll
> at all.

POLLRDONCE isn't linked to epoll - it's a valid hint for poll/select too.

It means something different, i.e. you can't write:

> 	if (events & EPOLLRDHUP) {
> 		d->flags |= HANGUP;
> 		schedule_removal(d);
> 	}

Be careful, because that isn't valid if there is urgent data.  You
need to check POLLPRI too.  Granted urgent data is usually best ignored :)

If fast hangup is a useful optimisation too, we should have both flags.
(However calling read() doesn't seem like a great penalty for that).

-- Jamie


* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  3:04                   ` Jamie Lokier
@ 2003-07-14  3:12                     ` Davide Libenzi
  2003-07-14  3:27                       ` Jamie Lokier
  0 siblings, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14  3:12 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > Ouch, I definitely liked the POLLHUP thing more. It is not linked to epoll
> > at all.
>
> POLLRDONCE isn't linked to epoll - it's a valid hint for poll/select too.
>
> It means something different, i.e. you can't write:
>
> > 	if (events & EPOLLRDHUP) {
> > 		d->flags |= HANGUP;
> > 		schedule_removal(d);
> > 	}
>
> Be careful, because that isn't valid if there is urgent data.  You
> need to check POLLPRI too.  Granted urgent data is usually best ignored :)

I didn't want to code the whole application here, hope you understand ;)


> If fast hangup is a useful optimisation too, we should have both flags.
> (However calling read() doesn't seem like a great penalty for that).

Indeed. Hangup cases are a small fraction of the standard ones, so it
makes no sense to optimize for them by trying to avoid the read
altogether. And the name RDONCE seems to imply that you can't read(2)
twice. I'd prefer the RDHUP flag, which tells me: there is a hangup
condition for sure, and you might also find some data, since POLLIN is set.



- Davide



* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  2:56                   ` Jamie Lokier
  2003-07-14  3:02                     ` Davide Libenzi
@ 2003-07-14  3:12                     ` Jamie Lokier
  2003-07-14  3:17                       ` Davide Libenzi
  1 sibling, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  3:12 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Jamie Lokier wrote:
> 	(e) there is urgent data with OOBINLINE (I think)

To be more precise, using the POLLRDHUP patch as-is, if someone sends
your program some data, then an URGent segment, then a FIN with
optional data in between, your program won't notice the second data or
FIN and will fail to clean up the socket.

-- Jamie


* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  3:02                     ` Davide Libenzi
@ 2003-07-14  3:16                       ` Jamie Lokier
  2003-07-14  3:21                         ` Davide Libenzi
  0 siblings, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  3:16 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> > > Where will this break by using POLLRDHUP?
> >
> > It will break if
> >
> > 	(a) fd isn't a socket
> > 	(b) fd isn't a TCP socket
> > 	(c) kernel version <= 2.5.75
> > 	(d) SO_RCVLOWAT < s
> > 	(e) there is urgent data with OOBINLINE (I think)

> Jamie, did you smoke that stuff again? :)
> With Eric's patch in the proper places it is just fine. You just make
> f_op->poll() report the extra flag other than POLLIN. What's the problem?

The problem in cases (a)-(e) is your loop will call read() just once
when it needs to call read() until it sees EAGAIN.

What's wrong is the behaviour of your program when the extra flag
_isn't_ set.

-- Jamie


* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  3:12                     ` Jamie Lokier
@ 2003-07-14  3:17                       ` Davide Libenzi
  2003-07-14  3:35                         ` Jamie Lokier
  0 siblings, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14  3:17 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Jamie Lokier wrote:
> > 	(e) there is urgent data with OOBINLINE (I think)
>
> To be more precise, using the POLLRDHUP patch as-is, if someone sends
> your program some data, then an URGent segment, then a FIN with
> optional data in between, your program won't notice the second data or
> FIN and will fail to clean up the socket.

And why ? To me it looks fairly simple. When the FIN is received a wakeup
is done on the poll wait list and the following f_op->poll will fetch the
RDHUP flag. Then the next epoll_wait() will fetch the event and will have
all the info it needs to do things correctly.



- Davide



* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  3:16                       ` Jamie Lokier
@ 2003-07-14  3:21                         ` Davide Libenzi
  2003-07-14  3:42                           ` Jamie Lokier
  0 siblings, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14  3:21 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > > > Where this will break by using a POLLRDHUP ?
> > >
> > > It will break if
> > >
> > > 	(a) fd isn't a socket
> > > 	(b) fd isn't a TCP socket
> > > 	(c) kernel version <= 2.5.75
> > > 	(d) SO_RCVLOWAT < s
> > > 	(e) there is urgent data with OOBINLINE (I think)
>
> > Jamie, did you smoke that stuff again ? :)
> > With Eric's patch in the proper places it is just fine. You just make
> > f_op->poll() report the extra flag other than POLLIN. What's the problem ?
>
> The problem in cases (a)-(e) is your loop will call read() just once
> when it needs to call read() until it sees EAGAIN.
>
> What's wrong is the behaviour of your program when the extra flag
> _isn't_ set.

Jamie, the loop will call read(2) for as long as data is available. With the
trick of checking the returned number of bytes you can avoid the extra
read(2) that would just return EAGAIN. That's the point of the read(2)
trick. The final check for RDHUP tells you there are no more POLLINs to
wait for, since nobody is sending any more.



- Davide



* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  3:12                     ` Davide Libenzi
@ 2003-07-14  3:27                       ` Jamie Lokier
  0 siblings, 0 replies; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  3:27 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> And the name READONCE seems to imply that you can't read(2) twice.

Like all POLL* flags, you can always do more than it implies and get EAGAIN :)

I don't care about the name, feel free to pick another.

> I'd rather prefer the RDHUP flag that tells me : There's a hangup
> condition for sure, and you might also find some data since POLLIN is set.

Yeah, but it doesn't stop the do-while loop from being broken :)

-- Jamie


* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  3:17                       ` Davide Libenzi
@ 2003-07-14  3:35                         ` Jamie Lokier
  0 siblings, 0 replies; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  3:35 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> > To be more precise, using the POLLRDHUP patch as-is, if someone sends
> > your program some data, then an URGent segment, then a FIN with
> > optional data in between, your program won't notice the second data or
> > FIN and will fail to clean up the socket.
> 
> And why ? To me it looks fairly simple. When the FIN is received a wakeup
> is done on the poll wait list and the following f_op->poll will fetch the
> RDHUP flag. Then the next epoll_wait() will fetch the event and will have
> all the info it needs to do things correctly.

Burp.
You're right.

The loop failure comes when the user sends URG and more data _without_ FIN.

Then RDHUP is not set, and your loop will read up to before the URG
and no further.

(Normal behaviour would be to skip the URG segment and continue
reading data after it, or to include the URG segment if OOBINLINE is
set.)

Ahem,
-- Jamie


* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  3:21                         ` Davide Libenzi
@ 2003-07-14  3:42                           ` Jamie Lokier
  2003-07-14  4:00                             ` Davide Libenzi
  0 siblings, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  3:42 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> > > > 	(a) fd isn't a socket
> > > > 	(b) fd isn't a TCP socket
> > > > 	(c) kernel version <= 2.5.75
> > > > 	(d) SO_RCVLOWAT < s
> > > > 	(e) there is urgent data with OOBINLINE (I think)
> >
> > > Jamie, did you smoke that stuff again ? :)
> > > With Eric's patch in the proper places it is just fine. You just make
> > > f_op->poll() report the extra flag other than POLLIN. What's the problem ?
> >
> > The problem in cases (a)-(e) is your loop will call read() just once
> > when it needs to call read() until it sees EAGAIN.
> >
> > What's wrong is the behaviour of your program when the extra flag
> > _isn't_ set.
> 
> Jamie, the loop will call read(2) for as long as data is available. With the
> trick of checking the returned number of bytes you can avoid the extra
> read(2) that would just return EAGAIN. That's the point of the read(2) trick.

That _only_ works if none of those conditions (a)-(e) applies.
Otherwise, short reads are possible when there is more to come.

Sure, if you're willing to assert that the program is running on
kernel >= 2.5.76, all its fds are for sure TCP sockets and you added
the POLLPRI check, then yes it's fine.

I think mine is better because it works always, and you are free to
code the optimisation in any programs, libraries etc.

> The final check for RDHUP tells you there are no more POLLINs to
> wait for, since nobody is sending any more.

Sure, _that_ check is fine.

-- Jamie


* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  3:42                           ` Jamie Lokier
@ 2003-07-14  4:00                             ` Davide Libenzi
  2003-07-14  5:51                               ` Jamie Lokier
  0 siblings, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14  4:00 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> (a) fd isn't a socket
> (b) fd isn't a TCP socket

Jamie, libraries like libevent, for example, are indeed completely generic.
They fetch events and call the associated callback. Inside your callback
you obviously know which kind of fd you are working on, so you use the
reading function that best fits the fd type. Obviously the read(2) trick
only works for stream-type fds.


> (c) kernel version <= 2.5.75

Obviously, POLLRDHUP is not yet inside the kernel :)


> (d) SO_RCVLOWAT < s

This does not apply with non-blocking fds.


> (e) there is urgent data with OOBINLINE (I think)

You obviously need an EPOLLPRI check in your read handling routine if your
app is expecting urgent data.



- Davide



* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  4:00                             ` Davide Libenzi
@ 2003-07-14  5:51                               ` Jamie Lokier
  2003-07-14  6:24                                 ` Davide Libenzi
  0 siblings, 1 reply; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  5:51 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Oh!  I realised why we're mis-communicating.

I started with the assumption that you're using POLLRDHUP to trigger a
second read, but you're using it in a different way (which makes sense,
given the name!)

You're assuming that a short read is always the last one with data
(that's the read() trick), and that detecting EOF is really a wholly
separate problem.

Ah..

First let's get the misc bits out of the way.

Davide Libenzi wrote:
> > (d) SO_RCVLOWAT < s
> 
> This does not apply with non-blocking fds.

Look at the line "if (copied >= target)" in tcp_recvmsg.

> > (e) there is urgent data with OOBINLINE (I think)
> 
> You obviously need an EPOLLPRI check in your read handling routine if your
> app is expecting urgent data.

Normal behaviour is for urgent data to be discarded, I believe.  Now
if someone sends it to you, you'll end up with the socket stalling
with pending data in the buffers.  Not saying whether you care, it's
just a difference of behaviour to be noted and a potential DoS
(filling socket buffers which the app doesn't know to empty).

Now on to the important stuff.

> On Mon, 14 Jul 2003, Jamie Lokier wrote:
> 
> > (a) fd isn't a socket
> > (b) fd isn't a TCP socket
> 
> Jamie, libraries like libevent, for example, are indeed completely generic.
> They fetch events and call the associated callback. Inside your callback
> you obviously know which kind of fd you are working on.

I disagree - inside a stream parser callback (e.g. XML transcoder) I
prefer to _not_ know the difference between pipe, file, tty and socket
that I am reading.

> So you use the reading function that best fits the fd type. Obviously
> the read(2) trick only works for stream-type fds.

Stream fds, provided you don't hit the unusual cases.

Point taken.  Now I'm saying there's an interface which is no more or
less complicated, but _doesn't_ require you to treat different kinds
of fds differently.  So I can write code which uses the read() trick
universally without having to pass that extra parameter,
EOF_SETS_RDHUP, to the event callback :)

> > (c) kernel version <= 2.5.75
> 
> Obviously, POLLRDHUP is not yet inside the kernel :)

Quite.  When you write an app that uses it and the read(2) trick
you'll see the bug which Eric brought up :)

I'm saying there's a way to write an app which can use the read(2)
trick, yet which does _not_ hang on older kernels.  Hence is robust.

My overall philosophy on this:

    The less object A needs to know about object B, the better, right?

Right? :)

-- Jamie


* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 23:52             ` Entrope
@ 2003-07-14  6:14               ` David Schwartz
  2003-07-14  7:20                 ` Jamie Lokier
  0 siblings, 1 reply; 58+ messages in thread
From: David Schwartz @ 2003-07-14  6:14 UTC (permalink / raw)
  To: Entrope; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List


> Your argument is bogus.  My first-hand experience is with IRC servers,
> which customarily have thousands of connections at once, with a very
> few percent active in a given check.  The scaling problem is not with
> the length of waiting or how poll() is optimized -- it is with the
> overhead *inherent* to processing poll().  Common IRC servers spend
> 100% of CPU when using poll() for only a few thousand clients.  Those
> same servers, using FreeBSD's kqueue()/kevent() API, use well under
> 10% of the CPU.

	My argument is not bogus, you just don't understand it. Two algorithms can
scale at the same order and yet one can perform much better than the other.
Poll, for example, could use 10% of the CPU at 100 users and 100% at 1000
users. While kqueue/kevent could use .01% at 100 users and .2% at 1000
users. With these numbers, poll is much more scalable (its CPU usage went up
by a factor of 10 while kqueue went up by a factor of 20) yet kqueue will
outperform poll.

	I am specifically talking about scalability, in the compsci sense. I grant
that epoll will use less CPU than poll in every configuration.

> Yes, the amount of time spent doing useful work increases as the
> poll() load increases -- but the time wasted setting up and checking
> activity for poll() is something you can never reclaim, and which only
> goes up as your CPU gets faster.  epoll() makes you pay the cost of
> updating the interest list only when the list changes; poll() makes
> you pay the cost every time you call it.

	I know what epoll *is*.

> Empirically, four of the five biggest IRC networks run server software
> that prefers kqueue() on FreeBSD.  kqueue() did not cause them to be
> large, but using kqueue() addresses specific concerns.  On the network
> I can speak for, we look forward to having epoll() on Linux for the
> same reason.

	Wonderful, now please show me where the error in my argument is.

> > 	My experience has been that this is a huge problem with
> > Linux but not with
> > any other OS. It can be solved in user-space with some other
> > penalities by
> > an adaptive sleep before each call to 'poll' and polling with a
> > zero timeout
> > (thus avoiding the wait queue pain). But all the deficiencies
> > in the 'poll'
> > implementation in the world won't show anything except that
> > 'poll' is badly
> > implemented.

> Your experience must be unique, because many people have seen poll()'s
> inactive-client overhead cause CPU wastage problems on non-Linux OSes
> (for me, FreeBSD and Solaris).

	References please? And again, artificial cases where the number of active
descriptors was held constant while the number of inactive descriptors was
increased do not count.

> poll() may be badly implemented on Linux or not, but it shares a
> design flaw with select(): that the application must specify the list
> of FDs for each system call, no matter how few change per call.  That
> is the design flaw that epoll() addresses.

	I know that. What does that have to do with anything? Are you even reading
what I'm writing?

> If you truly believe that
> poll()'s implementation is so flawed, please provide an improved
> implementation.

	It's trivial to make the optimizations that my post very obviously
suggests. One would be to defer creating the wait queue until it's clear we
need to wait. The problem is, these optimizations would harm the low-load
and no-load cases and it's my understanding from the last time this was
discussed that changes that worsen the 'most common' case will be refused
even if they improve the 'high load' case.

> To put it another way, all the optimizations in the world for a 'poll'
> implementation won't sustain it unless you understand the flaw in its
> specification.  The specification requires inefficient use of CPU for
> very common situations.

	Fine, but show me how that flaw impacts scalability. Please read what I
said again, I understand that 'epoll' is superior to 'poll'. I'm
specifically disputing whether or not 'epoll' has a specific mathematical
characteristic.

	DS




* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-14  1:51             ` Jamie Lokier
@ 2003-07-14  6:14               ` David Schwartz
  0 siblings, 0 replies; 58+ messages in thread
From: David Schwartz @ 2003-07-14  6:14 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List


> David Schwartz wrote:

> > 	This is really just due to bad coding in 'poll', or more
> > precisely very bad
> > for this case. For example, why is it allocating a wait queue
> > buffer if the
> > odds that it will need to wait are basically zero? Why is it adding file
> > descriptors to the wait queue before it has determined that it needs to
> > wait?

> Pfeh!  That's just tweaking.

	Nonetheless, it's embarrassing for Linux that performance shoots up if you
replace a normal call to 'poll' with a sleep followed by a call to 'poll'
with a zero wait. But that's peripheral to my point.

> If you really want to optimise 'poll', maintain a per-task event
> interest set like epoll does (you could use the epoll infrastructure),
> and on each call to 'poll' just apply the differences between the
> interest set and whatever is passed to poll.
>
> That would actually reduce the number of calls to per-fd f_op->poll()
> methods to O(active), making the internal overhead of 'poll' and
> 'select' comparable with epoll.
>
> I'd not be surprised if someone has done it already.  I heard of a
> "scalable poll" patch some years ago.
>
> That leaves the overhead of the API, which is O(interested) but it is
> a much lower overhead factor than the calls to f_op->poll().

	Definitely, the API will always guarantee performance is O(n). If you're
interested in ultimate scalability, you can never exceed O(n) with 'poll'.
But my point is that you can never exceed O(n) with any discovery
implementation because the number of sockets to be discovered scales at
O(n), and you have to do something per socket.

	This doesn't change the fact that 'epoll' outperforms 'poll' at every
conceivable situation. It also doesn't change the fact that edge-based
triggering allows some phenomenal optimizations in multi-threaded code. It
also doesn't change the fact that Linux's 'poll' implementation is not so
good for the high-load, busy server case.

	All I'm trying to say is that the argument that 'epoll' scales at a better
order than 'poll' is bogus. They both scale at O(n) where 'n' is the number
of connections you have. No conceivable implementation could scale any
better.

	DS




* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  5:51                               ` Jamie Lokier
@ 2003-07-14  6:24                                 ` Davide Libenzi
  2003-07-14  6:57                                   ` Jamie Lokier
  0 siblings, 1 reply; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14  6:24 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > > (d) SO_RCVLOWAT < s
> >
> > This does not apply with non-blocking fds.
>
> Look at the line "if (copied >= target)" in tcp_recvmsg.

Look at this :

	timeo = sock_rcvtimeo(sk, nonblock);

;)


>
> > > (e) there is urgent data with OOBINLINE (I think)
> >
> > You obviously need an EPOLLPRI check in your read handling routine if your
> > app is expecting urgent data.
>
> Normal behaviour is for urgent data to be discarded, I believe.  Now
> if someone sends it to you, you'll end up with the socket stalling
> with pending data in the buffers.  Not saying whether you care, it's
> just a difference of behaviour to be noted and a potential DoS
> (filling socket buffers which the app doesn't know to empty).

Yes, with OOBINLINE you need to take care of EPOLLPRI if you want to use
the read(2) trick. The OOB data effectively breaks the read.



> > On Mon, 14 Jul 2003, Jamie Lokier wrote:
> >
> > > (a) fd isn't a socket
> > > (b) fd isn't a TCP socket
> >
> > Jamie, libraries like libevent, for example, are indeed completely generic.
> > They fetch events and call the associated callback. Inside your callback
> > you obviously know which kind of fd you are working on.
>
> I disagree - inside a stream parser callback (e.g. XML transcoder) I
> prefer to _not_ know the difference between pipe, file, tty and socket
> that I am reading.

These are streams and you can use the read(2) trick w/out problems. I
don't think you want to mount your XML parser over UDP.



> > > (c) kernel version <= 2.5.75
> >
> > Obviously, POLLRDHUP is not yet inside the kernel :)
>
> Quite.  When you write an app that uses it and the read(2) trick
> you'll see the bug which Eric brought up :)
>
> I'm saying there's a way to write an app which can use the read(2)
> trick, yet which does _not_ hang on older kernels.  Hence is robust.

How, if you do not change the kernel by making it return an extra flag ?



- Davide



* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
  2003-07-14  6:24                                 ` Davide Libenzi
@ 2003-07-14  6:57                                   ` Jamie Lokier
  0 siblings, 0 replies; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  6:57 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> > > > (d) SO_RCVLOWAT < s
> > > This does not apply with non-blocking fds.
> > Look at the line "if (copied >= target)" in tcp_recvmsg.
> 
> Look at this :
> 	timeo = sock_rcvtimeo(sk, nonblock);

How does `nonblock' prevent short reads?  I don't see it.

> > I disagree - inside a stream parser callback (e.g. XML transcoder) I
> > prefer to _not_ know the difference between pipe, file, tty and socket
> > that I am reading.
> 
> These are streams and you can use the read(2) trick w/out problems. I
> don't think you want to mount your XML parser over UDP.

You cannot use the read(2) trick with a tty or file; RDHUP doesn't help.

> > > > (c) kernel version <= 2.5.75
> > > Obviously, POLLRDHUP is not yet inside the kernel :)
> > Quite.  When you write an app that uses it and the read(2) trick
> > you'll see the bug which Eric brought up :)
> >
> > I'm saying there's a way to write an app which can use the read(2)
> > trick, yet which does _not_ hang on older kernels.  Hence is robust.
> 
> How, if you do not change the kernel by making it return an extra flag ?

By defining the interface such that _not_ setting the flag merely
suppresses the optimisation, it doesn't stop the program from working.

-- Jamie



* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-14  6:14               ` David Schwartz
@ 2003-07-14  7:20                 ` Jamie Lokier
  0 siblings, 0 replies; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14  7:20 UTC (permalink / raw)
  To: David Schwartz
  Cc: Entrope, Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List

David, you are pointing out that, given a certain base assumption, namely
that the number of active descriptors is proportional to the number of
inactive descriptors, epoll does not scale any differently from poll.

Correct.

Davide and I have pointed out that the key difference between
select/poll and epoll, mathematically, is that epoll time is
bounded by the number of active descriptors, and select/poll time is not.

Also correct.

(The latter is a logically stronger statement than the former because
it has fewer assumptions; however, that doesn't make its increased
range of applicability relevant.)

Now if you look at web server benchmarks and other artificial
benchmarks, rumour has it that epoll-based server CPU usage increases
linearly with load, and pages served increases linearly with the
number of requests, until the point where the server is maxed, after
which both these observables remain roughly constant with increasing
load.

Similar rumours have it that select/poll-based servers' CPU usage
increases in the same way, but not only do the observables increase
faster (irrelevant for this discussion), when the server is maxed its
number of pages served decreases (badly) with increasing load.

In this sense, it's useful to refer to epoll as more scalable: not the
linear part at the beginning, but later, when resources are exhausted.

Load here is defined as number of concurrent client connections.  In
fact, the number of active descriptors _is_ less than proportional to
the number of idle descriptors in this state, because the slower
responses act as a form of flow control on the rate of new connections
or new data coming in to the server.

If instead you define a test in terms of increasing the rate of
incoming connections, as if the clients are oblivious to each other,
then your point might be spot on.  But that is a different kind of
test, and it doesn't take away from the validity of the first kind.

-- Jamie


* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 23:09             ` Davide Libenzi
@ 2003-07-14  8:14               ` Alan Cox
  2003-07-14 15:03                 ` Davide Libenzi
  0 siblings, 1 reply; 58+ messages in thread
From: Alan Cox @ 2003-07-14  8:14 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David Schwartz, Jamie Lokier, Eric Varsanyi, Linux Kernel Mailing List

On Llu, 2003-07-14 at 00:09, Davide Libenzi wrote:
> On Sun, 13 Jul 2003, David Schwartz wrote:
> 
> > 	It's not O(N) with 'poll' and 'select'. Twice as many file descriptors
> > means twice as many active file descriptors which means twice as many
> > discovered per call to 'poll'. If the calls to 'poll' are further apart
> 
It is O(N), if N is the number of fds queried. The poll code does "at least"
2 * N passes over the set (plus other stuff), and hence it is O(N). Even
if you do N "nop"s in your implementation, it is O(N) from a
mathematical point of view.

You need to apply queueing theory and use a model of the distribution of
data arrival on the inputs/outputs to actually tell. The "it's O(N)" claim
is, like most such claims, probably only useful if data arrives
infinitely slowly, you have infinite RAM, and cache is not a factor.

For some loads poll/select are actually extremely efficient. X clients
batch commands up and there is a cost to switching between tasks for
different clients. Viewed as an entire system you actually get quite
interesting little graphs, especially in the critical load cases where
select/poll's batching effect makes throughput increase rapidly at 100%
CPU load, even if it gets you there far too early. Ditto with
webservers.



* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-14  8:14               ` Alan Cox
@ 2003-07-14 15:03                 ` Davide Libenzi
  0 siblings, 0 replies; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14 15:03 UTC (permalink / raw)
  To: Alan Cox
  Cc: David Schwartz, Jamie Lokier, Eric Varsanyi, Linux Kernel Mailing List

On Mon, 14 Jul 2003, Alan Cox wrote:

> For some loads poll/select are actually extremely efficient. X clients
> batch commands up and there is a cost to switching between tasks for
> different clients. Viewed as an entire system you actually get quite
> interesting little graphs, especially in the critical load cases where
> select/poll's batching effect makes throughput increase rapidly at 100%
> CPU load, even if it gets you there far too early. Ditto with
> webservers.

Indeed, poll/select are very nice APIs and you definitely want to use them
if your app does not have certain requirements. If N/M approaches 1, poll
scales (almost) exactly like epoll. But poll was not born to scale to
huge numbers of fds, and this has been recognized for ages. Yesterday I
pulled out a Mogul paper where (a long time ago) he talks about poll's
limits and also about three ideal APIs to deal with networking load :

declare_interest == epoll_ctl(EPOLL_CTL_ADD)
revoke_interest == epoll_ctl(EPOLL_CTL_DEL)
dequeue_next_events == epoll_wait()

This was a long time before epoll :)



- Davide



* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13  5:24   ` David S. Miller
  2003-07-13 14:07     ` Jamie Lokier
@ 2003-07-14 17:09     ` kuznet
  2003-07-14 17:09       ` Davide Libenzi
  2003-07-14 21:45       ` Jamie Lokier
  1 sibling, 2 replies; 58+ messages in thread
From: kuznet @ 2003-07-14 17:09 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidel, e0206, linux-kernel

Hello!

> Alexey, they seem to want to add some kind of POLLRDHUP thing,
> comments wrt. TCP and elsewhere in the networking?  See below...

I see. It is highly reasonable. Unlike SVR4 POLLHUP. :-)

Well, "elsewhere" is mostly af_unix, half-duplex close makes sense only
for tcp and af_unix.

Alexey


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-14 17:09     ` [Patch][RFC] epoll and half closed TCP connections kuznet
@ 2003-07-14 17:09       ` Davide Libenzi
  2003-07-14 21:45       ` Jamie Lokier
  1 sibling, 0 replies; 58+ messages in thread
From: Davide Libenzi @ 2003-07-14 17:09 UTC (permalink / raw)
  To: kuznet; +Cc: David S. Miller, e0206, Linux Kernel Mailing List

On Mon, 14 Jul 2003 kuznet@ms2.inr.ac.ru wrote:

> Hello!
>
> > Alexey, they seem to want to add some kind of POLLRDHUP thing,
> > comments wrt. TCP and elsewhere in the networking?  See below...
>
> I see. It is highly reasonable. Unlike SVR4 POLLHUP. :-)
>
> Well, "elsewhere" is mostly af_unix, half-duplex close makes sense only
> for tcp and af_unix.

If you guys agree, we can prepare a patch that does the handling in all
places where it makes sense, so that you can look at it.



- Davide



* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-14 17:09     ` [Patch][RFC] epoll and half closed TCP connections kuznet
  2003-07-14 17:09       ` Davide Libenzi
@ 2003-07-14 21:45       ` Jamie Lokier
  1 sibling, 0 replies; 58+ messages in thread
From: Jamie Lokier @ 2003-07-14 21:45 UTC (permalink / raw)
  To: kuznet; +Cc: David S. Miller, davidel, e0206, linux-kernel

kuznet@ms2.inr.ac.ru wrote:
> Well, "elsewhere" is mostly af_unix, half-duplex close makes sense only
> for tcp and af_unix.

What about sctp - can that do half-duplex close?

-- Jamie


* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-13 23:05           ` David Schwartz
                               ` (2 preceding siblings ...)
  2003-07-14  1:51             ` Jamie Lokier
@ 2003-07-15 20:27             ` James Antill
  2003-07-16  1:46               ` David Schwartz
  3 siblings, 1 reply; 58+ messages in thread
From: James Antill @ 2003-07-15 20:27 UTC (permalink / raw)
  To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List

"David Schwartz" <davids@webmaster.com> writes:

> 	This is really just due to bad coding in 'poll', or more precisely very bad
> for this case. For example, why is it allocating a wait queue buffer if the
> odds that it will need to wait are basically zero? Why is it adding file
> descriptors to the wait queue before it has determined that it needs to
> wait?

 Because this is much easier to do in userspace; it's just not very
well documented that you should almost always call poll() with a zero
timeout first. However, it's been there for years, and things have used
it[1].
 There are still optimizations that could have been done to poll() to
speed it up, but Linus has generally refused to add them.

[1] http://www.and.org/socket_poll/

-- 
# James Antill -- james@and.org
:0:
* ^From: .*james@and\.org
/dev/null


* RE: [Patch][RFC] epoll and half closed TCP connections
  2003-07-15 20:27             ` James Antill
@ 2003-07-16  1:46               ` David Schwartz
  2003-07-16  2:09                 ` James Antill
  0 siblings, 1 reply; 58+ messages in thread
From: David Schwartz @ 2003-07-16  1:46 UTC (permalink / raw)
  To: James Antill; +Cc: Linux Kernel Mailing List


> "David Schwartz" <davids@webmaster.com> writes:

> > 	This is really just due to bad coding in 'poll', or more
> > precisely very bad
> > for this case. For example, why is it allocating a wait queue
> > buffer if the
> > odds that it will need to wait are basically zero? Why is it adding file
> > descriptors to the wait queue before it has determined that it needs to
> > wait?

> Because this is much easier to do in userspace, it's just not very
> well documented that you should almost always call poll() with a zero
> timeout first.

	It's neither easier to do nor harder, it's basically the same code in
either place. However, doing it in kernel space saves the extra user/kernel
transition, poll set allocations, and copies across the u/k boundary in the
case where we do actually need to wait.

> However it's been there for years, and things have used
> it[1].

	The thing is, for some reason it (it being the cost of calling poll with a
constant timeout for 1,024 file descriptors) is exceptionally bad on Linux.
Worse than every other OS I've tested.

> There are still optimizations that could have been done to poll() to
> speed it up but Linus has generally refused to add them.

	Yep, so we invent new APIs to fix the deficiencies in the most common API's
implementation. Whatever.

	DS




* Re: [Patch][RFC] epoll and half closed TCP connections
  2003-07-16  1:46               ` David Schwartz
@ 2003-07-16  2:09                 ` James Antill
  0 siblings, 0 replies; 58+ messages in thread
From: James Antill @ 2003-07-16  2:09 UTC (permalink / raw)
  To: David Schwartz; +Cc: Linux Kernel Mailing List

"David Schwartz" <davids@webmaster.com> writes:

> > "David Schwartz" <davids@webmaster.com> writes:
> 
> > > 	This is really just due to bad coding in 'poll', or more
> > > precisely very bad
> > > for this case. For example, why is it allocating a wait queue
> > > buffer if the
> > > odds that it will need to wait are basically zero? Why is it adding file
> > > descriptors to the wait queue before it has determined that it needs to
> > > wait?
> 
> > Because this is much easier to do in userspace, it's just not very
> > well documented that you should almost always call poll() with a zero
> > timeout first.
> 
> 	It's neither easier to do nor harder, it's basically the same code in
> either place. However, doing it in kernel space saves the extra user/kernel
> transition, poll set allocations, and copies across the u/k boundary in the
> case where we do actually need to wait.

 Optimizing for the waiting case doesn't sound like the right
approach, IMO. And all things being equal, doing it outside the kernel
rules.
 Plus it's possible that someone could come up with a case where you
don't want to do it.

> > However it's been there for years, and things have used
> > it[1].
> 
> 	The thing is, for some reason it (it being the cost of calling poll with a
> constant timeout for 1,024 file descriptors) is exceptionally bad on Linux.
> Worse than every other OS I've tested.

 I'd put money on that being drastically reduced if the allocation
didn't happen every call though.

> > There are still optimizations that could have been done to poll() to
> > speed it up but Linus has generally refused to add them.
> 
> 	Yep, so we invent new APIs to fix the deficiencies in the most common API's
> implementation. Whatever.

 No, we know that there are conditions where, whatever you do to poll(),
the latency kills you. And to fix that we need new APIs. Personally
I'd prefer to have poll() be as good a level-triggered event mechanism
as it could be, and have epoll be just as good for edge-triggered events
as it could be ... but I'm not the one you need to convince.

-- 
# James Antill -- james@and.org
:0:
* ^From: .*james@and\.org
/dev/null


end of thread, other threads:[~2003-07-16  1:54 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-07-12 18:16 [Patch][RFC] epoll and half closed TCP connections Eric Varsanyi
2003-07-12 19:44 ` Jamie Lokier
2003-07-12 20:51   ` Eric Varsanyi
2003-07-12 20:48     ` Davide Libenzi
2003-07-12 21:19       ` Eric Varsanyi
2003-07-12 21:20         ` Davide Libenzi
2003-07-12 21:41         ` Davide Libenzi
2003-07-12 23:11           ` Eric Varsanyi
2003-07-12 23:55             ` Davide Libenzi
2003-07-13  1:05               ` Eric Varsanyi
2003-07-13 20:32       ` David Schwartz
2003-07-13 21:10         ` Jamie Lokier
2003-07-13 23:05           ` David Schwartz
2003-07-13 23:09             ` Davide Libenzi
2003-07-14  8:14               ` Alan Cox
2003-07-14 15:03                 ` Davide Libenzi
2003-07-14  1:27             ` Jamie Lokier
2003-07-13 21:14         ` Davide Libenzi
2003-07-13 23:05           ` David Schwartz
2003-07-13 23:11             ` Davide Libenzi
2003-07-13 23:52             ` Entrope
2003-07-14  6:14               ` David Schwartz
2003-07-14  7:20                 ` Jamie Lokier
2003-07-14  1:51             ` Jamie Lokier
2003-07-14  6:14               ` David Schwartz
2003-07-15 20:27             ` James Antill
2003-07-16  1:46               ` David Schwartz
2003-07-16  2:09                 ` James Antill
2003-07-13 13:12     ` Jamie Lokier
2003-07-13 16:55       ` Davide Libenzi
2003-07-12 20:01 ` Davide Libenzi
2003-07-13  5:24   ` David S. Miller
2003-07-13 14:07     ` Jamie Lokier
2003-07-13 17:00       ` Davide Libenzi
2003-07-13 19:15         ` Jamie Lokier
2003-07-13 23:03           ` Davide Libenzi
2003-07-14  1:41             ` Jamie Lokier
2003-07-14  2:24               ` POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections) Jamie Lokier
2003-07-14  2:37                 ` Davide Libenzi
2003-07-14  2:43                   ` Davide Libenzi
2003-07-14  2:56                   ` Jamie Lokier
2003-07-14  3:02                     ` Davide Libenzi
2003-07-14  3:16                       ` Jamie Lokier
2003-07-14  3:21                         ` Davide Libenzi
2003-07-14  3:42                           ` Jamie Lokier
2003-07-14  4:00                             ` Davide Libenzi
2003-07-14  5:51                               ` Jamie Lokier
2003-07-14  6:24                                 ` Davide Libenzi
2003-07-14  6:57                                   ` Jamie Lokier
2003-07-14  3:12                     ` Jamie Lokier
2003-07-14  3:17                       ` Davide Libenzi
2003-07-14  3:35                         ` Jamie Lokier
2003-07-14  3:04                   ` Jamie Lokier
2003-07-14  3:12                     ` Davide Libenzi
2003-07-14  3:27                       ` Jamie Lokier
2003-07-14 17:09     ` [Patch][RFC] epoll and half closed TCP connections kuznet
2003-07-14 17:09       ` Davide Libenzi
2003-07-14 21:45       ` Jamie Lokier
