* [Patch][RFC] epoll and half closed TCP connections
@ 2003-07-12 18:16 Eric Varsanyi
  2003-07-12 19:44 ` Jamie Lokier
  2003-07-12 20:01 ` Davide Libenzi
  0 siblings, 2 replies; 58+ messages in thread

From: Eric Varsanyi @ 2003-07-12 18:16 UTC
To: linux-kernel; +Cc: davidel

I'm proposing adding a new POLL event type (POLLRDHUP) as a way to solve a
new race introduced by having an edge-triggered event mechanism (epoll).
The problem occurs when a client writes data and then does a write-side
shutdown(). The server (using epoll) sees only one event for both the
read-data-ready and the read-EOF condition, and has no way to tell that an
EOF occurred.

-Eric Varsanyi

Details
-------

- remote sends data and does a shutdown
  [ we see a data bearing packet and FIN from client on the wire ]
- user mode server gets around to doing accept() and registers for EPOLLIN
  events (along with HUP and ERR, which are forced on)
- epoll_wait() returns a single EPOLLIN event on the FD representing both
  the 1/2-shutdown state and the data available

At this point there is no way the app can tell there is a half-closed
connection, so it may issue a close() back to the client after writing
results. Normally the server would distinguish these events by assuming
EOF if it got a read-ready indication and the first read returned 0 bytes,
or would issue read calls until less data was returned than was asked for.

In a level-triggered world this all just works because the read-ready
indication is driven back to the app as long as the socket state is half
closed. The event-driven epoll mechanism folds these two indications
together and thus loses one 'edge'. One would be tempted to issue an extra
read() after getting back less than expected, but this is an extra system
call on every read event, and you get back the same '0' bytes that you get
if the buffer is just empty.
The only sure bet seems to be CTL_MODding the FD to force a re-poll (which
would cost a syscall and hash lookup in eventpoll for every read event).
The POLLHUP indication is specifically not used in this 1/2-closed state
since it is (by POSIX) not allowed to be masked, and it would interfere
with EPOLLOUT back to the client if it were set.

I considered 2 possible ideas:

1) have epoll return a count of events represented by a single appearance
   in the list after an epoll_wait(); this could be used as a hint to make
   some other (tbd) syscall to find out if the socket was in the
   half-closed state, and normally not have to pay an extra syscall on
   every read; this seems like a lot of tap dancing for a workaround

2) add a new 1/2-closed event type that a poll routine can return

The implementation is trivial; a patch is included below. If this idea
sees favor I'll fix the other architectures, ipv6, epoll.h, and make a
'real' patch. I do not believe any drivers deserve to be modified to
return this new event.

This should not break existing programs: it still returns the read-ready
indication as before, and unless an app registers for this new event it
won't be woken up an extra time. Epoll programs will not get the extra
event indicated unless they ask for it. I dislike extending such an
established API, but co-opting any of the other event types just seems
wrong.
Other ideas and comments would be appreciated,

-Eric Varsanyi

diff -Naur linux-2.4.20/include/asm-i386/poll.h linux-2.4.20_ev/include/asm-i386/poll.h
--- linux-2.4.20/include/asm-i386/poll.h	Thu Jan 23 13:01:28 1997
+++ linux-2.4.20_ev/include/asm-i386/poll.h	Sat Jul 12 12:29:11 2003
@@ -15,6 +15,7 @@
 #define POLLWRNORM	0x0100
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
+#define POLLRDHUP	0x0800
 
 struct pollfd {
 	int fd;
diff -Naur linux-2.4.20/net/ipv4/tcp.c linux-2.4.20_ev/net/ipv4/tcp.c
--- linux-2.4.20/net/ipv4/tcp.c	Tue Jul  8 09:40:42 2003
+++ linux-2.4.20_ev/net/ipv4/tcp.c	Sat Jul 12 12:29:56 2003
@@ -424,7 +424,7 @@
 	if (sk->shutdown == SHUTDOWN_MASK || sk->state == TCP_CLOSE)
 		mask |= POLLHUP;
 	if (sk->shutdown & RCV_SHUTDOWN)
-		mask |= POLLIN | POLLRDNORM;
+		mask |= POLLIN | POLLRDNORM | POLLRDHUP;
 
 	/* Connected? */
 	if ((1 << sk->state) & ~(TCPF_SYN_SENT|TCPF_SYN_RECV)) {

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Jamie Lokier @ 2003-07-12 19:44 UTC
To: Eric Varsanyi; +Cc: linux-kernel, davidel

Eric Varsanyi wrote:
> - epoll_wait() returns a single EPOLLIN event on the FD representing
>   both the 1/2 shutdown state and data available

Correct.

> At this point there is no way the app can tell if there is a half closed
> connection so it may issue a close() back to the client after writing
> results. Normally the server would distinguish these events by assuming
> EOF if it got a read ready indication and the first read returned 0
> bytes, or would issue read calls until less data was returned than was
> asked for.
>
> In a level triggered world this all just works because the read ready
> indication is driven back to the app as long as the socket state is half
> closed. The event driven epoll mechanism folds these two indications
> together and thus loses one 'edge'.

Well then, use epoll's level-triggered mode. It's quite easy - it's the
default now. :)

If there's an EOF condition pending after you called read(), and then you
call epoll_wait(), you _should_ see another EPOLLIN condition immediately.

If you aren't seeing epoll_wait() return with EPOLLIN when there's an EOF
pending, *and* you haven't set EPOLLET in the event flags, that's a bug in
epoll. Is that what you're seeing?

-- Jamie
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Eric Varsanyi @ 2003-07-12 20:51 UTC
To: Jamie Lokier; +Cc: linux-kernel, davidel

> > In a level triggered world this all just works because the read ready
> > indication is driven back to the app as long as the socket state is
> > half closed. The event driven epoll mechanism folds these two
> > indications together and thus loses one 'edge'.
>
> Well then, use epoll's level-triggered mode. It's quite easy - it's
> the default now. :)

The problem with all the level-triggered schemes (poll, select, epoll w/o
EPOLLET) is that they call every driver and poll status for every call
into the kernel. This appeared to be killing my app's performance, and I
verified it by writing some simple micro-benchmarks. I can post the
details and code if anyone is interested; the summary is that on 1 idle FD
poll, select, and epoll all take about 900ns on a 2.8GHz P4 (around the
overhead of any syscall). With 10 idle FDs (pipes or sockets) the overhead
is around 2.5us; at 400 FDs (a light load for this app) we're up to 80us
per call, and the app is spending almost 100% of one CPU in system mode
with even a light tickling of I/O activity on a few of the FDs.

As we start to scale up to production-sized fd sets it gets crazy: around
8000 completely idle FDs the cost is 4ms per syscall. At this point even a
high real load (which gathers lots of I/O per call) doesn't cover the now
very high latency for each trip into the kernel to gather more work. What
was interesting is that the response time was non-linear up to around
400-500 FDs, then went steep and linear after that, so you pay much more
(maybe due to some cache effects, I didn't pursue it) for each connecting
client in a light-load environment. This is not web traffic; the clients
typically connect and sit mostly idle.

> If there's an EOF condition pending after you called read(), and then
> you call epoll_wait(), you _should_ see another EPOLLIN condition
> immediately.
>
> If you aren't seeing epoll_wait() return with EPOLLIN when there's an
> EOF pending, *and* you haven't set EPOLLET in the event flags, that's
> a bug in epoll. Is that what you're seeing?

No, I am not asserting there is a problem with epoll in level-triggered
mode (I've only used poll and select in level-triggered mode, so I can't
say for sure it works either).

-Eric Varsanyi
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Davide Libenzi @ 2003-07-12 20:48 UTC
To: Eric Varsanyi; +Cc: Linux Kernel Mailing List

On Sat, 12 Jul 2003, Eric Varsanyi wrote:
> > > In a level triggered world this all just works because the read ready
> > > indication is driven back to the app as long as the socket state is
> > > half closed. The event driven epoll mechanism folds these two
> > > indications together and thus loses one 'edge'.
> >
> > Well then, use epoll's level-triggered mode. It's quite easy - it's
> > the default now. :)
>
> The problem with all the level triggered schemes (poll, select, epoll w/o
> EPOLLET) is that they call every driver and poll status for every call
> into the kernel. This appeared to be killing my app's performance and I
> verified by writing some simple micro benchmarks.

Look, this is false for epoll. Given N fds inside the set and M hot/ready
fds, epoll scales O(M) and not O(N) (like poll/select). There's a huge
difference, especially with real loads.

- Davide
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Eric Varsanyi @ 2003-07-12 21:19 UTC
To: Davide Libenzi; +Cc: Linux Kernel Mailing List

> > The problem with all the level triggered schemes (poll, select, epoll
> > w/o EPOLLET) is that they call every driver and poll status for every
> > call into the kernel. This appeared to be killing my app's performance
> > and I verified by writing some simple micro benchmarks.
>
> Look this is false for epoll. Given N fds inside the set and M hot/ready
> fds, epoll scales O(M) and not O(N) (like poll/select). There's a huge
> difference, especially with real loads.

Apologies, I did not benchmark epoll level-triggered, just select. The man
page claimed epoll in level-triggered mode was just a better interface, so
I assumed it had to call each driver to check status. Reading through it I
see (I think) the clever trick of just re-polling things that have already
triggered (basically polling just for the trailing edge after having seen
a leading edge async) - cool!

If it seems unpopular/unwise to add the extra poll event to show read EOF,
using this level-triggered mode would likely do the job for my app (the
extra polls every time for un-consumed events will be nothing compared to
calling every fd's poll every time).

I guess my only argument would be that edge-triggered mode isn't really
workable with TCP connections if there's no way to resolve the ambiguity
between EOF and no data in the buffer (at least w/o an extra syscall). I
just realized that the race you mention in the man page (reading data from
the 'next' event that hasn't been polled into user mode yet) will lead to
the same issue: how do you know if you got this event because you consumed
the data on the previous interrupt, or because this is an EOF condition?

-Eric Varsanyi
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Davide Libenzi @ 2003-07-12 21:20 UTC
To: Eric Varsanyi; +Cc: Linux Kernel Mailing List

On Sat, 12 Jul 2003, Eric Varsanyi wrote:
> Apologies, I did not benchmark epoll level triggered, just select.
> The man page claimed epoll in level triggered mode was just a better
> interface so I assumed it had to call each driver to check status.

It is neither better nor worse. For sure it is closer to what existing
apps do, and it is easier to understand.

> Reading thru it I see (I think) the clever trick of just repolling
> things that have already triggered (basically polling just for the
> trailing edge after having seen a leading edge async), cool!
>
> If it seems unpopular/unwise to add the extra poll event to show read
> EOF using this level triggered mode would likely do the job for my app
> (the extra polls every time for un-consumed events will be nothing
> compared to calling every fd's poll every time).

Even if it is true that you can drop an extra read(), having the RDHUP
event will cost exactly zero extra CPU cycles inside the kernel, and one
changed line of code (plus arch definitions in asm/poll.h). To me, it
looks acceptable. Let's see what DaveM thinks about it.

- Davide
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Davide Libenzi @ 2003-07-12 21:41 UTC
To: Eric Varsanyi; +Cc: Linux Kernel Mailing List

On Sat, 12 Jul 2003, Eric Varsanyi wrote:
> I guess my only argument would be that edge triggered mode isn't really
> workable with TCP connections if there's no way to solve the ambiguity
> between EOF and no data in buffer (at least w/o an extra syscall). I just
> realized that the race you mention in the man page (reading data from
> the 'next' event that hasn't been polled into user mode yet) will lead
> to the same issue: how do you know if you got this event because you
> consumed the data on the previous interrupt or if this is an EOF
> condition.

(Sorry, I missed this)
You can work that out very easily. When your read/write returns a lower
number of bytes than requested, it means that it is time to stop
processing this fd. If events happened meanwhile, you will get them at the
next epoll_wait(). If not, you will get them the next time they happen.
There's no blind spot if you follow this simple rule, and you do not even
pay for the extra syscall that returns EAGAIN.

- Davide
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Eric Varsanyi @ 2003-07-12 23:11 UTC
To: Davide Libenzi; +Cc: Linux Kernel Mailing List

> > I guess my only argument would be that edge triggered mode isn't
> > really workable with TCP connections if there's no way to solve the
> > ambiguity between EOF and no data in buffer (at least w/o an extra
> > syscall). I just realized that the race you mention in the man page
> > (reading data from the 'next' event that hasn't been polled into user
> > mode yet) will lead to the same issue: how do you know if you got this
> > event because you consumed the data on the previous interrupt or if
> > this is an EOF condition.
>
> (Sorry, I missed this)
> You can work that out very easily. When your read/write returns a lower
> number of bytes, it means that it is time to stop processing this fd. If
> events happened meanwhile, you will get them at the next epoll_wait().
> If not, the next time they'll happen. There's no blind spot if you
> follow this simple rule, and you do not even have the extra syscall with
> EAGAIN.

The scenario that I think is still uncovered (edge trigger only):

    User                                 Kernel
    --------                             ----------
                                         Read data added to socket
                                         Socket posts read event to epfd
    epoll_wait()                         Event cleared from epfd, EPOLLIN
                                         returned to user
                                         More read data added to socket
                                         Socket posts a new read event
                                         to epfd
    read() until short read with EAGAIN  All data read from socket
    epoll_wait()                         Returns another EPOLLIN for the
                                         socket and clears it from epfd
    read(), returns 0 right away         Socket buffer is empty

This is your 'false positive' case in the epoll(4) man page. How does the
app tell the 0 read here from a read EOF coming from the remote? If it
assumes this is a false positive and there was also an EOF indication, the
EOF will be lost; if it assumes it is an EOF, the connection will be
prematurely terminated.

-Eric
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Davide Libenzi @ 2003-07-12 23:55 UTC
To: Eric Varsanyi; +Cc: Linux Kernel Mailing List

On Sat, 12 Jul 2003, Eric Varsanyi wrote:
> The scenario that I think is still uncovered (edge trigger only):
>
>     User                                 Kernel
>     --------                             ----------
>                                          Read data added to socket
>                                          Socket posts read event to epfd
>     epoll_wait()                         Event cleared from epfd, EPOLLIN
>                                          returned to user
>                                          More read data added to socket
>                                          Socket posts a new read event
>                                          to epfd
>     read() until short read with EAGAIN  All data read from socket
>     epoll_wait()                         Returns another EPOLLIN for the
>                                          socket and clears it from epfd
>     read(), returns 0 right away         Socket buffer is empty

read will return -1 with errno=EAGAIN in that case, not zero.

- Davide
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Eric Varsanyi @ 2003-07-13 1:05 UTC
To: Davide Libenzi; +Cc: Linux Kernel Mailing List

> > read(), returns 0 right away         Socket buffer is empty
>
> read will return -1 with errno=EAGAIN in that case, not zero.

Yes, my mistake. So the real issue (of the patch) is just the original
thing I posted about: you can't tell without another read() syscall
whether an EOF has piggybacked in on an EPOLLIN event.

Thanks for being patient.

-Eric Varsanyi
* RE: [Patch][RFC] epoll and half closed TCP connections
From: David Schwartz @ 2003-07-13 20:32 UTC
To: Davide Libenzi, Eric Varsanyi; +Cc: Linux Kernel Mailing List

> Look this is false for epoll. Given N fds inside the set and M hot/ready
> fds, epoll scales O(M) and not O(N) (like poll/select). There's a huge
> difference, especially with real loads.
>
> - Davide

For most real-world loads, M is some fraction of N. The fraction
asymptotically approaches 1 as load increases, because under load it takes
you longer to get back to polling, so a higher fraction of the descriptors
will be ready when you do.

Even if you argue that most real-world loads consist of a few very busy
file descriptors and a lot of idle file descriptors, why would you think
that this ratio changes as the number of connections increases? Say a
group of two servers is handling a bunch of connections. Some of those
connections will be very active and some will be very idle. But surely the
*percentage* of active connections won't change just because the
connections are split over the servers 50/50 rather than 10/90.

If a particular protocol and usage sees 10 idle connections for every
active one, then N will be ten times M, and O(M) will be the same as O(N).
It's only if a higher percentage of connections are idle when there are
more connections (which seems an extreme rarity to me) that O(M) is better
than O(N).

Is there any actual evidence to suggest that epoll scales better than poll
for "real loads"? Tests with increasing numbers of idle file descriptors
while the active count stays constant are not real loads.

By the way, I'm not arguing against epoll. I believe it will use fewer
resources than poll in pretty much every conceivable situation. I simply
take issue with the argument that it has better ultimate scalability or
scales at a different order.

DS
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Jamie Lokier @ 2003-07-13 21:10 UTC
To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List

David Schwartz wrote:
> For most real-world loads, M is some fraction of N. The fraction
> asymptotically approaches 1 as load increases because under load it
> takes you longer to get back to polling, so a higher fraction of the
> descriptors will be ready when you do.

Ah, but as the fraction approaches 1, you'll find that you are
asymptotically approaching the point where you can't handle the load
_regardless_ of epoll overhead.

> By the way, I'm not arguing against epoll. I believe it will use less
> resources than poll in pretty much every conceivable situation. I simply
> take issue with the argument that it has better ultimate scalability or
> scales at a different order.

It scales according to the amount of work pending, which means that it
doesn't take any _more_ time than actually doing the pending work. (This
assumes you use epoll appropriately; there are many ways to use epoll
which don't have this property.)

That was always the complaint about select() and poll(): they dominate the
run time for large numbers of connections. epoll, on the other hand, will
always be in the noise relative to other work.

If you want a formula for slides :), time_polling/time_working is O(1)
with epoll, but O(N) with poll() & select().

-- Jamie
* RE: [Patch][RFC] epoll and half closed TCP connections
From: David Schwartz @ 2003-07-13 23:05 UTC
To: Jamie Lokier; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List

> David Schwartz wrote:
> > For most real-world loads, M is some fraction of N. The fraction
> > asymptotically approaches 1 as load increases because under load it
> > takes you longer to get back to polling, so a higher fraction of the
> > descriptors will be ready when you do.
>
> Ah, but as the fraction approaches 1, you'll find that you are
> asymptotically approaching the point where you can't handle the load
> _regardless_ of epoll overhead.

This has not been my experience. On pretty much every OS except Linux, my
experience has been that as you spend more time doing work, each call to
'poll' discovers more file descriptors ready. Further, the number of bytes
you can send/receive is greater (because it took you longer to get back to
the same connection), so again, the amount of work you do per call to
'poll' goes way up. I think most of the problem is just that Linux's
'poll' is extremely expensive, and not due to any inherent API benefit of
'epoll'.

> > By the way, I'm not arguing against epoll. I believe it will use less
> > resources than poll in pretty much every conceivable situation. I
> > simply take issue with the argument that it has better ultimate
> > scalability or scales at a different order.
>
> It scales according to the amount of work pending, which means that it
> doesn't take any _more_ time than actually doing the pending work.
> (This assumes you use epoll appropriately; there are many ways to use
> epoll which don't have this property).

But so does 'poll'. If you double the number of active and inactive
connections, 'poll' takes twice as long. But you do twice as much per call
to 'poll'. You will both discover more connections ready to work on and
move more bytes per connection as the load increases.

> That was always the complaint about select() and poll(): they dominate
> the run time for large numbers of connections. epoll, on the other
> hand, will always be in the noise relative to other work.

I think this is largely true for Linux because of the bad implementation
of 'poll', and therefore of 'select'.

> If you want a formula for slides :), time_polling/time_working is O(1)
> with epoll, but O(N) with poll() & select().

It's not O(N) with 'poll' and 'select'. Twice as many file descriptors
means twice as many active file descriptors, which means twice as many
discovered per call to 'poll'. If the calls to 'poll' are further apart
(because of the additional real work done in between calls), it means more
than twice as many discovered per call to 'poll'. Add to this that you
will find more bytes ready to read, or more space in the send queue, per
call to 'poll' as the load goes up.

DS
* RE: [Patch][RFC] epoll and half closed TCP connections
From: Davide Libenzi @ 2003-07-13 23:09 UTC
To: David Schwartz; +Cc: Jamie Lokier, Eric Varsanyi, Linux Kernel Mailing List

On Sun, 13 Jul 2003, David Schwartz wrote:
> It's not O(N) with 'poll' and 'select'. Twice as many file descriptors
> means twice as many active file descriptors which means twice as many
> discovered per call to 'poll'. If the calls to 'poll' are further apart

It is O(N), if N is the number of fds queried. The poll code does "at
least" 2 * N loops over the set (plus other stuff), and hence it is O(N).
Even if you do N "nop"s in your implementation, this is O(N) from a
mathematical point of view.

- Davide
* RE: [Patch][RFC] epoll and half closed TCP connections
From: Alan Cox @ 2003-07-14 8:14 UTC
To: Davide Libenzi; +Cc: David Schwartz, Jamie Lokier, Eric Varsanyi, Linux Kernel Mailing List

On Llu, 2003-07-14 at 00:09, Davide Libenzi wrote:
> It is O(N), if N is the number of fds queried. The poll code does "at
> least" 2 * N loops over the set (plus other stuff), and hence it is
> O(N). Even if you do N "nop"s in your implementation, this is O(N) from
> a mathematical point of view.

You need to apply queueing theory, and use a model of the distribution of
data arrival on the inputs/outputs, to actually tell. The "it's O(N)"
claim is, like most such claims, probably only useful if data arrives
infinitely slowly, you have infinite RAM, and cache is not a factor.

For some loads poll/select are actually extremely efficient. X clients
batch commands up, and there is a cost to switching between tasks for
different clients. Viewed as an entire system you actually get quite
interesting little graphs, especially in the critical-load cases where
select/poll's batching effect makes throughput increase rapidly at 100%
CPU load, even if it gets you there far too early. Ditto with webservers.
* RE: [Patch][RFC] epoll and half closed TCP connections
From: Davide Libenzi @ 2003-07-14 15:03 UTC
To: Alan Cox; +Cc: David Schwartz, Jamie Lokier, Eric Varsanyi, Linux Kernel Mailing List

On Mon, 14 Jul 2003, Alan Cox wrote:
> For some loads poll/select are actually extremely efficient. X clients
> batch commands up and there is a cost to switching between tasks for
> different clients. Viewed as an entire system you actually get quite
> interesting little graphs, especially in the critical load cases where
> select/poll's batching effect makes throughput increase rapidly at 100%
> CPU load, even if it gets you there far too early. Ditto with
> webservers.

Indeed, poll/select are very nice APIs, and you definitely want to use
them if your app does not have certain requirements. If N/M approaches 1,
poll scales (almost) exactly like epoll. But poll was not born to scale to
huge numbers of fds, and this has been recognized for ages. Yesterday I
pulled out a Mogul paper where (a long time ago) he talks about poll's
limits, and he also talks about three ideal APIs to deal with networking
load:

	declare_interest     == epoll_ctl(EPOLL_CTL_ADD)
	revoke_interest      == epoll_ctl(EPOLL_CTL_DEL)
	dequeue_next_events  == epoll_wait()

This was a long time before epoll :)

- Davide
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Jamie Lokier @ 2003-07-14 1:27 UTC
To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List

David Schwartz wrote:
> But so does 'poll'. If you double the number of active and inactive
> connections, 'poll' takes twice as long. But you do twice as much per
> call to 'poll'. You will both discover more connections ready to do work
> on and move more bytes per connection as the load increases.

Well, with that assumption, sure, there is nothing _to_ scale - poll and
select are perfect.

> > That was always the complaint about select() and poll(): they dominate
> > the run time for large numbers of connections. epoll, on the other
> > hand, will always be in the noise relative to other work.
>
> I think this is largely true for Linux because of bad implementation of
> 'poll' and therefore 'select'.

It's true those implementations could use clever methods to reduce the
amount of time they take by caching poll results, and probably approach
epoll speeds. However, their APIs have a built-in problem: at minimum, the
kernel has to read and write O(N) memory per call. With your assumption of
a fixed ratio of active to idle sockets, where that ratio is not very
small (10% is not very small), of course the API is not a problem.

> > If you want a formula for slides :), time_polling/time_working is O(1)
> > with epoll, but O(N) with poll() & select().
>
> It's not O(N) with 'poll' and 'select'. Twice as many file descriptors
> means twice as many active file descriptors which means twice as many
> discovered per call to 'poll'.

It is O(N). When the load pattern follows your example, O(N) == O(1) :)

-- Jamie
* RE: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 20:32 ` David Schwartz 2003-07-13 21:10 ` Jamie Lokier @ 2003-07-13 21:14 ` Davide Libenzi 2003-07-13 23:05 ` David Schwartz 1 sibling, 1 reply; 58+ messages in thread From: Davide Libenzi @ 2003-07-13 21:14 UTC (permalink / raw) To: David Schwartz; +Cc: Eric Varsanyi, Linux Kernel Mailing List On Sun, 13 Jul 2003, David Schwartz wrote: > For most real-world loads, M is some fraction of N. The fraction > asymptotically approaches 1 as load increases because under load it takes > you longer to get back to polling, so a higher fraction of the descriptors > will be ready when you do. > > Even if you argue that most real-world loads consists of a few very busy > file descriptors and a lot of idle file descriptors, why would you think > that this ratio changes as the number of connections increase? Say a group > of two servers is handling a bunch of connections. Some of those connections > will be very active and some will be very idle. But surely the *percentage* > of active connections won't change just because the connections are split > over the servers 50/50 rather than 10/90. > > If a particular protocol and usage sees 10 idle connections for every > active one, then N will be ten times M, and O(M) will be the same as O(N). > It's only if a higher percentage of connections are idle when there are more > connections (which seems an extreme rarity to me) that O(M) is better than > O(N). Apologies, I abused O(*) (hopefully none of my math profs are on lkml :). Yes, N/M has little/no fluctuation in the N domain. So, using O(*) correctly, they both scale O(N). But we can trivially say that, if we call CP the cost of poll() in CPU cycles, and CE the cost of epoll:

CP(N, M) = KP * N
CE(N, M) = KE * M

where KP and KE are constants that depend on the code architecture of the two systems. 
If we fix KA (the active coefficient): KA = M / N, we can write the scalability coefficient as:

         KP * N         KP
KS = ------------- = ---------
      KE * KA * N     KE * KA

The scalability coefficient is clearly inversely proportional to KA. Let's look at what the poll code does :

1) It has to allocate the kernel buffer for events

2) It has to copy it from userspace

3) It has to allocate wait queue buffer calling get_free_page (possibly multiple times when we talk about decent fds numbers)

4) It has to loop calling N times f_op->poll() that in turn will add into the wait queue getting/releasing IRQ locks

5) Loop another M loop to copy events to userspace

6) Call kfree() for all blocks allocated

7) Call poll_freewait() that will go with another N loop to unregister poll waits, that in turn will do another N IRQ locks

The epoll code does remember/cache things so that KE is largely lower than KP, and this together with a pretty low KA explains the results about poll scalability against epoll. > Is there any actual evidence to suggest that epoll scales better than poll > for "real loads"? Tests with increasing numbers of idle file descriptors as > the active count stays constant are not real loads. Yes, of course. The time spent inside poll/select becomes a PITA when you start dealing with huge numbers of fds. And this is kernel time. This does not obviously mean that if epoll is 10 times faster than poll under load, and you switch your app to epoll, it'll be ten times faster. It means that the kernel time spent inside poll will be 1/10. And many of the operations done by poll require IRQ locks, and this increases the time the kernel spends with disabled IRQs, which is never a good thing. - Davide ^ permalink raw reply [flat|nested] 58+ messages in thread
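The cost model above can be illustrated numerically. The constants below are invented; only the shape of the formulas comes from the thread:

```c
/* Toy rendering of the CP/CE cost model.  KP, KE, KA values here are
 * made-up illustrations, not measurements. */

/* CP(N) = KP * N : poll's cost touches every registered fd. */
double poll_cost(double kp, double n)
{
    return kp * n;
}

/* CE = KE * M = KE * KA * N : epoll's cost touches only the ready fds. */
double epoll_cost(double ke, double ka, double n)
{
    return ke * ka * n;
}

/* KS = CP / CE = KP / (KE * KA) : independent of N, inverse in KA. */
double scalability_coeff(double kp, double ke, double ka)
{
    return kp / (ke * ka);
}
```

With, say, KP = 1, KE = 2, and 25% of fds active, poll still costs twice what epoll does per call, at any N; shrink KA to web-server-like levels and the ratio grows accordingly.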
* RE: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 21:14 ` Davide Libenzi @ 2003-07-13 23:05 ` David Schwartz 2003-07-13 23:11 ` Davide Libenzi ` (3 more replies) 0 siblings, 4 replies; 58+ messages in thread From: David Schwartz @ 2003-07-13 23:05 UTC (permalink / raw) To: Davide Libenzi; +Cc: Eric Varsanyi, Linux Kernel Mailing List > Let's look at what the poll code does : > > 1) It has to allocate the kernel buffer for events > > 2) It has to copy it from userspace > > 3) It has to allocate wait queue buffer calling get_free_page (possibly > multiple times when we talk about decent fds numbers) > > 4) It has to loop calling N times f_op->poll() that in turn will add into > the wait queue getting/releasing IRQ locks > > 5) Loop another M loop to copy events to userspace > > 6) Call kfree() for all blocks allocated > > 7) Call poll_freewait() that will go with another N loop to unregister > poll waits, that in turn will do another N IRQ locks This is really just due to bad coding in 'poll', or more precisely very bad for this case. For example, why is it allocating a wait queue buffer if the odds that it will need to wait are basically zero? Why is it adding file descriptors to the wait queue before it has determined that it needs to wait? As load increases, more and more calls to 'poll' require no waiting. Yet 'poll' is heavily optimized for the 'no or low load' case. That's why 'poll' doesn't scale on Linux. > Yes, of course. The time spent inside poll/select becomes a PITA when you > start dealing with huge number of fds. And this is kernel time. This does > not obviously mean that if epoll is 10 times faster than poll under load, > and you switch your app on epoll, it'll be ten times faster. It means that > the kernel time spent inside poll will be 1/10. And many of the operations > done by poll require IRQ locks and this increase the time the kernel > spend with disabled IRQs, that is never a good thing. 
My experience has been that this is a huge problem with Linux but not with any other OS. It can be solved in user-space with some other penalties by an adaptive sleep before each call to 'poll' and polling with a zero timeout (thus avoiding the wait queue pain). But all the deficiencies in the 'poll' implementation in the world won't show anything except that 'poll' is badly implemented. > - Davide DS ^ permalink raw reply [flat|nested] 58+ messages in thread
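One shape of the zero-timeout trick described above can be sketched as follows; `poll_nonblocking_first` is an invented name, and the adaptive-sleep policy is elided:

```c
#include <poll.h>

/* Hypothetical helper: probe with a zero timeout first, so the kernel
 * never sets up wait-queue entries when work is already pending, and
 * only fall back to a blocking poll() when nothing was ready. */
int poll_nonblocking_first(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    /* Fast path: timeout 0 means poll() never sleeps, so no wait-queue
     * registration is needed; under load this usually finds work. */
    int n = poll(fds, nfds, 0);
    if (n != 0)
        return n;               /* ready fds found, or an error */

    /* Slow path: nothing ready, pay for the full blocking poll(). */
    return poll(fds, nfds, timeout_ms);
}
```

This trades one extra syscall in the idle case for skipping the wait-queue setup in the busy case, which is the case that dominates a loaded server.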
* RE: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 23:05 ` David Schwartz @ 2003-07-13 23:11 ` Davide Libenzi 2003-07-13 23:52 ` Entrope ` (2 subsequent siblings) 3 siblings, 0 replies; 58+ messages in thread From: Davide Libenzi @ 2003-07-13 23:11 UTC (permalink / raw) To: David Schwartz; +Cc: Eric Varsanyi, Linux Kernel Mailing List On Sun, 13 Jul 2003, David Schwartz wrote: > > > Let's look at what the poll code does : > > > > 1) It has to allocate the kernel buffer for events > > > > 2) It has to copy it from userspace > > > > 3) It has to allocate wait queue buffer calling get_free_page (possibly > > multiple times when we talk about decent fds numbers) > > > > 4) It has to loop calling N times f_op->poll() that in turn will add into > > the wait queue getting/releasing IRQ locks > > > > 5) Loop another M loop to copy events to userspace > > > > 6) Call kfree() for all blocks allocated > > > > 7) Call poll_freewait() that will go with another N loop to unregister > > poll waits, that in turn will do another N IRQ locks > > This is really just due to bad coding in 'poll', or more precisely very bad > for this case. For example, why is it allocating a wait queue buffer if the > odds that it will need to wait are basically zero? Why is it adding file > descriptors to the wait queue before it has determined that it needs to > wait? > > As load increases, more and more calls to 'poll' require no waiting. Yet > 'poll' is heavily optimized for the 'no or low load' case. That's why 'poll' > doesn't scale on Linux. However you implement poll(2), you have to do at least one iteration over the interest set, and hence your implementation will be O(N). - Davide ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 23:05 ` David Schwartz 2003-07-13 23:11 ` Davide Libenzi @ 2003-07-13 23:52 ` Entrope 2003-07-14 6:14 ` David Schwartz 2003-07-14 1:51 ` Jamie Lokier 2003-07-15 20:27 ` James Antill 3 siblings, 1 reply; 58+ messages in thread From: Entrope @ 2003-07-13 23:52 UTC (permalink / raw) To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List "David Schwartz" <davids@webmaster.com> writes: > This is really just due to bad coding in 'poll', or more precisely very bad > for this case. For example, why is it allocating a wait queue buffer if the > odds that it will need to wait are basically zero? Why is it adding file > descriptors to the wait queue before it has determined that it needs to > wait? > > As load increases, more and more calls to 'poll' require no waiting. Yet > 'poll' is heavily optimized for the 'no or low load' case. That's why 'poll' > doesn't scale on Linux. Your argument is bogus. My first-hand experience is with IRC servers, which customarily have thousands of connections at once, with a very few percent active in a given check. The scaling problem is not with the length of waiting or how poll() is optimized -- it is with the overhead *inherent* to processing poll(). Common IRC servers spend 100% of CPU when using poll() for only a few thousand clients. Those same servers, using FreeBSD's kqueue()/kevent() API, use well under 10% of the CPU. Yes, the amount of time spent doing useful work increases as the poll() load increases -- but the time wasted setting up and checking activity for poll() is something you can never reclaim, and which only goes up as your CPU gets faster. epoll() makes you pay the cost of updating the interest list only when the list changes; poll() makes you pay the cost every time you call it. Empirically, four of the five biggest IRC networks run server software that prefers kqueue() on FreeBSD. 
kqueue() did not cause them to be large, but using kqueue() addresses specific concerns. On the network I can speak for, we look forward to having epoll() on Linux for the same reason. >> Yes, of course. The time spent inside poll/select becomes a PITA when you >> start dealing with huge number of fds. And this is kernel time. This does >> not obviously mean that if epoll is 10 times faster than poll under load, >> and you switch your app on epoll, it'll be ten times faster. It means that >> the kernel time spent inside poll will be 1/10. And many of the operations >> done by poll require IRQ locks and this increase the time the kernel >> spend with disabled IRQs, that is never a good thing. > > My experience has been that this is a huge problem with Linux but not with > any other OS. It can be solved in user-space with some other penalities by > an adaptive sleep before each call to 'poll' and polling with a zero timeout > (thus avoiding the wait queue pain). But all the deficiencies in the 'poll' > implementation in the world won't show anything except that 'poll' is badly > implemented. Your experience must be unique, because many people have seen poll()'s inactive-client overhead cause CPU wastage problems on non-Linux OSes (for me, FreeBSD and Solaris). poll() may be badly implemented on Linux or not, but it shares a design flaw with select(): that the application must specify the list of FDs for each system call, no matter how few change per call. That is the design flaw that epoll() addresses. If you truly believe that poll()'s implementation is so flawed, please provide an improved implementation. To put it another way, all the optimizations in the world for a 'poll' implementation won't sustain it unless you understand the flaw in its specification. The specification requires inefficient use of CPU for very common situations. Michael Poole ^ permalink raw reply [flat|nested] 58+ messages in thread
* RE: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 23:52 ` Entrope @ 2003-07-14 6:14 ` David Schwartz 2003-07-14 7:20 ` Jamie Lokier 0 siblings, 1 reply; 58+ messages in thread From: David Schwartz @ 2003-07-14 6:14 UTC (permalink / raw) To: Entrope; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List > Your argument is bogus. My first-hand experience is with IRC servers, > which customarily have thousands of connections at once, with a very > few percent active in a given check. The scaling problem is not with > the length of waiting or how poll() is optimized -- it is with the > overhead *inherent* to processing poll(). Common IRC servers spend > 100% of CPU when using poll() for only a few thousand clients. Those > same servers, using FreeBSD's kqueue()/kevent() API, use well under > 10% of the CPU. My argument is not bogus, you just don't understand it. Two algorithms can scale at the same order and yet one can perform much better than the other. Poll, for example, could use 10% of the CPU at 100 users and 100% at 1000 users. While kqueue/kevent could use .01% at 100 users and .2% at 1000 users. With these numbers, poll is much more scalable (its CPU usage went up by a factor of 10 while kqueue went up by a factor of 20) yet kqueue will outperform poll. I am specifically talking about scalability, in the compsci sense. I grant that epoll will use less CPU than poll in every configuration. > Yes, the amount of time spent doing useful work increases as the > poll() load increases -- but the time wasted setting up and checking > activity for poll() is something you can never reclaim, and which only > goes up as your CPU gets faster. epoll() makes you pay the cost of > updating the interest list only when the list changes; poll() makes > you pay the cost every time you call it. I know what epoll *is*. > Empirically, four of the five biggest IRC networks run server software > that prefers kqueue() on FreeBSD. 
kqueue() did not cause them to be > large, but using kqueue() addresses specific concerns. On the network > I can speak for, we look forward to having epoll() on Linux for the > same reason. Wonderful, now please show me where the error in my argument is. > > My experience has been that this is a huge problem with > > Linux but not with > > any other OS. It can be solved in user-space with some other > > penalities by > > an adaptive sleep before each call to 'poll' and polling with a > > zero timeout > > (thus avoiding the wait queue pain). But all the deficiencies > > in the 'poll' > > implementation in the world won't show anything except that > > 'poll' is badly > > implemented. > Your experience must be unique, because many people have seen poll()'s > inactive-client overhead cause CPU wastage problems on non-Linux OSes > (for me, FreeBSD and Solaris). References please? And again, artificial cases where the number of active descriptors were held constant while the number of inactive descriptors were increased do not count. > poll() may be badly implemented on Linux or not, but it shares a > design flaw with select(): that the application must specify the list > of FDs for each system call, no matter how few change per call. That > is the design flaw that epoll() addresses. I know that. What does that have to do with anything? Are you even reading what I'm writing? > If you truly believe that > poll()'s implementation is so flawed, please provide an improved > implementation. It's trivial to make the optimizations that my post very obviously suggests. One would be to defer creating the wait queue until it's clear we need to wait. The problem is, these optimizations would harm the low-load and no-load cases and it's my understanding from the last time this was discussed that changes that worsen the 'most common' case will be refused even if they improve the 'high load' case. 
> To put it another way, all the optimizations in the world for a 'poll' > implementation won't sustain it unless you understand the flaw in its > specification. The specification requires inefficient use of CPU for > very common situations. Fine, but show me how that flaw impacts scalability. Please read what I said again, I understand that 'epoll' is superior to 'poll'. I'm specifically disputing whether or not 'epoll' has a specific mathematical characteristic. DS ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-14 6:14 ` David Schwartz @ 2003-07-14 7:20 ` Jamie Lokier 0 siblings, 0 replies; 58+ messages in thread From: Jamie Lokier @ 2003-07-14 7:20 UTC (permalink / raw) To: David Schwartz Cc: Entrope, Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List David, you are pointing out that given a certain base assumption, namely that the number of active descriptors is proportional to the number of inactive descriptors, epoll does not scale any differently to poll. Correct. Davide and I have pointed out that the key difference between select/poll and epoll, mathematically, is that epoll time is bounded by the number of active descriptors, and select/poll time is not. Also correct. (The latter is a logically stronger statement than the former, because it has fewer assumptions, however that doesn't make its increased range of applicability relevant.) Now if you look at web server benchmarks and other artificial benchmarks, rumour has it that epoll-based server CPU usage increases linearly with load, and pages served increases linearly with the number of requests, until the point where the server is maxed, after which both these observables remain roughly constant with increasing load. Similar rumours have it that select/poll-based servers' CPU usage increases in the same way, but not only do the observables increase faster (irrelevant for this discussion), when the server is maxed its number of pages served decreases (badly) with increasing load. In this sense, it's useful to refer to epoll as more scalable: not the linear part at the beginning, but later, when resources are exhausted. Load here is defined as number of concurrent client connections. In fact, the number of active descriptors _is_ less than proportional to the number of idle descriptors in this state, because the slower responses act as a form of flow control on the rate of new connections or new data coming in to the server. 
If instead you define a test in terms of increasing the rate of incoming connections, as if the clients are oblivious to each other, then your point might be spot on. But that is a different kind of test, and it doesn't take away from the validity of the first kind. -- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 23:05 ` David Schwartz 2003-07-13 23:11 ` Davide Libenzi 2003-07-13 23:52 ` Entrope @ 2003-07-14 1:51 ` Jamie Lokier 2003-07-14 6:14 ` David Schwartz 2003-07-15 20:27 ` James Antill 3 siblings, 1 reply; 58+ messages in thread From: Jamie Lokier @ 2003-07-14 1:51 UTC (permalink / raw) To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List David Schwartz wrote: > This is really just due to bad coding in 'poll', or more precisely very bad > for this case. For example, why is it allocating a wait queue buffer if the > odds that it will need to wait are basically zero? Why is it adding file > descriptors to the wait queue before it has determined that it needs to > wait? Pfeh! That's just tweaking. If you really want to optimise 'poll', maintain a per-task event interest set like epoll does (you could use the epoll infrastructure), and on each call to 'poll' just apply the differences between the interest set and whatever is passed to poll. That would actually reduce the number of calls to per-fd f_op->poll() methods to O(active), making the internal overhead of 'poll' and 'select' comparable with epoll. I'd not be surprised if someone has done it already. I heard of a "scalable poll" patch some years ago. That leaves the overhead of the API, which is O(interested) but it is a much lower overhead factor than the calls to f_op->poll(). Enjoy, -- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
* RE: [Patch][RFC] epoll and half closed TCP connections 2003-07-14 1:51 ` Jamie Lokier @ 2003-07-14 6:14 ` David Schwartz 0 siblings, 0 replies; 58+ messages in thread From: David Schwartz @ 2003-07-14 6:14 UTC (permalink / raw) To: Jamie Lokier; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List > David Schwartz wrote: > > This is really just due to bad coding in 'poll', or more > > precisely very bad > > for this case. For example, why is it allocating a wait queue > > buffer if the > > odds that it will need to wait are basically zero? Why is it adding file > > descriptors to the wait queue before it has determined that it needs to > > wait? > Pfeh! That's just tweaking. Nonetheless, it's embarrassing for Linux that performance shoots up if you replace a normal call to 'poll' with a sleep followed by a call to 'poll' with a zero wait. But that's peripheral to my point. > If you really want to optimise 'poll', maintain a per-task event > interest set like epoll does (you could use the epoll infrastructure), > and on each call to 'poll' just apply the differences between the > interest set and whatever is passed to poll. > > That would actually reduce the number of calls to per-fd f_op->poll() > methods to O(active), making the internal overhead of 'poll' and > 'select' comparable with epoll. > > I'd not be surprised if someone has done it already. I heard of a > "scalable poll" patch some years ago. > > That leaves the overhead of the API, which is O(interested) but it is > a much lower overhead factor than the calls to f_op->poll(). Definitely, the API will always guarantee performance is O(n). If you're interested in ultimate scalability, you can never exceed O(n) with 'poll'. But my point is that you can never exceed O(n) with any discovery implementation because the number of sockets to be discovered scales at O(n), and you have to do something per socket. This doesn't change the fact that 'epoll' outperforms 'poll' at every conceivable situation. 
It also doesn't change the fact that edge-based triggering allows some phenomenal optimizations in multi-threaded code. It also doesn't change the fact that Linux's 'poll' implementation is not so good for the high-load, busy server case. All I'm trying to say is that the argument that 'epoll' scales at a better order than 'poll' is bogus. They both scale at O(n) where 'n' is the number of connections you have. No conceivable implementation could scale any better. DS ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 23:05 ` David Schwartz ` (2 preceding siblings ...) 2003-07-14 1:51 ` Jamie Lokier @ 2003-07-15 20:27 ` James Antill 2003-07-16 1:46 ` David Schwartz 3 siblings, 1 reply; 58+ messages in thread From: James Antill @ 2003-07-15 20:27 UTC (permalink / raw) To: David Schwartz; +Cc: Davide Libenzi, Eric Varsanyi, Linux Kernel Mailing List "David Schwartz" <davids@webmaster.com> writes: > This is really just due to bad coding in 'poll', or more precisely very bad > for this case. For example, why is it allocating a wait queue buffer if the > odds that it will need to wait are basically zero? Why is it adding file > descriptors to the wait queue before it has determined that it needs to > wait? Because this is much easier to do in userspace, it's just not very well documented that you should almost always call poll() with a zero timeout first. However it's been there for years, and things have used it[1]. There are still optimizations that could have been done to poll() to speed it up but Linus has generally refused to add them. [1] http://www.and.org/socket_poll/ -- # James Antill -- james@and.org :0: * ^From: .*james@and\.org /dev/null ^ permalink raw reply [flat|nested] 58+ messages in thread
* RE: [Patch][RFC] epoll and half closed TCP connections 2003-07-15 20:27 ` James Antill @ 2003-07-16 1:46 ` David Schwartz 2003-07-16 2:09 ` James Antill 0 siblings, 1 reply; 58+ messages in thread From: David Schwartz @ 2003-07-16 1:46 UTC (permalink / raw) To: James Antill; +Cc: Linux Kernel Mailing List > "David Schwartz" <davids@webmaster.com> writes: > > This is really just due to bad coding in 'poll', or more > > precisely very bad > > for this case. For example, why is it allocating a wait queue > > buffer if the > > odds that it will need to wait are basically zero? Why is it adding file > > descriptors to the wait queue before it has determined that it needs to > > wait? > Because this is much easier to do in userspace, it's just not very > well documented that you should almost always call poll() with a zero > timeout first. It's neither easier to do nor harder, it's basically the same code in either place. However, doing it in kernel space saves the extra user/kernel transition, poll set allocations, and copies across the u/k boundary in the case where we do actually need to wait. > However it's been there for years, and things have used > it[1]. The thing is, for some reason it (it being the cost of calling poll with a constant timeout for 1,024 file descriptors) is exceptionally bad on Linux. Worse than every other OS I've tested. > There are still optimizations that could have been done to poll() to > speed it up but Linus has generally refused to add them. Yep, so we invent new APIs to fix the deficiencies in the most common API's implementation. Whatever. DS ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-16 1:46 ` David Schwartz @ 2003-07-16 2:09 ` James Antill 0 siblings, 0 replies; 58+ messages in thread From: James Antill @ 2003-07-16 2:09 UTC (permalink / raw) To: David Schwartz; +Cc: Linux Kernel Mailing List "David Schwartz" <davids@webmaster.com> writes: > > "David Schwartz" <davids@webmaster.com> writes: > > > > This is really just due to bad coding in 'poll', or more > > > precisely very bad > > > for this case. For example, why is it allocating a wait queue > > > buffer if the > > > odds that it will need to wait are basically zero? Why is it adding file > > > descriptors to the wait queue before it has determined that it needs to > > > wait? > > > Because this is much easier to do in userspace, it's just not very > > well documented that you should almost always call poll() with a zero > > timeout first. > > It's neither easier to do nor harder, it's basically the same code in > either place. However, doing it in kernel space saves the extra user/kernel > transition, poll set allocations, and copies across the u/k boundary in the > case where we do actually need to wait. Optimizing for the waiting case doesn't sound like the right approach, IMO. And all things being equal, doing it outside the kernel rules. Plus it's possible that someone could come up with a case where you don't want to do it. > > However it's been there for years, and things have used > > it[1]. > > The thing is, for some reason it (it being the cost of calling poll with a > constant timeout for 1,024 file descriptors) is exceptionally bad on Linux. > Worse than every other OS I've tested. I'd put money on that being drastically reduced if the allocation didn't happen every call though. > > There are still optimizations that could have been done to poll() to > > speed it up but Linus has generally refused to add them. > > Yep, so we invent new APIs to fix the deficiencies in the most common API's > implementation. 
> Whatever.

No, we know that there are conditions where whatever you do to poll() the latency kills you. And to fix that we need new APIs. Personally I'd prefer to have poll() be as good a level triggered event mechanism as it could be and have epoll just as good for edge triggered events as it could be ... but I'm not the one you need to convince. -- # James Antill -- james@and.org :0: * ^From: .*james@and\.org /dev/null ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-12 20:51 ` Eric Varsanyi 2003-07-12 20:48 ` Davide Libenzi @ 2003-07-13 13:12 ` Jamie Lokier 2003-07-13 16:55 ` Davide Libenzi 1 sibling, 1 reply; 58+ messages in thread From: Jamie Lokier @ 2003-07-13 13:12 UTC (permalink / raw) To: Eric Varsanyi; +Cc: linux-kernel, davidel Eric Varsanyi wrote: > > Well then, use epoll's level-triggered mode. It's quite easy - it's > > the default now. :) > > The problem with all the level triggered schemes (poll, select, epoll w/o > EPOLLET) is that they call every driver and poll status for every call into > the kernel. This appeared to be killing my app's performance and I verified > by writing some simple micro benchmarks. OH! :-O Level-triggered epoll_wait() time _should_ be scalable - proportional to the number of ready events, not the number of listening events. If this is not the case then it's a bug in epoll. In principle, you will see a large delay only if you don't handle those events (e.g. by calling read() on each ready fd), so that they are still ready. Reading the code in eventpoll.c et al, I think that some time will be taken for fds that are transitioning on events which you're not interested in. Notably, each time a TCP segment is sent and acknowledged by the other end, poll-waiters are woken, your task will be woken and do some work in epoll_wait(), but no events are returned if you are only listening for read availability. I'm not 100% sure of this, but tracing through skb->destructor -> sock_wfree() -> tcp_write_space() -> wake_up_interruptible() -> ep_poll_callback() it looks as though _every_ TCP ACK you receive will cause epoll to wake up a task which is interested in _any_ socket events, but then in <context switch> ep_poll() -> ep_events_transfer() -> ep_send_events() no events are transferred, so ep_poll() will loop and try again. 
This is quite unfortunate if true, as many of the apps which need to scale write a lot of segments without receiving very much. > As we start to scale up to production sized fd sets it gets crazy: around > 8000 completely idle fd's the cost is 4ms per syscall. At this point > even a high real load (which gathers lots of I/O per call) doesn't cover the > now very high latency for each trip into the kernel to gather more work. It should only be 4ms per syscall if it's actually returning ~8000 ready events. If you're listening to 8000 but only, say, 10 are ready, it should be fast. > What was interesting is the response time was non-linear up to around 400-500 > fd's, then went steep and linear after that, so you pay much more (maybe due > to some cache effects, I didn't pursue) for each connecting client in a light > load environment. > This is not web traffic, the clients typically connect and sit mostly idle. Can you post your code? (Btw, I don't disagree with POLLRDHUP - I think it's a fine idea. I'd use it. It'd be unfortunate if it only worked with some socket types and was not set by others, though. Global search and replace POLLHUP with "POLLHUP | POLLRDHUP" in most setters? Following that a bit further, we might as well admit that POLLHUP should be called POLLWRHUP.) -- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
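For reference, the half-close detection this proposal enables can be sketched with the event bit it eventually shipped as (EPOLLRDHUP, merged much later in Linux 2.6.17). The helper names below are invented, and error handling is trimmed:

```c
#include <sys/epoll.h>
#include <sys/socket.h>

/* Register a connected socket for read readiness plus peer write-side
 * shutdown, edge-triggered.  Hypothetical helper, not the patch itself. */
int watch_for_rdhup(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLET };
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Wait for one event; sets *half_closed when the peer did
 * shutdown(SHUT_WR), even if the same event also carries EPOLLIN. */
int wait_one(int epfd, int *half_closed)
{
    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, 1000);
    if (n == 1)
        *half_closed = (out.events & EPOLLRDHUP) != 0;
    return n;
}
```

The key point is exactly the race described in the thread: a peer that writes data and then shuts down produces a single edge, and without the extra bit the EPOLLIN event is indistinguishable from plain data arrival.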
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 13:12 ` Jamie Lokier @ 2003-07-13 16:55 ` Davide Libenzi 0 siblings, 0 replies; 58+ messages in thread From: Davide Libenzi @ 2003-07-13 16:55 UTC (permalink / raw) To: Jamie Lokier; +Cc: Eric Varsanyi, Linux Kernel Mailing List On Sun, 13 Jul 2003, Jamie Lokier wrote: > Eric Varsanyi wrote: > > > Well then, use epoll's level-triggered mode. It's quite easy - it's > > > the default now. :) > > > > The problem with all the level triggered schemes (poll, select, epoll w/o > > EPOLLET) is that they call every driver and poll status for every call into > > the kernel. This appeared to be killing my app's performance and I verified > > by writing some simple micro benchmarks. > > OH! :-O > > Level-triggered epoll_wait() time _should_ be scalable - proportional > to the number of ready events, not the number of listening events. If > this is not the case then it's a bug in epoll. Jamie, he is talking about select here. > Reading the code in eventpoll.c et al, I think that some time will > be taken for fds that are transitioning on events which you're not > interested in. Notably, each time a TCP segment is sent and > acknowledged by the other end, poll-waiters are woken, your task will > be woken and do some work in epoll_wait(), but no events are returned > if you are only listening for read availability. > > I'm not 100% sure of this, but tracing through > > skb->destructor > -> sock_wfree() > -> tcp_write_space() > -> wake_up_interruptible() > -> ep_poll_callback() > > it looks as though _every_ TCP ACK you receive will cause epoll to wake up > a task which is interested in _any_ socket events, but then in > > <context switch> > ep_poll() > -> ep_events_transfer() > -> ep_send_events() > > no events are transferred, so ep_poll() will loop and try again. This > is quite unfortunate if true, as many of the apps which need to scale > write a lot of segments without receiving very much. 
That's true, it is the beauty of the poll hook ;) I said this a long time ago. It is addressable by a wake_up_mask() and some code all around. I did not see (as long as others didn't) any performance impact because of this, with throughput that remained steadily flat under any ratio of hot/cold fds. Since it is easily addressable and will not require an API change, I'd rather wait for someone to report a real (or even unreal) load that makes epoll not flat-scale. - Davide ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-12 18:16 [Patch][RFC] epoll and half closed TCP connections Eric Varsanyi 2003-07-12 19:44 ` Jamie Lokier @ 2003-07-12 20:01 ` Davide Libenzi 2003-07-13 5:24 ` David S. Miller 1 sibling, 1 reply; 58+ messages in thread From: Davide Libenzi @ 2003-07-12 20:01 UTC (permalink / raw) To: Eric Varsanyi; +Cc: David Miller, Linux Kernel Mailing List [Cc:ing DaveM ] On Sat, 12 Jul 2003, Eric Varsanyi wrote: > I'm proposing adding a new POLL event type (POLLRDHUP) as way to solve > a new race introduced by having an edge triggered event mechanism > (epoll). The problem occurs when a client writes data and then does a > write side shutdown(). The server (using epoll) sees only one event for > the read data ready and the read EOF condition and has no way to tell > that an EOF occurred. > > -Eric Varsanyi > > Details > ----------- > - remote sends data and does a shutdown > [ we see a data bearing packet and FIN from client on the wire ] > > - user mode server gets around to doing accept() and registers > for EPOLLIN events (along with HUP and ERR which are forced on) > > - epoll_wait() returns a single EPOLLIN event on the FD representing > both the 1/2 shutdown state and data available > > At this point there is no way the app can tell if there is a half closed > connection so it may issue a close() back to the client after writing > results. Normally the server would distinguish these events by assuming > EOF if it got a read ready indication and the first read returned 0 bytes, > or would issue read calls until less data was returned than was asked for. > > In a level triggered world this all just works because the read ready > indication is driven back to the app as long as the socket state is half > closed. The event driven epoll mechanism folds these two indications > together and thus loses one 'edge'. 
> > One would be tempted to issue an extra read() after getting back less than > expected, but this is an extra system call on every read event and you get > back the same '0' bytes that you get if the buffer is just empty. The only > sure bet seems to be CTL_MODding the FD to force a re-poll (which would > cost a syscall and hash-lookup in eventpoll for every read event). > Yes, this is overhead that should be avoided. It is true that you could use Level Triggered events, but if you structured your app on edge you should be able to solve this w/out overhead. > 2) add a new 1/2 closed event type that a poll routine can return > > The implementation is trivial, a patch is included below. If this idea sees > favor I'll fix the other architectures, ipv6, epoll.h, and make a 'real' > patch. I do not believe any drivers deserve to be modified to return this > new event. This looks good to me. David what do you think ? > diff -Naur linux-2.4.20/include/asm-i386/poll.h linux-2.4.20_ev/include/asm-i386/poll.h > --- linux-2.4.20/include/asm-i386/poll.h Thu Jan 23 13:01:28 1997 > +++ linux-2.4.20_ev/include/asm-i386/poll.h Sat Jul 12 12:29:11 2003 > @@ -15,6 +15,7 @@ > #define POLLWRNORM 0x0100 > #define POLLWRBAND 0x0200 > #define POLLMSG 0x0400 > +#define POLLRDHUP 0x0800 > > struct pollfd { > int fd; > diff -Naur linux-2.4.20/net/ipv4/tcp.c linux-2.4.20_ev/net/ipv4/tcp.c > --- linux-2.4.20/net/ipv4/tcp.c Tue Jul 8 09:40:42 2003 > +++ linux-2.4.20_ev/net/ipv4/tcp.c Sat Jul 12 12:29:56 2003 > @@ -424,7 +424,7 @@ > if (sk->shutdown == SHUTDOWN_MASK || sk->state == TCP_CLOSE) > mask |= POLLHUP; > if (sk->shutdown & RCV_SHUTDOWN) > - mask |= POLLIN | POLLRDNORM; > + mask |= POLLIN | POLLRDNORM | POLLRDHUP; > > /* Connected? */ > if ((1 << sk->state) & ~(TCPF_SYN_SENT|TCPF_SYN_RECV)) { > - Davide ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-12 20:01 ` Davide Libenzi @ 2003-07-13 5:24 ` David S. Miller 2003-07-13 14:07 ` Jamie Lokier 2003-07-14 17:09 ` [Patch][RFC] epoll and half closed TCP connections kuznet 0 siblings, 2 replies; 58+ messages in thread From: David S. Miller @ 2003-07-13 5:24 UTC (permalink / raw) To: Davide Libenzi; +Cc: e0206, linux-kernel, kuznet On Sat, 12 Jul 2003 13:01:21 -0700 (PDT) Davide Libenzi <davidel@xmailserver.org> wrote: > > [Cc:ing DaveM ] [Cc:ing Alexey :-) ] Alexey, they seem to want to add some kind of POLLRDHUP thing, comments wrt. TCP and elsewhere in the networking? See below... > On Sat, 12 Jul 2003, Eric Varsanyi wrote: > > > I'm proposing adding a new POLL event type (POLLRDHUP) as way to solve > > a new race introduced by having an edge triggered event mechanism > > (epoll). The problem occurs when a client writes data and then does a > > write side shutdown(). The server (using epoll) sees only one event for > > the read data ready and the read EOF condition and has no way to tell > > that an EOF occurred. > > > > -Eric Varsanyi > > > > Details > > ----------- > > - remote sends data and does a shutdown > > [ we see a data bearing packet and FIN from client on the wire ] > > > > - user mode server gets around to doing accept() and registers > > for EPOLLIN events (along with HUP and ERR which are forced on) > > > > - epoll_wait() returns a single EPOLLIN event on the FD representing > > both the 1/2 shutdown state and data available > > > > At this point there is no way the app can tell if there is a half closed > > connection so it may issue a close() back to the client after writing > > results. Normally the server would distinguish these events by assuming > > EOF if it got a read ready indication and the first read returned 0 bytes, > > or would issue read calls until less data was returned than was asked for. 
> > > > In a level triggered world this all just works because the read ready > > indication is driven back to the app as long as the socket state is half > > closed. The event driven epoll mechanism folds these two indications > > together and thus loses one 'edge'. > > > > One would be tempted to issue an extra read() after getting back less than > > expected, but this is an extra system call on every read event and you get > > back the same '0' bytes that you get if the buffer is just empty. The only > > sure bet seems to be CTL_MODding the FD to force a re-poll (which would > > cost a syscall and hash-lookup in eventpoll for every read event). > > > > Yes, this is overhead that should be avoided. It is true that you could > use Level Triggered events, but if you structured your app on edge you > should be able to solve this w/out overhead. > > > > > 2) add a new 1/2 closed event type that a poll routine can return > > > > The implementation is trivial, a patch is included below. If this idea sees > > favor I'll fix the other architectures, ipv6, epoll.h, and make a 'real' > > patch. I do not believe any drivers deserve to be modified to return this > > new event. > > This looks good to me. David what do you think ? 
> > > > > diff -Naur linux-2.4.20/include/asm-i386/poll.h linux-2.4.20_ev/include/asm-i386/poll.h > > --- linux-2.4.20/include/asm-i386/poll.h Thu Jan 23 13:01:28 1997 > > +++ linux-2.4.20_ev/include/asm-i386/poll.h Sat Jul 12 12:29:11 2003 > > @@ -15,6 +15,7 @@ > > #define POLLWRNORM 0x0100 > > #define POLLWRBAND 0x0200 > > #define POLLMSG 0x0400 > > +#define POLLRDHUP 0x0800 > > > > struct pollfd { > > int fd; > > diff -Naur linux-2.4.20/net/ipv4/tcp.c linux-2.4.20_ev/net/ipv4/tcp.c > > --- linux-2.4.20/net/ipv4/tcp.c Tue Jul 8 09:40:42 2003 > > +++ linux-2.4.20_ev/net/ipv4/tcp.c Sat Jul 12 12:29:56 2003 > > @@ -424,7 +424,7 @@ > > if (sk->shutdown == SHUTDOWN_MASK || sk->state == TCP_CLOSE) > > mask |= POLLHUP; > > if (sk->shutdown & RCV_SHUTDOWN) > > - mask |= POLLIN | POLLRDNORM; > > + mask |= POLLIN | POLLRDNORM | POLLRDHUP; > > > > /* Connected? */ > > if ((1 << sk->state) & ~(TCPF_SYN_SENT|TCPF_SYN_RECV)) { > > > > > > - Davide ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 5:24 ` David S. Miller @ 2003-07-13 14:07 ` Jamie Lokier 2003-07-13 17:00 ` Davide Libenzi 2003-07-14 17:09 ` [Patch][RFC] epoll and half closed TCP connections kuznet 1 sibling, 1 reply; 58+ messages in thread From: Jamie Lokier @ 2003-07-13 14:07 UTC (permalink / raw) To: David S. Miller; +Cc: Davide Libenzi, e0206, linux-kernel, kuznet David S. Miller wrote:

> Alexey, they seem to want to add some kind of POLLRDHUP thing,
> comments wrt. TCP and elsewhere in the networking? See below...

POLLHUP is a mess. It means different things according to the type of fd, precisely because it is considered an unmaskable event for the poll() API, so the standard meaning isn't useful for sockets. (See the comments in tcp_poll()).

POLLRDHUP makes sense because it could actually have a well-defined meaning: set iff reading the fd would return EOF.

However, if a program is waiting on POLLRDHUP, you don't want the program to have to say "if this fd is a TCP socket then listen for POLLRDHUP else if this fd is another kind of socket call read to detect EOF else listen for POLLHUP". Programs have enough version-specific special cases as it is.

So I suggest:

- Everywhere that POLLHUP is currently set in a driver, socket etc. it should set POLLRDHUP|POLLHUP - unless it specifically knows about POLLRDHUP as in TCP (and presumably UDP, SCTP etc).

-- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 14:07 ` Jamie Lokier @ 2003-07-13 17:00 ` Davide Libenzi 2003-07-13 19:15 ` Jamie Lokier 0 siblings, 1 reply; 58+ messages in thread From: Davide Libenzi @ 2003-07-13 17:00 UTC (permalink / raw) To: Jamie Lokier; +Cc: David S. Miller, e0206, Linux Kernel Mailing List, kuznet On Sun, 13 Jul 2003, Jamie Lokier wrote: > David S. Miller wrote: > > Alexey, they seem to want to add some kind of POLLRDHUP thing, > > comments wrt. TCP and elsewhere in the networking? See below... > > POLLHUP is a mess. It means different things according to the type of > fd, precisely because it is considered an unmaskeable event for the > poll() API so the standard meaning isn't useful for sockets. (See the > comments in tcp_poll()). > > POLLRDHUP makes sense because it could actually have a well-defined > meaning: set iff reading the fd would return EOF. > > However, if a program is waiting on POLLRDHUP, you don't want the > program to have to say "if this fd is a TCP socket then listen for > POLLRDHUP else if this fd is another kind of socket call read to > detect EOF else listen for POLLHUP". Programs have enough > version-specific special cases as it is. > > So I suggest: > > - Everywhere that POLLHUP is currently set in a driver, socket etc. > it should set POLLRDHUP|POLLHUP - unless it specifically knows > about POLLRDHUP as in TCP (and presumably UDP, SCTP etc). Returning POLLHUP to a caller waiting for POLLIN might break existing code IMHO. After ppl reporting the O_RDONLY|O_TRUNC case I'm inclined to expect everything from existing apps ;) POLLHUP should be returned to apps waiting for POLLOUT while POLLRDHUP to ones for POLLIN. - Davide ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 17:00 ` Davide Libenzi @ 2003-07-13 19:15 ` Jamie Lokier 2003-07-13 23:03 ` Davide Libenzi 0 siblings, 1 reply; 58+ messages in thread From: Jamie Lokier @ 2003-07-13 19:15 UTC (permalink / raw) To: Davide Libenzi; +Cc: David S. Miller, e0206, Linux Kernel Mailing List, kuznet Davide Libenzi wrote: > > However, if a program is waiting on POLLRDHUP, you don't want the > > program to have to say "if this fd is a TCP socket then listen for > > POLLRDHUP else if this fd is another kind of socket call read to > > detect EOF else listen for POLLHUP". Programs have enough > > version-specific special cases as it is. > > > > - Everywhere that POLLHUP is currently set in a driver, socket etc. > > it should set POLLRDHUP|POLLHUP - unless it specifically knows > > about POLLRDHUP as in TCP (and presumably UDP, SCTP etc). > > Returning POLLHUP to a caller waiting for POLLIN might break existing code > IMHO. Oh, agreed. I was(*) suggesting to add POLL_RDHUP to the set of events reported by e.g. AF_UNIX sockets which are half-closed. Not to add POLLHUP to anything! (Particularly as POLLHUP can't be ignored). > After ppl reporting the O_RDONLY|O_TRUNC case I'm inclined to expect > everything from existing apps ;) POLLHUP should be returned to apps > waiting for POLLOUT while POLLRDHUP to ones for POLLIN. Not sure exactly how you're thinking with that last sentence. At present, it's impossible for socket code to return POLLHUP only to apps which are waiting on POLLOUT - because POLLHUP is not maskable in sys_poll's API. Therefore sockets return POLLHUP only if they are closed in both directions. There is no way for a socket to return a HUP condition for someone who is waiting only to write, but fortunately that doesn't matter :) Back to the (*), (see above): (*) There aren't that many places which set POLLHUP; they divide into: sockets, ttys, SCSI-generic and PPP. 
The latter two are not important as they don't do half-close.

__The critical thing with POLL_RDHUP is that it is set if read() would return EOF after returning data.__

If this condition isn't met, then an app which is using POLL_RDHUP to optimise performance using epoll will hang occasionally.

Sockets are important: TCP is not the only thing to support half-closing. If an app is waiting for POLLRDHUP, and it doesn't know what kind of socket it has been given (e.g. AF_UNIX), the network stack had better return POLL_RDHUP when there's an EOF pending.

So we'd better add POLLRDHUP to all the socket types which do half-closing. For the rest, no change is required as POLLHUP is non-maskable :) (So apps should always say "if (events & (POLLHUP|POLLRDHUP)) check_for_eof()").

And ttys? They are problematic, because ttys can return EOF _after_ returning data without closing (and without being hung-up). An epoll loop which is reading a tty (and isn't programmed specially for a tty) _must_ receive POLLRDHUP when the EOF is pending, else it may hang.

In other words, POLLRDHUP is the wrong name: the correct name is POLLRDEOF.

-- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 19:15 ` Jamie Lokier @ 2003-07-13 23:03 ` Davide Libenzi 2003-07-14 1:41 ` Jamie Lokier 0 siblings, 1 reply; 58+ messages in thread From: Davide Libenzi @ 2003-07-13 23:03 UTC (permalink / raw) To: Jamie Lokier; +Cc: David S. Miller, e0206, Linux Kernel Mailing List, kuznet On Sun, 13 Jul 2003, Jamie Lokier wrote: > > After ppl reporting the O_RDONLY|O_TRUNC case I'm inclined to expect > > everything from existing apps ;) POLLHUP should be returned to apps > > waiting for POLLOUT while POLLRDHUP to ones for POLLIN. > > Not sure exactly how you're thinking with that last sentence. Brain farting, delete it ;) This is a nice page about POLLHUP treatment : http://www.greenend.org.uk/rjk/2001/06/poll.html > At present, it's impossible for socket code to return POLLHUP only to > apps which are waiting on POLLOUT - because POLLHUP is not maskable in > sys_poll's API. Therefore sockets return POLLHUP only if they are > closed in both directions. > > There is no way for a socket to return a HUP condition for someone who > is waiting only to write, but fortunately that doesn't matter :) Yes, this could be improved though. If we could only pass our event interest mask to f_op->poll() the function will be able to register it inside the wait queue structure and release only waiters that matches the available condition. > (*) There aren't that many places which set POLLHUP; they divide into: > sockets, ttys, SCSI-generic and PPP. The latter two are not important > as they don't do half-close. > > __The critical thing with POLL_RDHUP is that it is set if read() would > return EOF after returning data.__ > > If this condition isn't met, than an app which is using POLL_RDHUP to > optimise performance using epoll will hang occasionally. > > Sockets are important: TCP is not the only thing to support > half-closing. If an app is waiting for POLLRDHUP, and it doesn't know > what kind of socket it has been given (e.g. 
AF_UNIX), the network
> stack had better return POLL_RDHUP when there's an EOF pending.
>
> So we'd better add POLLRDHUP to all the socket types which do
> half-closing. For the rest, no change is required as POLLHUP is
> non-maskable :) (So apps should always say "if (events &
> (POLLHUP|POLLRDHUP)) check_for_eof()").
>
> And ttys? They are problematic, because ttys can return EOF _after_
> returning data without closing (and without being hung-up). An epoll
> loop which is reading a tty (and isn't programmed specially for a tty)
> _must_ receive POLLRDHUP when the EOF is pending, else it may hang.
>
> In other words, POLLRDHUP is the wrong name: the correct name is
> POLLRDEOF.

Please replace 'it may hang' with 'it may hang if it is using the read() return bytes check trick' ;)

- Davide ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch][RFC] epoll and half closed TCP connections 2003-07-13 23:03 ` Davide Libenzi @ 2003-07-14 1:41 ` Jamie Lokier 2003-07-14 2:24 ` POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections) Jamie Lokier 0 siblings, 1 reply; 58+ messages in thread From: Jamie Lokier @ 2003-07-14 1:41 UTC (permalink / raw) To: Davide Libenzi; +Cc: David S. Miller, e0206, Linux Kernel Mailing List, kuznet Davide Libenzi wrote: > Yes, this could be improved though. If we could only pass our event > interest mask to f_op->poll() the function will be able to register it > inside the wait queue structure and release only waiters that matches the > available condition. It's not a bad idea. > > And ttys? They are problematic, because ttys can return EOF _after_ > > returning data without closing (and without being hung-up). An epoll > > loop which is reading a tty (and isn't programmed specially for a tty) > > _must_ receive POLLRDHUP when the EOF is pending, else it may hang. > > Please replace 'it may hung' with 'it may hung if it is using the read() > return bytes check trick' ;) Sure - but take an app that is normally using TCP sockets and give it an AF_UNIX socket.. Something as general as the event loop _shouldn't_ have to depend on that subtlety. Ok that's avoidable, but it's a trap. It would be nice to get a flag that doesn't have a caveat in the manual saying "this flag only works (at present) on TCP sockets in kernels >= 2.5.76. Take care not to use the optimisation for any other kind of fd including other sockets, as it will break your app...". That's not the sort of thing I want to see in the epoll manual page :) Anyway, there is a correct answer and I have made the patch so wait for next mail... :) -- Jamie ^ permalink raw reply [flat|nested] 58+ messages in thread
* POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections) 2003-07-14 1:41 ` Jamie Lokier @ 2003-07-14 2:24 ` Jamie Lokier 2003-07-14 2:37 ` Davide Libenzi 0 siblings, 1 reply; 58+ messages in thread From: Jamie Lokier @ 2003-07-14 2:24 UTC (permalink / raw) To: Davide Libenzi Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet Jamie Lokier wrote: > Anyway, there is a correct answer and I have made the patch so wait > for next mail... :) The patch is included below. Note, it compiles and has been carefully scrutinised by a team of master coders, but never actually run. Eric, you may want to try it out :) The difference between this and POLLRDHUP is polarity, generality and correctness :) Eric, change the polarity of your test from if (events & POLLRDHUP) goto read_again; to if (!(events & POLLRDONCE)) goto read_again; and it should (fingers crossed) work well. Explanation ----------- I wrote: > The critical thing with POLL_RDHUP is that it is set if read() would > return EOF after returning data. I was mistaken. That condition isn't strong enough for an edge-triggered event loop. Some kinds of fd return a short read when there is more data pending, due to boundaries - not a HUP condition, but read-until-EAGAIN is required nonetheless. Ttys do it, some devices do it, and TCP does it if there's urgent data (though POLLPRI should also be set) and under some other conditions. I strongly dislike the idea that an application should know anything about the type of fd it is reading from, if all it's doing is reading. Also having to check kernel version would be ugly - an app which depended on POLLRDHUP would have to check the kernel version. After much typing (my first kernel patch in aeons :), the answer follows... ---------------------------------- PATCH: POLLRDONCE optimisation for epoll users The enclosed patch adds a POLLRDONCE condition. 
This condition means that it's _safe_ to read once before the next edge-triggered POLLIN event. Otherwise, you must call read() repeatedly until you see EAGAIN, if you are using edge-triggered epoll. Simply calling read() once is not guaranteed to re-arm the event. (If you are using level-triggered mode there is no problem, but edge-triggered mode is faster :) This distinguishes the case where one read() is enough on a TCP socket, from the case where a second read() is needed to fetch the EOF condition. Without this bit, applications must always call read() twice in the very common case that the second will return -EAGAIN. It is always safe for the application to ignore this bit, or if the bit is not set. So an application which looks for the bit will behave correctly unmodified on older kernels or other kinds of fd. This patch provides the optimisation for TCP sockets, pipes, fifos and SOCK_STREAM AF_UNIX sockets, which are the common fd types for high performance streaming applications. Enjoy, -- Jamie diff -ur orig-2.5.75/fs/pipe.c pollrdonce-2.5.75/fs/pipe.c --- orig-2.5.75/fs/pipe.c 2003-07-08 21:41:28.000000000 +0100 +++ pollrdonce-2.5.75/fs/pipe.c 2003-07-14 02:19:27.000000000 +0100 @@ -29,6 +29,9 @@ * * pipe_read & write cleanup * -- Manfred Spraul <manfred@colorfullife.com> 2002-05-09 + * + * Added POLLRDONCE. + * -- Jamie Lokier 2003-07-14 */ /* Drop the inode semaphore and wait for a pipe event, atomically */ @@ -250,6 +253,8 @@ /* Reading only -- no need for acquiring the semaphore. 
*/ mask = POLLIN | POLLRDNORM; + if (PIPE_WRITERS(*inode)) + mask |= POLLRDONCE; if (PIPE_EMPTY(*inode)) mask = POLLOUT | POLLWRNORM; if (!PIPE_WRITERS(*inode) && filp->f_version != PIPE_WCOUNTER(*inode)) diff -ur orig-2.5.75/include/asm-alpha/poll.h pollrdonce-2.5.75/include/asm-alpha/poll.h --- orig-2.5.75/include/asm-alpha/poll.h 2003-07-13 20:07:42.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-alpha/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -13,6 +13,7 @@ #define POLLWRBAND (1 << 9) #define POLLMSG (1 << 10) #define POLLREMOVE (1 << 11) +#define POLLRDONCE (1 << 12) struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-arm/poll.h pollrdonce-2.5.75/include/asm-arm/poll.h --- orig-2.5.75/include/asm-arm/poll.h 2003-07-08 21:42:07.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-arm/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -16,6 +16,7 @@ #define POLLWRBAND 0x0200 #define POLLMSG 0x0400 #define POLLREMOVE 0x1000 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-arm26/poll.h pollrdonce-2.5.75/include/asm-arm26/poll.h --- orig-2.5.75/include/asm-arm26/poll.h 2003-07-08 21:52:49.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-arm26/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -15,6 +15,7 @@ #define POLLWRNORM 0x0100 #define POLLWRBAND 0x0200 #define POLLMSG 0x0400 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-cris/poll.h pollrdonce-2.5.75/include/asm-cris/poll.h --- orig-2.5.75/include/asm-cris/poll.h 2003-07-12 17:57:34.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-cris/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -15,6 +15,7 @@ #define POLLWRBAND 512 #define POLLMSG 1024 #define POLLREMOVE 4096 +#define POLLRDONCE 8192 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-h8300/poll.h pollrdonce-2.5.75/include/asm-h8300/poll.h --- orig-2.5.75/include/asm-h8300/poll.h 2003-07-08 21:42:18.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-h8300/poll.h 2003-07-13 
22:35:21.000000000 +0100 @@ -12,6 +12,7 @@ #define POLLRDBAND 128 #define POLLWRBAND 256 #define POLLMSG 0x0400 +#define POLLRDONCE 2048 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-i386/poll.h pollrdonce-2.5.75/include/asm-i386/poll.h --- orig-2.5.75/include/asm-i386/poll.h 2003-07-08 21:42:26.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-i386/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -16,6 +16,7 @@ #define POLLWRBAND 0x0200 #define POLLMSG 0x0400 #define POLLREMOVE 0x1000 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-ia64/poll.h pollrdonce-2.5.75/include/asm-ia64/poll.h --- orig-2.5.75/include/asm-ia64/poll.h 2003-07-08 21:42:30.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-ia64/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -21,6 +21,7 @@ #define POLLWRBAND 0x0200 #define POLLMSG 0x0400 #define POLLREMOVE 0x1000 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-m68k/poll.h pollrdonce-2.5.75/include/asm-m68k/poll.h --- orig-2.5.75/include/asm-m68k/poll.h 2003-07-08 21:42:45.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-m68k/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -13,6 +13,7 @@ #define POLLWRBAND 256 #define POLLMSG 0x0400 #define POLLREMOVE 0x1000 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-mips/poll.h pollrdonce-2.5.75/include/asm-mips/poll.h --- orig-2.5.75/include/asm-mips/poll.h 2003-07-08 21:55:30.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-mips/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -17,6 +17,7 @@ /* These seem to be more or less nonstandard ... 
*/ #define POLLMSG 0x0400 #define POLLREMOVE 0x1000 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-mips64/poll.h pollrdonce-2.5.75/include/asm-mips64/poll.h --- orig-2.5.75/include/asm-mips64/poll.h 2003-07-08 21:55:33.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-mips64/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -17,6 +17,7 @@ /* These seem to be more or less nonstandard ... */ #define POLLMSG 0x0400 #define POLLREMOVE 0x1000 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-parisc/poll.h pollrdonce-2.5.75/include/asm-parisc/poll.h --- orig-2.5.75/include/asm-parisc/poll.h 2003-07-08 21:43:06.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-parisc/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -16,6 +16,7 @@ #define POLLWRBAND 0x0200 #define POLLMSG 0x0400 #define POLLREMOVE 0x1000 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-ppc/poll.h pollrdonce-2.5.75/include/asm-ppc/poll.h --- orig-2.5.75/include/asm-ppc/poll.h 2003-07-08 21:43:15.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-ppc/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -13,6 +13,7 @@ #define POLLWRBAND 0x0200 #define POLLMSG 0x0400 #define POLLREMOVE 0x1000 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-ppc64/poll.h pollrdonce-2.5.75/include/asm-ppc64/poll.h --- orig-2.5.75/include/asm-ppc64/poll.h 2003-07-08 21:43:15.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-ppc64/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -22,6 +22,7 @@ #define POLLWRBAND 0x0200 #define POLLMSG 0x0400 #define POLLREMOVE 0x1000 +#define POLLRDONCE 0x2000 struct pollfd { int fd; diff -ur orig-2.5.75/include/asm-s390/poll.h pollrdonce-2.5.75/include/asm-s390/poll.h --- orig-2.5.75/include/asm-s390/poll.h 2003-07-08 21:43:19.000000000 +0100 +++ pollrdonce-2.5.75/include/asm-s390/poll.h 2003-07-13 22:35:21.000000000 +0100 @@ -24,6 +24,7 @@ #define POLLWRBAND 0x0200 #define 
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000

 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-sh/poll.h pollrdonce-2.5.75/include/asm-sh/poll.h
--- orig-2.5.75/include/asm-sh/poll.h	2003-07-08 21:55:36.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-sh/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -16,6 +16,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000

 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-sparc/poll.h pollrdonce-2.5.75/include/asm-sparc/poll.h
--- orig-2.5.75/include/asm-sparc/poll.h	2003-07-08 21:43:28.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-sparc/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -13,6 +13,7 @@
 #define POLLWRBAND	256
 #define POLLMSG		512
 #define POLLREMOVE	1024
+#define POLLRDONCE	2048

 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-sparc64/poll.h pollrdonce-2.5.75/include/asm-sparc64/poll.h
--- orig-2.5.75/include/asm-sparc64/poll.h	2003-07-08 21:43:33.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-sparc64/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -13,6 +13,7 @@
 #define POLLWRBAND	256
 #define POLLMSG		512
 #define POLLREMOVE	1024
+#define POLLRDONCE	2048

 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-v850/poll.h pollrdonce-2.5.75/include/asm-v850/poll.h
--- orig-2.5.75/include/asm-v850/poll.h	2003-07-08 21:43:40.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-v850/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -13,6 +13,7 @@
 #define POLLWRBAND	0x0100
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000

 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/include/asm-x86_64/poll.h pollrdonce-2.5.75/include/asm-x86_64/poll.h
--- orig-2.5.75/include/asm-x86_64/poll.h	2003-07-08 21:43:45.000000000 +0100
+++ pollrdonce-2.5.75/include/asm-x86_64/poll.h	2003-07-13 22:35:21.000000000 +0100
@@ -16,6 +16,7 @@
 #define POLLWRBAND	0x0200
 #define POLLMSG		0x0400
 #define POLLREMOVE	0x1000
+#define POLLRDONCE	0x2000

 struct pollfd {
 	int fd;
diff -ur orig-2.5.75/net/ipv4/tcp.c pollrdonce-2.5.75/net/ipv4/tcp.c
--- orig-2.5.75/net/ipv4/tcp.c	2003-07-08 21:54:33.000000000 +0100
+++ pollrdonce-2.5.75/net/ipv4/tcp.c	2003-07-14 02:19:16.000000000 +0100
@@ -206,6 +206,7 @@
  *		lingertime == 0 (RFC 793 ABORT Call)
  *	Hirokazu Takahashi	:	Use copy_from_user() instead of
  *					csum_and_copy_from_user() if possible.
+ *	Jamie Lokier		:	Added POLLRDONCE.
  *
  *	This program is free software; you can redistribute it and/or
  *	modify it under the terms of the GNU General Public License
@@ -426,22 +427,39 @@
 	 *
 	 * NOTE. Check for TCP_CLOSE is added. The goal is to prevent
 	 * blocking on fresh not-connected or disconnected socket. --ANK
+	 *
+	 * NOTE. POLLRDONCE will be set _only_ if multiple read/recvmsg
+	 * calls are not required before the next edge-triggered epoll
+	 * wakeup. Typically multiple calls are needed when data+EOF is
+	 * pending; then the first read() is not enough to re-arm the
+	 * POLLIN event. -- Jamie Lokier
 	 */
 	if (sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE)
 		mask |= POLLHUP;
+	/*
+	 * RCV_SHUTDOWN is always set when an EOF condition becomes pending.
+	 */
 	if (sk->sk_shutdown & RCV_SHUTDOWN)
 		mask |= POLLIN | POLLRDNORM;

 	/* Connected? */
 	if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+
+		if (tp->urg_data & TCP_URG_VALID)
+			mask |= POLLPRI;
+
 		/* Potential race condition. If read of tp below will
 		 * escape above sk->sk_state, we can be illegally awaken
 		 * in SYN_* states.
 		 */
 		if ((tp->rcv_nxt != tp->copied_seq) &&
 		    (tp->urg_seq != tp->copied_seq ||
 		     tp->rcv_nxt != tp->copied_seq + 1 ||
-		     sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
-			mask |= POLLIN | POLLRDNORM;
+		     sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data)) {
+			if (mask == 0 && sk->sk_rcvlowat == INT_MAX)
+				mask = POLLIN | POLLRDNORM | POLLRDONCE;
+			else
+				mask |= POLLIN | POLLRDNORM;
+		}

 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
 			if (tcp_wspace(sk) >= tcp_min_write_space(sk)) {
@@ -459,9 +477,6 @@
 				mask |= POLLOUT | POLLWRNORM;
 			}
 		}
-
-		if (tp->urg_data & TCP_URG_VALID)
-			mask |= POLLPRI;
 	}
 	return mask;
 }
diff -ur orig-2.5.75/net/unix/af_unix.c pollrdonce-2.5.75/net/unix/af_unix.c
--- orig-2.5.75/net/unix/af_unix.c	2003-07-12 17:57:39.000000000 +0100
+++ pollrdonce-2.5.75/net/unix/af_unix.c	2003-07-14 02:33:25.000000000 +0100
@@ -50,6 +50,7 @@
  *	Arnaldo C. Melo	:	Remove MOD_{INC,DEC}_USE_COUNT,
  *				the core infrastructure is doing that
  *				for all net proto families now (2.5.69+)
+ *	Jamie Lokier	:	Added POLLRDONCE.
  *
  *
  * Known differences from reference BSD that was tested:
@@ -1784,15 +1785,21 @@
 	if (sk->sk_shutdown == SHUTDOWN_MASK)
 		mask |= POLLHUP;

-	/* readable? */
-	if (!skb_queue_empty(&sk->sk_receive_queue) ||
-	    (sk->sk_shutdown & RCV_SHUTDOWN))
-		mask |= POLLIN | POLLRDNORM;
-
 	/* Connection-based need to check for termination and startup */
 	if (sk->sk_type == SOCK_STREAM && sk->sk_state == TCP_CLOSE)
 		mask |= POLLHUP;

+	/* readable? */
+	if ((sk->sk_shutdown & RCV_SHUTDOWN))
+		mask |= POLLIN | POLLRDNORM;
+	else if (!skb_queue_empty(&sk->sk_receive_queue)) {
+		if (mask == 0 && sk->sk_type == SOCK_STREAM
+		    && sk->sk_rcvlowat == INT_MAX)
+			mask = POLLIN | POLLRDNORM | POLLRDONCE;
+		else
+			mask |= POLLIN | POLLRDNORM;
+	}
+
 	/*
 	 * we set writable also when the other side has shut down the
 	 * connection. This prevents stuck sockets.

^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Davide Libenzi @ 2003-07-14 2:37 UTC (permalink / raw)
To: Jamie Lokier
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Jamie Lokier wrote:
> > Anyway, there is a correct answer and I have made the patch so wait
> > for next mail... :)
>
> The patch is included below. Note, it compiles and has been carefully
> scrutinised by a team of master coders, but never actually run. Eric,
> you may want to try it out :)
>
> The difference between this and POLLRDHUP is polarity, generality and
> correctness :)
>
> Eric, change the polarity of your test from
>
>     if (events & POLLRDHUP)
>         goto read_again;
>
> to
>
>     if (!(events & POLLRDONCE))
>         goto read_again;
>
> and it should (fingers crossed) work well.

Ouch, I definitely liked more the POLLHUP thing. It is not linked to epoll
at all. Ok, suppose that our super-fast app chooses edge-triggered epoll
and also, aiming at top speed, uses the smart read(2) trick:

	void my_process_read(my_data *d, unsigned int events) {
		int n, s;

		do {
			s = d->buffer_size - d->in_buffer;
			if ((n = read(d->fd, d->buffer + d->in_buffer, s)) > 0) {
				process_partial_buffer(d, s);
				d->in_buffer += s;
			}
		} while (n == s);
		if (s == -1 && errno != EAGAIN) {
			handle_read_error(d);
			return;
		}
		if (events & EPOLLRDHUP) {
			d->flags |= HANGUP;
			schedule_removal(d);
		}
	}

Where this will break by using a POLLRDHUP ?

- Davide
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Davide Libenzi @ 2003-07-14 2:43 UTC (permalink / raw)
To: Jamie Lokier
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Sun, 13 Jul 2003, Davide Libenzi wrote:

> 	void my_process_read(my_data *d, unsigned int events) {
> 		int n, s;
>
> 		do {
> 			s = d->buffer_size - d->in_buffer;
> 			if ((n = read(d->fd, d->buffer + d->in_buffer, s)) > 0) {
> 				process_partial_buffer(d, s);
> 				d->in_buffer += s;
> 			}
> 		} while (n == s);
> 		if (s == -1 && errno != EAGAIN) {
> 			handle_read_error(d);
> 			return;
> 		}
> 		if (events & EPOLLRDHUP) {
> 			d->flags |= HANGUP;
> 			schedule_removal(d);
> 		}
> 	}

Ouch, this is obviously :

	void my_process_read(my_data *d, unsigned int events) {
		int n, s;

		do {
			s = d->buffer_size - d->in_buffer;
			if ((n = read(d->fd, d->buffer + d->in_buffer, s)) > 0) {
				process_partial_buffer(d, n);
				d->in_buffer += n;
			}
		} while (n == s);
		if (n == -1 && errno != EAGAIN) {
			handle_read_error(d);
			return;
		}
		if (events & EPOLLRDHUP) {
			d->flags |= HANGUP;
			schedule_removal(d);
		}
	}

- Davide
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Jamie Lokier @ 2003-07-14 2:56 UTC (permalink / raw)
To: Davide Libenzi
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> Where this will break by using a POLLRDHUP ?

It will break if

 (a) fd isn't a socket
 (b) fd isn't a TCP socket
 (c) kernel version <= 2.5.75
 (d) SO_RCVLOWAT < s
 (e) there is urgent data with OOBINLINE (I think)

-- Jamie
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Davide Libenzi @ 2003-07-14 3:02 UTC (permalink / raw)
To: Jamie Lokier
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > Where this will break by using a POLLRDHUP ?
>
> It will break if
>
> (a) fd isn't a socket
> (b) fd isn't a TCP socket
> (c) kernel version <= 2.5.75
> (d) SO_RCVLOWAT < s
> (e) there is urgent data with OOBINLINE (I think)

Jamie, did you smoke that stuff again ? :)
With Eric's patch in the proper places it is just fine. You just make
f_op->poll() report the extra flag other than POLLIN. What's the problem ?

- Davide
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Jamie Lokier @ 2003-07-14 3:16 UTC (permalink / raw)
To: Davide Libenzi
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> > > Where this will break by using a POLLRDHUP ?
> >
> > It will break if
> >
> > (a) fd isn't a socket
> > (b) fd isn't a TCP socket
> > (c) kernel version <= 2.5.75
> > (d) SO_RCVLOWAT < s
> > (e) there is urgent data with OOBINLINE (I think)
>
> Jamie, did you smoke that stuff again ? :)
> With Eric's patch in the proper places it is just fine. You just make
> f_op->poll() report the extra flag other than POLLIN. What's the problem ?

The problem in cases (a)-(e) is that your loop will call read() just once
when it needs to call read() until it sees EAGAIN.

What's wrong is the behaviour of your program when the extra flag
_isn't_ set.

-- Jamie
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Davide Libenzi @ 2003-07-14 3:21 UTC (permalink / raw)
To: Jamie Lokier
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > > > Where this will break by using a POLLRDHUP ?
> > >
> > > It will break if
> > >
> > > (a) fd isn't a socket
> > > (b) fd isn't a TCP socket
> > > (c) kernel version <= 2.5.75
> > > (d) SO_RCVLOWAT < s
> > > (e) there is urgent data with OOBINLINE (I think)
> >
> > Jamie, did you smoke that stuff again ? :)
> > With Eric's patch in the proper places it is just fine. You just make
> > f_op->poll() report the extra flag other than POLLIN. What's the problem ?
>
> The problem in cases (a)-(e) is that your loop will call read() just once
> when it needs to call read() until it sees EAGAIN.
>
> What's wrong is the behaviour of your program when the extra flag
> _isn't_ set.

Jamie, the loop will call read(2) until data is available. With the trick
of checking the returned number of bytes you can avoid the extra EAGAIN
read(2). That's the point of the read(2) trick. The final check for RDHUP
will tell that it has no more to wait for POLLINs since there's no more
someone sending.

- Davide
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Jamie Lokier @ 2003-07-14 3:42 UTC (permalink / raw)
To: Davide Libenzi
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> > > > (a) fd isn't a socket
> > > > (b) fd isn't a TCP socket
> > > > (c) kernel version <= 2.5.75
> > > > (d) SO_RCVLOWAT < s
> > > > (e) there is urgent data with OOBINLINE (I think)
> > >
> > > Jamie, did you smoke that stuff again ? :)
> > > With Eric's patch in the proper places it is just fine. You just make
> > > f_op->poll() report the extra flag other than POLLIN. What's the problem ?
> >
> > The problem in cases (a)-(e) is that your loop will call read() just once
> > when it needs to call read() until it sees EAGAIN.
> >
> > What's wrong is the behaviour of your program when the extra flag
> > _isn't_ set.
>
> Jamie, the loop will call read(2) until data is available. With the trick
> of checking the returned number of bytes you can avoid the extra EAGAIN
> read(2). That's the point of the read(2) trick.

That _only_ works if none of those conditions (a)-(e) applies.
Otherwise, short reads are possible when there is more to come.

Sure, if you're willing to assert that the program is running on kernel
>= 2.5.76, all its fds are for sure TCP sockets and you added the POLLPRI
check, then yes it's fine.

I think mine is better because it works always, and you are free to code
the optimisation in any programs, libraries etc.

> The final check for RDHUP will tell that it has no more to wait for
> POLLINs since there's no more someone sending.

Sure, _that_ check is fine.

-- Jamie
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Davide Libenzi @ 2003-07-14 4:00 UTC (permalink / raw)
To: Jamie Lokier
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> (a) fd isn't a socket
> (b) fd isn't a TCP socket

Jamie, libraries, like for example libevent, are completely generic indeed.
They fetch events and they call the associated callback. You obviously
know, inside your callback, which kind of fd you are working on. So you
use the reading function that best fits the fd type. Obviously the read(2)
trick only works for stream-type fds.

> (c) kernel version <= 2.5.75

Obviously, POLLRDHUP is not yet inside the kernel :)

> (d) SO_RCVLOWAT < s

This does not apply with non-blocking fds.

> (e) there is urgent data with OOBINLINE (I think)

You obviously need an EPOLLPRI check in your read handling routine if your
app is expecting urgent data.

- Davide
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Jamie Lokier @ 2003-07-14 5:51 UTC (permalink / raw)
To: Davide Libenzi
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Oh! I realised why we're mis-communicating. I started from the assumption
that you're using POLLRDHUP to trigger a second read, but you're using it
in a different way (makes sense given the name!)

You're assuming that a short read is always the last one with data, that's
the read() trick, and detecting EOF is really a wholly separate problem.
Ah..

First let's get the misc bits out of the way.

Davide Libenzi wrote:
> > (d) SO_RCVLOWAT < s
>
> This does not apply with non-blocking fds.

Look at the line "if (copied >= target)" in tcp_recvmsg.

> > (e) there is urgent data with OOBINLINE (I think)
>
> You obviously need an EPOLLPRI check in your read handling routine if
> your app is expecting urgent data.

Normal behaviour is for urgent data to be discarded, I believe. Now
if someone sends it to you, you'll end up with the socket stalling
with pending data in the buffers. Not saying whether you care, it's
just a difference of behaviour to be noted and a potential DoS
(filling socket buffers which the app doesn't know to empty).

Now on to the important stuff.

> On Mon, 14 Jul 2003, Jamie Lokier wrote:
>
> > (a) fd isn't a socket
> > (b) fd isn't a TCP socket
>
> Jamie, libraries, like for example libevent, are completely generic indeed.
> They fetch events and they call the associated callback. You obviously
> know, inside your callback, which kind of fd you are working on.

I disagree: inside a stream parser callback (e.g. an XML transcoder) I
prefer _not_ to know the difference between the pipe, file, tty and socket
that I am reading.

> So you use the reading function that best fits the fd type. Obviously
> the read(2) trick only works for stream-type fds.

Stream fds, provided you don't hit the unusual cases. Point taken.

Now I'm saying there's an interface which is no more or less complicated,
but _doesn't_ require you to treat different kinds of fds differently.
So I can write code which uses the read() trick universally without having
to pass that extra parameter, EOF_SETS_RDHUP, to the event callback :)

> > (c) kernel version <= 2.5.75
>
> Obviously, POLLRDHUP is not yet inside the kernel :)

Quite. When you write an app that uses it and the read(2) trick you'll
see the bug which Eric brought up :)

I'm saying there's a way to write an app which can use the read(2)
trick, yet which does _not_ hang on older kernels. Hence it is robust.

My overall philosophy on this: the less object A needs to know about
object B, the better, right? Right? :)

-- Jamie
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Davide Libenzi @ 2003-07-14 6:24 UTC (permalink / raw)
To: Jamie Lokier
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > > (d) SO_RCVLOWAT < s
> >
> > This does not apply with non-blocking fds.
>
> Look at the line "if (copied >= target)" in tcp_recvmsg.

Look at this :

	timeo = sock_rcvtimeo(sk, nonblock);

;)

> > > (e) there is urgent data with OOBINLINE (I think)
> >
> > You obviously need an EPOLLPRI check in your read handling routine if
> > your app is expecting urgent data.
>
> Normal behaviour is for urgent data to be discarded, I believe. Now
> if someone sends it to you, you'll end up with the socket stalling
> with pending data in the buffers. Not saying whether you care, it's
> just a difference of behaviour to be noted and a potential DoS
> (filling socket buffers which the app doesn't know to empty).

Yes, with OOBINLINE you need to take care of EPOLLPRI if you want to use
the read(2) trick. The OOB data virtually breaks the read.

> > On Mon, 14 Jul 2003, Jamie Lokier wrote:
> >
> > > (a) fd isn't a socket
> > > (b) fd isn't a TCP socket
> >
> > Jamie, libraries, like for example libevent, are completely generic indeed.
> > They fetch events and they call the associated callback. You obviously
> > know, inside your callback, which kind of fd you are working on.
>
> I disagree: inside a stream parser callback (e.g. an XML transcoder) I
> prefer _not_ to know the difference between the pipe, file, tty and socket
> that I am reading.

These are streams and you can use the read(2) trick without problems. I
don't think you want to mount your XML parser over UDP.

> > > (c) kernel version <= 2.5.75
> >
> > Obviously, POLLRDHUP is not yet inside the kernel :)
>
> Quite. When you write an app that uses it and the read(2) trick you'll
> see the bug which Eric brought up :)
>
> I'm saying there's a way to write an app which can use the read(2)
> trick, yet which does _not_ hang on older kernels. Hence it is robust.

How, if you do not change the kernel by making it return an extra flag ?

- Davide
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Jamie Lokier @ 2003-07-14 6:57 UTC (permalink / raw)
To: Davide Libenzi
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> > > > (d) SO_RCVLOWAT < s
> > > This does not apply with non-blocking fds.
> > Look at the line "if (copied >= target)" in tcp_recvmsg.
>
> Look at this :
>
> 	timeo = sock_rcvtimeo(sk, nonblock);

How does `nonblock' prevent short reads? I don't see it.

> > I disagree: inside a stream parser callback (e.g. an XML transcoder) I
> > prefer _not_ to know the difference between the pipe, file, tty and
> > socket that I am reading.
>
> These are streams and you can use the read(2) trick without problems. I
> don't think you want to mount your XML parser over UDP.

You cannot use the read(2) trick with a tty or file; RDHUP doesn't help.

> > > > (c) kernel version <= 2.5.75
> > > Obviously, POLLRDHUP is not yet inside the kernel :)
> > Quite. When you write an app that uses it and the read(2) trick you'll
> > see the bug which Eric brought up :)
> >
> > I'm saying there's a way to write an app which can use the read(2)
> > trick, yet which does _not_ hang on older kernels. Hence it is robust.
>
> How, if you do not change the kernel by making it return an extra flag ?

By defining the interface such that _not_ setting the flag merely
suppresses the optimisation; it doesn't stop the program from working.

-- Jamie
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Jamie Lokier @ 2003-07-14 3:12 UTC (permalink / raw)
To: Davide Libenzi
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Jamie Lokier wrote:
> (e) there is urgent data with OOBINLINE (I think)

To be more precise: using the POLLRDHUP patch as-is, if someone sends
your program some data, then an URGent segment, then a FIN with
optional data in between, your program won't notice the second batch of
data or the FIN and will fail to clean up the socket.

-- Jamie
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Davide Libenzi @ 2003-07-14 3:17 UTC (permalink / raw)
To: Jamie Lokier
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Jamie Lokier wrote:
> > (e) there is urgent data with OOBINLINE (I think)
>
> To be more precise: using the POLLRDHUP patch as-is, if someone sends
> your program some data, then an URGent segment, then a FIN with
> optional data in between, your program won't notice the second batch of
> data or the FIN and will fail to clean up the socket.

And why ? To me it looks fairly simple. When the FIN is received, a wakeup
is done on the poll wait list and the following f_op->poll will fetch the
RDHUP flag. Then the next epoll_wait() will fetch the event and will have
all the info it needs to do things correctly.

- Davide
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Jamie Lokier @ 2003-07-14 3:35 UTC (permalink / raw)
To: Davide Libenzi
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> > To be more precise: using the POLLRDHUP patch as-is, if someone sends
> > your program some data, then an URGent segment, then a FIN with
> > optional data in between, your program won't notice the second batch of
> > data or the FIN and will fail to clean up the socket.
>
> And why ? To me it looks fairly simple. When the FIN is received, a wakeup
> is done on the poll wait list and the following f_op->poll will fetch the
> RDHUP flag. Then the next epoll_wait() will fetch the event and will have
> all the info it needs to do things correctly.

Burp. You're right.

The loop failure comes when the user sends URG and more data _without_ a
FIN. Then RDHUP is not set, and your loop will read up to just before the
URG and no further. (Normal behaviour would be to skip the URG segment
and continue reading data after it, or to include the URG segment if
OOBINLINE is set.)

Ahem,

-- Jamie
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Jamie Lokier @ 2003-07-14 3:04 UTC (permalink / raw)
To: Davide Libenzi
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> Ouch, I definitely liked more the POLLHUP thing. It is not linked to epoll
> at all.

POLLRDONCE isn't linked to epoll either; it's a valid hint for poll/select
too.

It means something different, i.e. you can't write:

> 	if (events & EPOLLRDHUP) {
> 		d->flags |= HANGUP;
> 		schedule_removal(d);
> 	}

Be careful, because that isn't valid if there is urgent data. You need
to check POLLPRI too. Granted, urgent data is usually best ignored :)

If fast hangup is a useful optimisation too, we should have both flags.
(However, calling read() doesn't seem like a great penalty for that.)

-- Jamie
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Davide Libenzi @ 2003-07-14 3:12 UTC (permalink / raw)
To: Jamie Lokier
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

On Mon, 14 Jul 2003, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > Ouch, I definitely liked more the POLLHUP thing. It is not linked to epoll
> > at all.
>
> POLLRDONCE isn't linked to epoll either; it's a valid hint for poll/select
> too.
>
> It means something different, i.e. you can't write:
>
> > 	if (events & EPOLLRDHUP) {
> > 		d->flags |= HANGUP;
> > 		schedule_removal(d);
> > 	}
>
> Be careful, because that isn't valid if there is urgent data. You need
> to check POLLPRI too. Granted, urgent data is usually best ignored :)

I didn't want to code the whole application here, hope you understand ;)

> If fast hangup is a useful optimisation too, we should have both flags.
> (However, calling read() doesn't seem like a great penalty for that.)

Indeed. Hangup cases are a small fraction of the standard ones, so it
makes no sense to optimize for them by trying to avoid the read
altogether. And the name READONCE seems to imply that you can't read(2)
twice. I'd rather prefer the RDHUP flag that tells me: there's a hangup
condition for sure, and you might also find some data since POLLIN is set.

- Davide
* Re: POLLRDONCE optimisation for epoll users (was: epoll and half closed TCP connections)
From: Jamie Lokier @ 2003-07-14 3:27 UTC (permalink / raw)
To: Davide Libenzi
Cc: David S. Miller, Eric Varsanyi, Linux Kernel Mailing List, kuznet

Davide Libenzi wrote:
> And the name READONCE seems to imply that you can't read(2) twice.

Like all POLL* flags, you can always do more than it implies and get
EAGAIN :) I don't care about the name, feel free to pick another.

> I'd rather prefer the RDHUP flag that tells me: there's a hangup
> condition for sure, and you might also find some data since POLLIN is set.

Yeah, but it doesn't stop the do-while loop from being broken :)

-- Jamie
* Re: [Patch][RFC] epoll and half closed TCP connections
From: kuznet @ 2003-07-14 17:09 UTC (permalink / raw)
To: David S. Miller
Cc: davidel, e0206, linux-kernel

Hello!

> Alexey, they seem to want to add some kind of POLLRDHUP thing,
> comments wrt. TCP and elsewhere in the networking? See below...

I see. It is highly reasonable. Unlike SVR4 POLLHUP. :-)

Well, "elsewhere" is mostly af_unix; half-duplex close makes sense only
for tcp and af_unix.

Alexey
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Davide Libenzi @ 2003-07-14 17:09 UTC (permalink / raw)
To: kuznet
Cc: David S. Miller, e0206, Linux Kernel Mailing List

On Mon, 14 Jul 2003 kuznet@ms2.inr.ac.ru wrote:

> Hello!
>
> > Alexey, they seem to want to add some kind of POLLRDHUP thing,
> > comments wrt. TCP and elsewhere in the networking? See below...
>
> I see. It is highly reasonable. Unlike SVR4 POLLHUP. :-)
>
> Well, "elsewhere" is mostly af_unix; half-duplex close makes sense only
> for tcp and af_unix.

If you agree, guys, we can prepare a patch that does the handling in all
the places where it is meaningful, so that you can look at it.

- Davide
* Re: [Patch][RFC] epoll and half closed TCP connections
From: Jamie Lokier @ 2003-07-14 21:45 UTC (permalink / raw)
To: kuznet
Cc: David S. Miller, davidel, e0206, linux-kernel

kuznet@ms2.inr.ac.ru wrote:
> Well, "elsewhere" is mostly af_unix; half-duplex close makes sense only
> for tcp and af_unix.

What about sctp - can that do half-duplex close?

-- Jamie