linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* re: [PATCH] /dev/epoll update ...
@ 2001-09-19  2:20 Dan Kegel
  2001-09-19  6:25 ` Dan Kegel
                   ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Dan Kegel @ 2001-09-19  2:20 UTC (permalink / raw)
  To: linux-kernel, Davide Libenzi

Davide wrote:

> The /dev/epoll patch has been updated :
> 
> *) Stale events removal
> *) Help in Configure.help ( thanks to David E. Weekly )
> *) Fit 2.4.9
> ...
> http://www.xmailserver.org/linux-patches/nio-improve.html

Davide, 
I'm getting ready to stress-test /dev/epoll finally.
In porting my Poller_devpoll.{cc,h} to support /dev/epoll, I noticed
the following issues:

1. it would be very nice to be able to expand the interest list
   without affecting the currently ready list.  In fact, this may
   be needed to support existing programs.    A quick look at
   your code gives me the impression that it would be easy to add
   a ioctl(kdpfd, EP_REALLOC, newmaxfds) call to do this.  Do you agree?

2. The names made visible to userland by your patch do not follow
   a consistent naming convention.  May I suggest that you use
   EPOLL_ as a uniform prefix, and epoll.h as the user-visible include file?
   http://www.opengroup.org/onlinepubs/007908799/xsh/compilation.html
   shows that Posix cares greatly about this kind of namespace issue,
   and it'd be nice to follow their lead, even though this isn't a Posix
   interface.

3. You modify asm/poll.h.  Can your modifications be restricted to epoll.h 
   instead?  (Hey, I don't know much, maybe there's a good reason you did this.)

Thanks,
Dan Kegel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19  2:20 [PATCH] /dev/epoll update Dan Kegel
@ 2001-09-19  6:25 ` Dan Kegel
  2001-09-19  7:04 ` Christopher K. St. John
  2001-09-19 17:21 ` Davide Libenzi
  2 siblings, 0 replies; 51+ messages in thread
From: Dan Kegel @ 2001-09-19  6:25 UTC (permalink / raw)
  To: linux-kernel, Davide Libenzi

Dan Kegel wrote:
> 1. it would be very nice to be able to expand the interest list
>    without affecting the currently ready list.  In fact, this may
>    be needed to support existing programs.    A quick look at
>    your code gives me the impression that it would be easy to add
>    a ioctl(kdpfd, EP_REALLOC, newmaxfds) call to do this.  Do you agree?

Aw, crap, nevermind.   Since when you expand the interest list
you can double it, this happens so seldom it doesn't matter that
you have to do EP_FREE + EP_ALLOC + EP_POLL. 

I stand by my other two requests, though (the uniform naming convention
and hands off poll.h).

- Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19  2:20 [PATCH] /dev/epoll update Dan Kegel
  2001-09-19  6:25 ` Dan Kegel
@ 2001-09-19  7:04 ` Christopher K. St. John
  2001-09-19 15:37   ` Dan Kegel
  2001-09-19 17:25   ` Davide Libenzi
  2001-09-19 17:21 ` Davide Libenzi
  2 siblings, 2 replies; 51+ messages in thread
From: Christopher K. St. John @ 2001-09-19  7:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dan Kegel, davidel

Dan Kegel wrote:
> 
> I'm getting ready to stress-test /dev/epoll finally.
> In porting my Poller_devpoll.{cc,h} to support /dev/epoll, I noticed
> the following issues:
> 

 Another issue to throw into the mix:

 The Banga, Mogul and Druschel[1] paper (which I understand
was the inspiration for the Solaris /dev/poll which was the
inspiration for /dev/epoll?) talks about having the poll
return the current state of new descriptors. As far as I can
tell, /dev/epoll only gives you events on state changes. So,
for example, if you accept() a new socket and add it to the
interest list, you (probably) won't get a POLLIN. That's
not fatal, but it's awkward.

 The BMD paper suggests making the behavior optional, but
I didn't see anything about it in the Solaris /dev/poll
manpage (and I don't have a copy of Solaris to try it out
on).

 My vote would be to always report the initial state, but
that would make the driver a little more complicated.

 What are the preferred semantics?


[1] http://citeseer.nj.nec.com/banga99scalable.html


-- 
Christopher St. John cks@distributopia.com
DistribuTopia http://www.distributopia.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19  7:04 ` Christopher K. St. John
@ 2001-09-19 15:37   ` Dan Kegel
  2001-09-19 15:59     ` Zach Brown
                       ` (3 more replies)
  2001-09-19 17:25   ` Davide Libenzi
  1 sibling, 4 replies; 51+ messages in thread
From: Dan Kegel @ 2001-09-19 15:37 UTC (permalink / raw)
  To: Christopher K. St. John; +Cc: linux-kernel, davidel

"Christopher K. St. John" wrote:
>  The Banga, Mogul and Druschel[1] paper (which I understand
> was the inspiration for the Solaris /dev/poll which was the
> inspiration for /dev/epoll?) talks about having the poll
> return the current state of new descriptors. As far as I can
> tell, /dev/epoll only gives you events on state changes. So,
> for example, if you accept() a new socket and add it to the
> interest list, you (probably) won't get a POLLIN. That's
> not fatal, but it's awkward.
>...
>  My vote would be to always report the initial state, but
> that would make the driver a little more complicated.
> 
>  What are the preferred semantics?

Taking an extreme but justifiable position for discussion's sake:

Stevens [UNPV1, in chapter on nonblocking accept] suggests that readiness
notifications from the OS should only be considered hints, and that user
programs should behave properly even if the OS feeds it false readiness
events.  

Thus one possible approach would be for /dev/epoll (or users of /dev/epoll)
to assume that an fd is initially ready for all (normal) events, and just
try handling them all.  That probably involves a single system call
to read() (or possibly a call to both write() and read(), or a call to accept(),
or a call to getsockopt() in the case of nonblocking connect), so the overhead
isn't very high.

(In fact, programs that use select(), poll(), or /dev/epoll would benefit
from having a test mode where false readiness events are injected at random;
the program should continue to behave normally, perhaps with slightly increased
CPU usage.)

That said, the principle of least suprise would suggest that /dev/epoll should
indeed return an accurate initial status.  There are a lot of programmers who
don't agree with Stevens on this issue, and who write code that breaks if you
feed it incorrect readiness events.

- Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 15:37   ` Dan Kegel
@ 2001-09-19 15:59     ` Zach Brown
  2001-09-19 17:12     ` Christopher K. St. John
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 51+ messages in thread
From: Zach Brown @ 2001-09-19 15:59 UTC (permalink / raw)
  To: Dan Kegel; +Cc: Christopher K. St. John, linux-kernel, davidel

> Stevens [UNPV1, in chapter on nonblocking accept] suggests that readiness
> notifications from the OS should only be considered hints, and that user
> programs should behave properly even if the OS feeds it false readiness
> events.  

[ ... ]

> That said, the principle of least suprise would suggest that /dev/epoll should
> indeed return an accurate initial status.  There are a lot of programmers who
> don't agree with Stevens on this issue, and who write code that breaks if you
> feed it incorrect readiness events.

They're living a lie :)  A readiness event does not guarantee future
operations, it provides a hint of the status of things at the time the
event was generated.  Networking events can happen that change the status
of sockets from when readiness events come in and when the app tries to
react to them..

	- kernel gets ack, freeing tx queue space
	- the kernel wakes up the task with some POLL_OUT event
	- a packet comes in from the wire that resets the socket
	- the app sees poll_out and tries to write, and is surprised

there are many more situations like this.  readiness is _always_ only
a hint, and the app has to deal with this.

- z

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 15:37   ` Dan Kegel
  2001-09-19 15:59     ` Zach Brown
@ 2001-09-19 17:12     ` Christopher K. St. John
  2001-09-19 17:39     ` Davide Libenzi
  2001-09-19 18:26     ` Alan Cox
  3 siblings, 0 replies; 51+ messages in thread
From: Christopher K. St. John @ 2001-09-19 17:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dan Kegel, davidel

Dan Kegel wrote:
> 
> >  My vote would be to always report the initial state, but
> > that would make the driver more complicated.
> >
> Stevens [UNPV1, in chapter on nonblocking accept] suggests that readiness
> notifications from the OS should only be considered hints, and that user
> programs should behave properly even if the OS feeds it false readiness
> events.
>

 I agree that apps must properly handle incorrect hints, but
there's a difference between:

 A) Signalling readiness when the fd really isn't ready. This
    happens because of the nature of the universe, and isn't
    avoidable (because the state can change after the signal is
    sent but before the signal is received)

 B) Not reliably signalling readiness when the fd is ready.
    This is a bug, because it makes the mechanism 99%
    useless (If you must manually poll all the fd's to make
    sure there hasn't been a lost event, then you haven't
    gained very much)

 Not signalling initial state isn't as bad as (B), because the
app can code around it. But boy it's ugly, and because the
kernel already knows the information, it's 100% fixable in the
driver. (Although I'm not sure how much complexity it would
add to the driver, so I can't comment if the tradeoff it 
worth it)


> Thus one possible approach would be for /dev/epoll (or users of /dev/epoll)
> to assume that an fd is initially ready for all (normal) events, and just
> try handling them all. 
>

 Right, that's the solution mentioned in the BMD paper, and
that's what I've done. But it's (IMHO) ugly and (as you point
out) unexpected. 

 Anybody know what Solaris /dev/poll does? The man page I 
read wasn't clear, and I don't have Solaris box to try it
out on.


-- 
Christopher St. John cks@distributopia.com
DistribuTopia http://www.distributopia.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* re: [PATCH] /dev/epoll update ...
  2001-09-19  2:20 [PATCH] /dev/epoll update Dan Kegel
  2001-09-19  6:25 ` Dan Kegel
  2001-09-19  7:04 ` Christopher K. St. John
@ 2001-09-19 17:21 ` Davide Libenzi
  2 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-19 17:21 UTC (permalink / raw)
  To: Dan Kegel; +Cc: linux-kernel


On 19-Sep-2001 Dan Kegel wrote:
> Davide wrote:
> 
>> The /dev/epoll patch has been updated :
>> 
>> *) Stale events removal
>> *) Help in Configure.help ( thanks to David E. Weekly )
>> *) Fit 2.4.9
>> ...
>> http://www.xmailserver.org/linux-patches/nio-improve.html
> 
> Davide, 
> I'm getting ready to stress-test /dev/epoll finally.
> In porting my Poller_devpoll.{cc,h} to support /dev/epoll, I noticed
> the following issues:

Pls wait the end of today to let me update the patch correctly.


> 
> 2. The names made visible to userland by your patch do not follow
>    a consistent naming convention.  May I suggest that you use
>    EPOLL_ as a uniform prefix, and epoll.h as the user-visible include file?
>    http://www.opengroup.org/onlinepubs/007908799/xsh/compilation.html
>    shows that Posix cares greatly about this kind of namespace issue,
>    and it'd be nice to follow their lead, even though this isn't a Posix
>    interface.

Posix spoke :) I'll change it in the next versions.



> 3. You modify asm/poll.h.  Can your modifications be restricted to epoll.h 
>    instead?  (Hey, I don't know much, maybe there's a good reason you did this.)

This is where flags are stored and using an external file could lead to a collision
when other coders will add flags. IMHO is better to have a centralized definition
of these flags.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19  7:04 ` Christopher K. St. John
  2001-09-19 15:37   ` Dan Kegel
@ 2001-09-19 17:25   ` Davide Libenzi
  2001-09-19 19:03     ` Christopher K. St. John
  1 sibling, 1 reply; 51+ messages in thread
From: Davide Libenzi @ 2001-09-19 17:25 UTC (permalink / raw)
  To: Christopher K. St. John; +Cc: Dan Kegel, linux-kernel


On 19-Sep-2001 Christopher K. St. John wrote:
> Dan Kegel wrote:
>> 
>> I'm getting ready to stress-test /dev/epoll finally.
>> In porting my Poller_devpoll.{cc,h} to support /dev/epoll, I noticed
>> the following issues:
>> 
> 
>  Another issue to throw into the mix:
> 
>  The Banga, Mogul and Druschel[1] paper (which I understand
> was the inspiration for the Solaris /dev/poll which was the
> inspiration for /dev/epoll?) talks about having the poll
> return the current state of new descriptors. As far as I can
> tell, /dev/epoll only gives you events on state changes. So,
> for example, if you accept() a new socket and add it to the
> interest list, you (probably) won't get a POLLIN. That's
> not fatal, but it's awkward.

Being an event change notification you simply can't add the fd
to the "monitor" after you've issued the accept().
The skeleton for /dev/epoll usage is :

while (system_call(...) == FAIL) {

        wait_event();
}



- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 15:37   ` Dan Kegel
  2001-09-19 15:59     ` Zach Brown
  2001-09-19 17:12     ` Christopher K. St. John
@ 2001-09-19 17:39     ` Davide Libenzi
  2001-09-19 18:26     ` Alan Cox
  3 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-19 17:39 UTC (permalink / raw)
  To: Dan Kegel; +Cc: linux-kernel, linux-kernel, Christopher K. St. John


On 19-Sep-2001 Dan Kegel wrote:
> "Christopher K. St. John" wrote:
>>  The Banga, Mogul and Druschel[1] paper (which I understand
>> was the inspiration for the Solaris /dev/poll which was the
>> inspiration for /dev/epoll?) talks about having the poll
>> return the current state of new descriptors. As far as I can
>> tell, /dev/epoll only gives you events on state changes. So,
>> for example, if you accept() a new socket and add it to the
>> interest list, you (probably) won't get a POLLIN. That's
>> not fatal, but it's awkward.
>>...
>>  My vote would be to always report the initial state, but
>> that would make the driver a little more complicated.
>> 
>>  What are the preferred semantics?
> 
> Taking an extreme but justifiable position for discussion's sake:
> 
> Stevens [UNPV1, in chapter on nonblocking accept] suggests that readiness
> notifications from the OS should only be considered hints, and that user
> programs should behave properly even if the OS feeds it false readiness
> events.  
> 
> Thus one possible approach would be for /dev/epoll (or users of /dev/epoll)
> to assume that an fd is initially ready for all (normal) events, and just
> try handling them all.  That probably involves a single system call
> to read() (or possibly a call to both write() and read(), or a call to accept(),
> or a call to getsockopt() in the case of nonblocking connect), so the overhead
> isn't very high.

I think there's an advantage instead.
With the usual scheme :

        select()/poll();
        recv()/send();

you always issue two system calls each time, while with :

        while (recv()/send() == FAIL) {
                wait_event();
        }

you're going to issue two calls only in certain conditions.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 15:37   ` Dan Kegel
                       ` (2 preceding siblings ...)
  2001-09-19 17:39     ` Davide Libenzi
@ 2001-09-19 18:26     ` Alan Cox
  3 siblings, 0 replies; 51+ messages in thread
From: Alan Cox @ 2001-09-19 18:26 UTC (permalink / raw)
  To: Dan Kegel; +Cc: Christopher K. St. John, linux-kernel, davidel

> Stevens [UNPV1, in chapter on nonblocking accept] suggests that readiness
> notifications from the OS should only be considered hints, and that user
> programs should behave properly even if the OS feeds it false readiness
> events.  

For accept this is specifically and definitely true. A pending connection
can go away before you accept it. What happens then is rather OS specific -
BSD unix gives you a socket that has died - which can be handy and avoids
the problem but others don't all do the same

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 17:25   ` Davide Libenzi
@ 2001-09-19 19:03     ` Christopher K. St. John
  2001-09-19 19:30       ` Davide Libenzi
  0 siblings, 1 reply; 51+ messages in thread
From: Christopher K. St. John @ 2001-09-19 19:03 UTC (permalink / raw)
  To: linux-kernel; +Cc: Davide Libenzi, Dan Kegel

Davide Libenzi wrote:
> 
> > /dev/epoll only gives you events on state changes. So,
> > for example, if you accept() a new socket and add it to the
> > interest list, you (probably) won't get a POLLIN. That's
> > not fatal, but it's awkward.
> 
> Being an event change notification you simply can't add the fd
> to the "monitor" after you've issued the accept().
> The skeleton for /dev/epoll usage is :
> 
> while (system_call(...) == FAIL) {
> 
>         wait_event();
> }
> 

 I'm not sure I understand. I'm assuming you can do
something along the lines of:

 // application accepts new socket
 new_socket_fd = accept()

 // application registers interest with epoll
 write(dev_poll_fd, new_socket_fd):
   drivers/char/eventpoll.c:ep_insert():
    - add new_socket_fd to interest list
    - check new_socket_fd for readable, writable, and
      error. if any true, then add new event to 
      event queue, as if the state had changed.

 // application asks for current set of events
 app: ioctl(dev_poll_fd, EP_POLL):
   drivers/char/eventpoll.c:ep_poll():
     - return the current event queue

 In other words, when new fd's are added to the
interest set, you generate synthetic events which
are returned at the next ioctl(EP_POLL).

 Are you saying that isn't possible? It's the
suggested behavior from the BMD paper, so evidently
they got it to work somehow (and I suspect it's how
Solaris /dev/poll works, but I'm not sure)

-- 
Christopher St. John cks@distributopia.com
DistribuTopia http://www.distributopia.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 19:03     ` Christopher K. St. John
@ 2001-09-19 19:30       ` Davide Libenzi
  2001-09-19 21:49         ` Christopher K. St. John
  0 siblings, 1 reply; 51+ messages in thread
From: Davide Libenzi @ 2001-09-19 19:30 UTC (permalink / raw)
  To: Christopher K. St. John; +Cc: Dan Kegel, linux-kernel


On 19-Sep-2001 Christopher K. St. John wrote:
> Davide Libenzi wrote:
>> 
>> > /dev/epoll only gives you events on state changes. So,
>> > for example, if you accept() a new socket and add it to the
>> > interest list, you (probably) won't get a POLLIN. That's
>> > not fatal, but it's awkward.
>> 
>> Being an event change notification you simply can't add the fd
>> to the "monitor" after you've issued the accept().
>> The skeleton for /dev/epoll usage is :
>> 
>> while (system_call(...) == FAIL) {
>> 
>>         wait_event();
>> }
>> 
> 
>  I'm not sure I understand. I'm assuming you can do
> something along the lines of:
> 
>  // application accepts new socket
>  new_socket_fd = accept()
> 
>  // application registers interest with epoll
>  write(dev_poll_fd, new_socket_fd):
>    drivers/char/eventpoll.c:ep_insert():
>     - add new_socket_fd to interest list
>     - check new_socket_fd for readable, writable, and
>       error. if any true, then add new event to 
>       event queue, as if the state had changed.

No it does't check. It's not needed for how it works.


>  // application asks for current set of events
>  app: ioctl(dev_poll_fd, EP_POLL):
>    drivers/char/eventpoll.c:ep_poll():
>      - return the current event queue
> 
>  In other words, when new fd's are added to the
> interest set, you generate synthetic events which
> are returned at the next ioctl(EP_POLL).
> 
>  Are you saying that isn't possible? It's the
> suggested behavior from the BMD paper, so evidently
> they got it to work somehow (and I suspect it's how
> Solaris /dev/poll works, but I'm not sure)

select()/poll() works in a different way :

1)        select()/poll();
2)        recv()/send();

while /dev/epoll works like described above :

1)        if (recv()/send() == FAIL)
2)                wait_event();

I intentionally changed the name to epoll coz it works in a different way.



- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 19:30       ` Davide Libenzi
@ 2001-09-19 21:49         ` Christopher K. St. John
  2001-09-19 22:11           ` Davide Libenzi
  0 siblings, 1 reply; 51+ messages in thread
From: Christopher K. St. John @ 2001-09-19 21:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Davide Libenzi, Dan Kegel

Davide Libenzi wrote:
> 
> >     - check new_socket_fd for readable, writable, and
> >       error. if any true, then add new event to
> >       event queue, as if the state had changed.
> 
> No it does't check. It's not needed for how it works.
> 

 Yes, I see that it currently works that way. I'm
suggesting that it's a needlessly awkward way to work.
It also results in thousands of spurious syscalls a
second as servers are forced to double check there
isn't i/o to be done.

 This is frustrating, as the application must ask for
information that the kernel could have easily provided
in the first place.

 Providing an initial set of events makes application
programming easier, doesn't appear to add significant
complexity to the driver (maybe), greatly reduces the
number of required system calls, and still fits neatly
into the conceptual api model. It seems like a clear
win.


> I intentionally changed the name to epoll because it
> works in a different way.
>

 Am I missing something? I don't think you'd need a
linear scan of anything, and there wouldn't be any
changes to the api. Existing code would work exactly
the same. Etc.

 It's Davide's patch, and if he doesn't like my
suggestion, I certainly don't expect him to change his
code. If there's any consensus that the "initial event
set" behavior is a good thing, I'd be willing to whip
up a patch to Davide's patch. OTOH, if there's a good
reason the changes are a bad thing, I don't want to
confuse the issue with yet-another /dev/poll variant.

 Does anybody else have an opinion?


-- 
Christopher St. John cks@distributopia.com
DistribuTopia http://www.distributopia.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 21:49         ` Christopher K. St. John
@ 2001-09-19 22:11           ` Davide Libenzi
  2001-09-19 23:24             ` Christopher K. St. John
                               ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-19 22:11 UTC (permalink / raw)
  To: Christopher K. St. John; +Cc: Dan Kegel, linux-kernel


On 19-Sep-2001 Christopher K. St. John wrote:
> Davide Libenzi wrote:
>> 
>> >     - check new_socket_fd for readable, writable, and
>> >       error. if any true, then add new event to
>> >       event queue, as if the state had changed.
>> 
>> No it does't check. It's not needed for how it works.
>> 
> 
>  Yes, I see that it currently works that way. I'm
> suggesting that it's a needlessly awkward way to work.
> It also results in thousands of spurious syscalls a
> second as servers are forced to double check there
> isn't i/o to be done.

Again :

1)      select()/poll();
2)      recv()/send();

vs :

1)      if (recv()/send() == FAIL)
2)              ioctl(EP_POLL);


When there's no data/tx buffer full these will result in 2 syscalls while
if data is available/tx buffer ok the first method will result in 2 syscalls
while the second will never call the ioctl(). 
It looks very linear to me, with select()/poll() you're asking for a state while
with /dev/epoll you're asking for a state change.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 22:11           ` Davide Libenzi
@ 2001-09-19 23:24             ` Christopher K. St. John
  2001-09-19 23:52               ` Davide Libenzi
  2001-09-20  2:13             ` Dan Kegel
  2001-09-21  5:59             ` Ton Hospel
  2 siblings, 1 reply; 51+ messages in thread
From: Christopher K. St. John @ 2001-09-19 23:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Davide Libenzi, Dan Kegel

Davide Libenzi wrote:
> 
> 1)      select()/poll();
> 2)      recv()/send();
> 
> vs :
> 
> 1)      if (recv()/send() == FAIL)
> 2)              ioctl(EP_POLL);
> 
> When there's no data/tx buffer full these will result in 2 syscalls while
> if data is available/tx buffer ok the first method will result in 2 syscalls
> while the second will never call the ioctl().
> It looks very linear to me, with select()/poll() you're asking for a state while
> with /dev/epoll you're asking for a state change.
> 

 Ok, if we're just disagreeing about the best api,
then I can live with that. But it appears we're
talking at cross-purposes, so I want to try this one
more time. I'll lay my though processes out in detail,
and you can tell me at which step I'm going wrong:


 Normally, you'd spend most of your time sitting in
ioctl(EP_POLL) waiting for something to happen. So
that's one syscall.

 If you get an event that indicates you can accept()
a new connection, then you do an accept(). Assume it
succeeds. That's two syscalls. Then you register
interest in the fd with a write to /dev/poll, that's
three.

 With the current /dev/epoll, you must try to read()
the new socket before you go back to ioctl(EP_POLL),
just in case there is data available. You expect
there isn't, but you have to try. This is the step
I'm talking about. That's four.

 Assume data was not available, so you loop back
to ioctl(EP_POLL) and wait for an event. That's five
syscalls. The event comes in, you do another read()
on the socket, and probably get some data. That's
six syscalls to finally get your data.

 ioctl(kpfd, EP_POLL)	1     wait for events
 s = accept()           2     accept a new socket
 write(kpfd, s)         3     register interest
 n = read(s)            4 <-- annoying test-read
 ioctl(kpfd, EP_POLL)   5     wait for events
 n = read(s)            6     get some data

 You have a similiar problem with write's, but I'm
guessing it's safe to assume the first write will
always succeed, so it's awkward but not a big
problem.

 If /dev/epoll tested the initial state of the socket,
then there would be no need for the test read:

 ioctl(kpfd, EP_POLL)	1     wait for events
 s = accept()		2     accept a new socket
 write(kpfd, s)		3     register interest
 ioctl(kpfd, EP_POLL)	4     wait for events
 n = read(s)		5     get some data

 So, we've saved a syscall and, perhaps more importantly,
we don't have to keep a list of to-be-read-just-in-case
fd's sitting around. I wouldrather make this a "clean
api" argument than a performance argument, since it's
unclear that there is really any significant speed
difference in practice.

 Note that the number of unnecessary syscalls is much
greater than 20%, since on a heavily loaded server, you
could be doing 1000's of unecessary reads for every
ioctl(EP_POLL).

 On a fast local network you'd expect the test reads
to mostly return something, so it's no big deal. But
if you've got 10k very slow connections...

 There's a good summary of the problem in the Banga,
Mogul and Druschel[1] paper at:

  http://citeseer.nj.nec.com/banga99scalable.html

 Page 5, right hand column, third paragraph.

 By the way, thanks for the patch. I know I've been
complaining about it, but I wouldn't have bothered
unless I thought it was a good thing. I appreciate
your taking the time to write and release it.


-- 
Christopher St. John cks@distributopia.com
DistribuTopia http://www.distributopia.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 23:24             ` Christopher K. St. John
@ 2001-09-19 23:52               ` Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-19 23:52 UTC (permalink / raw)
  To: Christopher K. St. John; +Cc: Dan Kegel, linux-kernel


On 19-Sep-2001 Christopher K. St. John wrote:
> Davide Libenzi wrote:
>> 
>> 1)      select()/poll();
>> 2)      recv()/send();
>> 
>> vs :
>> 
>> 1)      if (recv()/send() == FAIL)
>> 2)              ioctl(EP_POLL);
>> 
>> When there's no data/tx buffer full these will result in 2 syscalls while
>> if data is available/tx buffer ok the first method will result in 2 syscalls
>> while the second will never call the ioctl().
>> It looks very linear to me, with select()/poll() you're asking for a state while
>> with /dev/epoll you're asking for a state change.
>> 
> 
>  Ok, if we're just disagreeing about the best api,
> then I can live with that. But it appears we're
> talking at cross-purposes, so I want to try this one
> more time. I'll lay my though processes out in detail,
> and you can tell me at which step I'm going wrong:
> 
> 
>  Normally, you'd spend most of your time sitting in
> ioctl(EP_POLL) waiting for something to happen. So
> that's one syscall.
> 
>  If you get an event that indicates you can accept()
> a new connection, then you do an accept(). Assume it
> succeeds. That's two syscalls. Then you register
> interest in the fd with a write to /dev/poll, that's
> three.
> 
>  With the current /dev/epoll, you must try to read()
> the new socket before you go back to ioctl(EP_POLL),
> just in case there is data available. You expect
> there isn't, but you have to try. This is the step
> I'm talking about. That's four.
> 
>  Assume data was not available, so you loop back
> to ioctl(EP_POLL) and wait for an event. That's five
> syscalls. The event comes in, you do another read()
> on the socket, and probably get some data. That's
> six syscalls to finally get your data.
> 
>  ioctl(kpfd, EP_POLL) 1     wait for events
>  s = accept()           2     accept a new socket
>  write(kpfd, s)         3     register interest
>  n = read(s)            4 <-- annoying test-read
>  ioctl(kpfd, EP_POLL)   5     wait for events
>  n = read(s)            6     get some data

You continue to put the state check ( ioctl() ) before the system call,
that require you to use select()/poll()//dev/poll interfaces that are
state inquiry interfaces.
The /dev/epoll is, like i said before, a state change notification interface.
That's how have been designed and that how it completely avoid fds scan.
If you're looking for a state inquiry interface it's better for you to seek /dev/poll.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 22:11           ` Davide Libenzi
  2001-09-19 23:24             ` Christopher K. St. John
@ 2001-09-20  2:13             ` Dan Kegel
  2001-09-20  2:28               ` Davide Libenzi
  2001-09-21  5:59             ` Ton Hospel
  2 siblings, 1 reply; 51+ messages in thread
From: Dan Kegel @ 2001-09-20  2:13 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Christopher K. St. John, linux-kernel

Davide Libenzi wrote:
> 1)      if (recv()/send() == FAIL)
> 2)              ioctl(EP_POLL);

A lot of people, including me, were under the mistaken impression
that /dev/epoll, like /dev/poll, provided an efficient way to
retrieve the current readiness state of fd's.  I understand from your post
that /dev/epoll's purpose is to retrieve state changes; in other
words, it's exactly like F_SETSIG/F_SETOWN/O_ASYNC except that
the readiness change indications are picked up via an ioctl
rather than via a signal.

A scorecard for the confused (Davide, correct me if I'm wrong):

* API's that allow you to retrieve the current readiness state of
  a set of fd's:  poll(), select(), /dev/poll, kqueue().
  Buzzwords describing this kind of interface: level-triggered, multishot.

* API's that allow you to retrieve *changes* to the readiness state of
  a set of fd's: F_SETSIG/F_SETOWN/O_ASYNC + sigtimedwait(), /dev/epoll, kqueue().
  Buzzwords describing this kind of interface: edge-triggered, single-shot.

(Note that kqueue is in both camps.)

Er, I guess that means I'll rip up the /dev/epoll support I based on my
/dev/poll code, and replace it with some based on my O_ASYNC code...

- Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20  2:13             ` Dan Kegel
@ 2001-09-20  2:28               ` Davide Libenzi
  2001-09-20  3:03                 ` Dan Kegel
  2001-09-20  4:32                 ` Christopher K. St. John
  0 siblings, 2 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-20  2:28 UTC (permalink / raw)
  To: Dan Kegel; +Cc: linux-kernel, linux-kernel, Christopher K. St. John


On 20-Sep-2001 Dan Kegel wrote:
> Davide Libenzi wrote:
>> 1)      if (recv()/send() == FAIL)
>> 2)              ioctl(EP_POLL);
> 
> A lot of people, including me, were under the mistaken impression
> that /dev/epoll, like /dev/poll, provided an efficient way to
> retrieve the current readiness state of fd's.  I understand from your post
> that /dev/epoll's purpose is to retrieve state changes; in other
> words, it's exactly like F_SETSIG/F_SETOWN/O_ASYNC except that
> the readiness change indications are picked up via an ioctl
> rather than via a signal.
> 
> A scorecard for the confused (Davide, correct me if I'm wrong):
> 
> * API's that allow you to retrieve the current readiness state of
>   a set of fd's:  poll(), select(), /dev/poll, kqueue().
>   Buzzwords describing this kind of interface: level-triggered, multishot.
> 
> * API's that allow you to retrieve *changes* to the readiness state of
>   a set of fd's: F_SETSIG/F_SETOWN/O_ASYNC + sigtimedwait(), /dev/epoll, kqueue().
>   Buzzwords describing this kind of interface: edge-triggered, single-shot.
> 
> (Note that kqueue is in both camps.)
> 
> Er, I guess that means I'll rip up the /dev/epoll support I based on my
> /dev/poll code, and replace it with some based on my O_ASYNC code...


Exactly :)
Here are examples basic functions when used with coroutines :


int dph_connect(struct dph_conn *conn, const struct sockaddr *serv_addr, socklen_t addrlen) {

        if (connect(conn->sfd, serv_addr, addrlen) == -1) {
                if (errno != EWOULDBLOCK && errno != EINPROGRESS)
                        return -1;
                conn->events = POLLOUT | POLLERR | POLLHUP;
                co_resume(conn);
                if (conn->revents & (POLLERR | POLLHUP))
                        return -1;
        }
        return 0;
}

int dph_read(struct dph_conn *conn, char *buf, int nbyte) {
        int n;

        while ((n = read(conn->sfd, buf, nbyte)) < 0) {
                if (errno == EINTR)
                        continue;
                if (errno != EAGAIN && errno != EWOULDBLOCK)
                        return -1;
                conn->events = POLLIN | POLLERR | POLLHUP;
                co_resume(conn);
        }
        return n;
}

int dph_write(struct dph_conn *conn, char const *buf, int nbyte) {
        int n;

        while ((n = write(conn->sfd, buf, nbyte)) < 0) {
                if (errno == EINTR)
                        continue;
                if (errno != EAGAIN && errno != EWOULDBLOCK)
                        return -1;
                conn->events = POLLOUT | POLLERR | POLLHUP;
                co_resume(conn);
        }
        return n;
}

int dph_accept(struct dph_conn *conn, struct sockaddr *addr, int *addrlen) {
        int sfd;

        while ((sfd = accept(conn->sfd, addr, (socklen_t *) addrlen)) < 0) {
                if (errno == EINTR)
                        continue;
                if (errno != EAGAIN && errno != EWOULDBLOCK)
                        return -1;
                conn->events = POLLIN | POLLERR | POLLHUP;
                co_resume(conn);
        }
        return sfd;
}




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20  2:28               ` Davide Libenzi
@ 2001-09-20  3:03                 ` Dan Kegel
  2001-09-20 16:58                   ` Davide Libenzi
  2001-09-20  4:32                 ` Christopher K. St. John
  1 sibling, 1 reply; 51+ messages in thread
From: Dan Kegel @ 2001-09-20  3:03 UTC (permalink / raw)
  To: Davide Libenzi, linux-kernel

One more question: if I guess wrong initially about how many
file descriptors I'll be monitoring with /dev/epoll, and I need
to increase the size of the area inside /dev/epoll in the middle of
my scan through the results, what is the proper sequence of calls?

Some possibilities:

1)  EP_ALLOC, and continue scanning through the results

2)  EP_FREE, EP_ALLOC, EP_POLL because old results are now invalid

3)  EP_FREE, EP_ALLOC, write new copies of all the old fds to /dev/epoll, 
    EP_POLL, and start new scan

I bet it's #3.  Am I right?
- Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20  2:28               ` Davide Libenzi
  2001-09-20  3:03                 ` Dan Kegel
@ 2001-09-20  4:32                 ` Christopher K. St. John
  2001-09-20  4:43                   ` Christopher K. St. John
  2001-09-20 17:18                   ` Davide Libenzi
  1 sibling, 2 replies; 51+ messages in thread
From: Christopher K. St. John @ 2001-09-20  4:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: Davide Libenzi

Davide Libenzi wrote:
> 
> Here are examples basic functions when used with
> coroutines.
>
 
 I think all might be made clear if you did a quick
test harness that didn't use coroutines. I'm guessing
the vast majority of potential users will not be using
a coroutine library.

 On "nio-improve" page, you've got:

        for (;;) {
          evp.ep_timeout = STD_SCHED_TIMEOUT;
          evp.ep_resoff = 0;
          nfds = ioctl(kdpfd, EP_POLL, &evp);
          pfds = (struct pollfd *) (map + evp.ep_resoff);
          for (ii = 0; ii < nfds; ii++, pfds++) {
             ...
          }
        }

 Assume your server is so overloaded that you need
to avoid any unproductive calls to read() or write()
or accept(). Assume that instead of many very fast
connections coming in at a furious rate, you get a
large steady current of very slow connections.

 If you try to flesh out the above template with those
goals in mind, I think you'll quickly see what I've
been trying to get at with regard to the awkwardness
of not getting back some indication of the initial
state of the fd.

 The current situation isn't fatal, just awkward. And
fixable. For the low low price of a tiny bit of
idealogical purity...



-- 
Christopher St. John cks@distributopia.com
DistribuTopia http://www.distributopia.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20  4:32                 ` Christopher K. St. John
@ 2001-09-20  4:43                   ` Christopher K. St. John
  2001-09-20  5:05                     ` Benjamin LaHaise
  2001-09-20 17:18                   ` Davide Libenzi
  1 sibling, 1 reply; 51+ messages in thread
From: Christopher K. St. John @ 2001-09-20  4:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: Davide Libenzi

"Christopher K. St. John" wrote:
> 
> Assume that instead of many very fast
> connections coming in at a furious rate, you get a
> large steady current of very slow connections.
> 

 Sorry, bad editing, that should be:

 Assume a large but bursty current of low bandwidth
high latency connections instead of a continuous steady
flood of high bandwidth low latency connections.


-- 
Christopher St. John cks@distributopia.com
DistribuTopia http://www.distributopia.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20  4:43                   ` Christopher K. St. John
@ 2001-09-20  5:05                     ` Benjamin LaHaise
  2001-09-20 18:25                       ` Davide Libenzi
  0 siblings, 1 reply; 51+ messages in thread
From: Benjamin LaHaise @ 2001-09-20  5:05 UTC (permalink / raw)
  To: Christopher K. St. John; +Cc: linux-kernel, Davide Libenzi

On Wed, Sep 19, 2001 at 11:43:57PM -0500, Christopher K. St. John wrote:
>  Sorry, bad editing, that should be:
> 
>  Assume a large but bursty current of low bandwidth
> high latency connections instead of a continuous steady
> flood of high bandwidth low latency connections.

Isn't asynchronous io a better model for that case?

		-ben

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20  3:03                 ` Dan Kegel
@ 2001-09-20 16:58                   ` Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-20 16:58 UTC (permalink / raw)
  To: Dan Kegel; +Cc: linux-kernel


On 20-Sep-2001 Dan Kegel wrote:
> One more question: if I guess wrong initially about how many
> file descriptors I'll be monitoring with /dev/epoll, and I need
> to increase the size of the area inside /dev/epoll in the middle of
> my scan through the results, what is the proper sequence of calls?
> 
> Some possibilities:
> 
> 1)  EP_ALLOC, and continue scanning through the results
> 
> 2)  EP_FREE, EP_ALLOC, EP_POLL because old results are now invalid
> 
> 3)  EP_FREE, EP_ALLOC, write new copies of all the old fds to /dev/epoll, 
>     EP_POLL, and start new scan
> 
> I bet it's #3.  Am I right?

I'm coding a solution that try to minimize the reallocation cost even if it's better
to preallocate the number of fds without changing it.
If you think to handle 200000 fds in your system the memory cost of the epoll
allocation is nothing compared to the file*, socket buffer, etc...




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20  4:32                 ` Christopher K. St. John
  2001-09-20  4:43                   ` Christopher K. St. John
@ 2001-09-20 17:18                   ` Davide Libenzi
  2001-09-24  0:11                     ` Gordon Oliver
  2001-09-24 19:23                     ` Eric W. Biederman
  1 sibling, 2 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-20 17:18 UTC (permalink / raw)
  To: Christopher K. St. John; +Cc: linux-kernel


On 20-Sep-2001 Christopher K. St. John wrote:
> Davide Libenzi wrote:
>> 
>> Here are examples basic functions when used with
>> coroutines.
>>
>  
>  I think all might be made clear if you did a quick
> test harness that didn't use coroutines. I'm guessing
> the vast majority of potential users will not be using
> a coroutine library.
> 
>  On "nio-improve" page, you've got:
> 
>         for (;;) {
>           evp.ep_timeout = STD_SCHED_TIMEOUT;
>           evp.ep_resoff = 0;
>           nfds = ioctl(kdpfd, EP_POLL, &evp);
>           pfds = (struct pollfd *) (map + evp.ep_resoff);
>           for (ii = 0; ii < nfds; ii++, pfds++) {
>              ...
>           }
>         }

Coroutines or not, this does not change the picture.
All multiplexed servers have an IO driven scheduler that calls
code sections based on the fd.
Obviously if you've a one-thread-per-socket model, epoll is not your answer.



>  Assume your server is so overloaded that you need
> to avoid any unproductive calls to read() or write()
> or accept(). Assume that instead of many very fast
> connections coming in at a furious rate, you get a
> large steady current of very slow connections.

>>>> Sorry, bad editing, that should be:
>> Assume a large but bursty current of low bandwidth
>> high latency connections instead of a continuous steady
>> flood of high bandwidth low latency connections.

>  If you try to flesh out the above template with those
> goals in mind, I think you'll quickly see what I've
> been trying to get at with regard to the awkwardness
> of not getting back some indication of the initial
> state of the fd.
> 
>  The current situation isn't fatal, just awkward. And
> fixable. For the low low price of a tiny bit of
> idealogical purity...

Again, no.
If you need to request the current status of a socket you've to f_ops->poll the fd.
The cost of the extra read, done only for fds that are not "ready", is nothing
compared to the cost of a linear scan with HUGE numbers of fds.
You could implement a solution where the low level io functions goes directly to write
inside the mmapped fd set where the data buffer is empty or the out buffer is full.
This would be a way more intrusive patch whose perf gain won't match the cost.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20  5:05                     ` Benjamin LaHaise
@ 2001-09-20 18:25                       ` Davide Libenzi
  2001-09-20 19:33                         ` Benjamin LaHaise
  0 siblings, 1 reply; 51+ messages in thread
From: Davide Libenzi @ 2001-09-20 18:25 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: linux-kernel, linux-kernel, Christopher K. St. John


On 20-Sep-2001 Benjamin LaHaise wrote:
> On Wed, Sep 19, 2001 at 11:43:57PM -0500, Christopher K. St. John wrote:
>>  Sorry, bad editing, that should be:
>> 
>>  Assume a large but bursty current of low bandwidth
>> high latency connections instead of a continuous steady
>> flood of high bandwidth low latency connections.
> 
> Isn't asynchronous io a better model for that case?

The advantage /dev/epoll has compared to aio_* and RTsig is 
1) multiple event delivery/system call
2) less user<->kernel memory moves

The concept is very similar anyway coz you basically have to initiate the
io-call and wait for an event.
The difference is how events are collected.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20 18:25                       ` Davide Libenzi
@ 2001-09-20 19:33                         ` Benjamin LaHaise
  2001-09-20 19:58                           ` Davide Libenzi
  0 siblings, 1 reply; 51+ messages in thread
From: Benjamin LaHaise @ 2001-09-20 19:33 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: linux-kernel, Christopher K. St. John

On Thu, Sep 20, 2001 at 11:25:13AM -0700, Davide Libenzi wrote:
> The advantage /dev/epoll has compared to aio_* and RTsig is 
> 1) multiple event delivery/system call

This is actually covered in my aio plan, and just needs the kernel 
provided syscall function library support to read from shared memory.  
The ABI I'm using is based on aio_*, but is different.  There are a 
few emails I've written on the subject recently that I can forward to 
you, but the basic API is: io_submit queues aio requests which later 
write a 32 byte completion entry containing the object, user data and 
result codes to a ringbuffer.

> 2) less user<->kernel memory moves
> 
> The concept is very similar anyway coz you basically have to initiate the
> io-call and wait for an event.
> The difference is how events are collected.

See the above. =)  aio also works much better as the io request helps 
define the duration for memory pinning of any O_DIRECT or similar 
operations that allow the hardware to act on user provided buffers.

		-ben

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20 19:33                         ` Benjamin LaHaise
@ 2001-09-20 19:58                           ` Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-20 19:58 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Christopher K. St. John, Christopher K. St. John, linux-kernel


On 20-Sep-2001 Benjamin LaHaise wrote:
> On Thu, Sep 20, 2001 at 11:25:13AM -0700, Davide Libenzi wrote:
>> 2) less user<->kernel memory moves
>> 
>> The concept is very similar anyway coz you basically have to initiate the
>> io-call and wait for an event.
>> The difference is how events are collected.
> 
> See the above. =)  aio also works much better as the io request helps 
> define the duration for memory pinning of any O_DIRECT or similar 
> operations that allow the hardware to act on user provided buffers.

Obviously if you hook down at lower kernel levels you can better optimize the event
notification but my design guide for the patch was to be the less intrusive
as possible.
If you look at the patch, it has a very limited changes inside linux core files
and this will make it usable/merged even if it won't be included inside the
mainstream code.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-19 22:11           ` Davide Libenzi
  2001-09-19 23:24             ` Christopher K. St. John
  2001-09-20  2:13             ` Dan Kegel
@ 2001-09-21  5:59             ` Ton Hospel
  2001-09-21 16:48               ` Davide Libenzi
  2 siblings, 1 reply; 51+ messages in thread
From: Ton Hospel @ 2001-09-21  5:59 UTC (permalink / raw)
  To: linux-kernel

In article <XFMail.20010919151147.davidel@xmailserver.org>,
	Davide Libenzi <davidel@xmailserver.org> writes:
> On 19-Sep-2001 Christopher K. St. John wrote:
>> Davide Libenzi wrote:
> Again :
> 
> 1)      select()/poll();
> 2)      recv()/send();
> 
> vs :
> 
> 1)      if (recv()/send() == FAIL)
> 2)              ioctl(EP_POLL);
> 

mm, I don't really get the second one. What if the scenario is:
In the place you are in your program, you now decide that a
read is in order.  You try read, nothing there yet,
the syscall returns, the data event happens and THEN you go into
the ioctl ?

Possibilities seem:
1) You hang, having missed the only event that will happen
2) Just having data triggers the ioctl (maybe only the first time),
   why not leaving out the initial read then and just do it afterwards
   like select ?
3) It generates a fake event the first time you notify interest, but then
   the startup case leads to doing the read uselessly twice.

Or is there a fourth way I'm missing this really works ?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-21  5:59             ` Ton Hospel
@ 2001-09-21 16:48               ` Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-21 16:48 UTC (permalink / raw)
  To: Ton Hospel; +Cc: linux-kernel


On 21-Sep-2001 Ton Hospel wrote:
> In article <XFMail.20010919151147.davidel@xmailserver.org>,
>       Davide Libenzi <davidel@xmailserver.org> writes:
>> On 19-Sep-2001 Christopher K. St. John wrote:
>>> Davide Libenzi wrote:
>> Again :
>> 
>> 1)      select()/poll();
>> 2)      recv()/send();
>> 
>> vs :
>> 
>> 1)      if (recv()/send() == FAIL)
>> 2)              ioctl(EP_POLL);
>> 
> 
> mm, I don't really get the second one. What if the scenario is:
> In the place you are in your program, you now decide that a
> read is in order.  You try read, nothing there yet,
> the syscall returns, the data event happens and THEN you go into
> the ioctl ?
> 
> Possibilities seem:
> 1) You hang, having missed the only event that will happen
> 2) Just having data triggers the ioctl (maybe only the first time),
>    why not leaving out the initial read then and just do it afterwards
>    like select ?
> 3) It generates a fake event the first time you notify interest, but then
>    the startup case leads to doing the read uselessly twice.
> 
> Or is there a fourth way I'm missing this really works ?

That was a simplified function :

        while (recv()/send() == FAIL)
                ioctl(EP_POLL);

this is the right code.
If an event happens between the recv() and the ioctl() this is cached by the
driver and it'll be returned from ioctl().




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20 17:18                   ` Davide Libenzi
@ 2001-09-24  0:11                     ` Gordon Oliver
  2001-09-24  0:33                       ` Davide Libenzi
  2001-09-24 19:23                     ` Eric W. Biederman
  1 sibling, 1 reply; 51+ messages in thread
From: Gordon Oliver @ 2001-09-24  0:11 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: linux-kernel

On 2001.09.20 10:18 Davide Libenzi wrote:
> If you need to request the current status of a socket you've to
> f_ops->poll the fd.
> The cost of the extra read, done only for fds that are not "ready", is
> nothing
> compared to the cost of a linear scan with HUGE numbers of fds.
> You could implement a solution where the low level io functions goes
> directly to write
> inside the mmapped fd set where the data buffer is empty or the out
> buffer is full.
> This would be a way more intrusive patch whose perf gain won't match the
> cost.

But you missed the obvious optimization of doing an f_ops->poll when
the file is _added_. This means that you'll get an initial event when
there is data ready. This means you still never do a scan (only check
when an fd is added), but you don't have to do an empty read every time
you add an fd.

Before you argue that this does not save a system call, it will in
the typical case of:
   <add fd>
   <fail read>
   <wait on events>
   <successful read>

Note that it has the additional advantage of making the dispatch code
in the user application easier. You no longer have to do special code
to handle the speculative read after adding the fd.
	-gordo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24  0:11                     ` Gordon Oliver
@ 2001-09-24  0:33                       ` Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-24  0:33 UTC (permalink / raw)
  To: Gordon Oliver; +Cc: linux-kernel


On 24-Sep-2001 Gordon Oliver wrote:
> On 2001.09.20 10:18 Davide Libenzi wrote:
>> If you need to request the current status of a socket you've to
>> f_ops->poll the fd.
>> The cost of the extra read, done only for fds that are not "ready", is
>> nothing
>> compared to the cost of a linear scan with HUGE numbers of fds.
>> You could implement a solution where the low level io functions goes
>> directly to write
>> inside the mmapped fd set where the data buffer is empty or the out
>> buffer is full.
>> This would be a way more intrusive patch whose perf gain won't match the
>> cost.
> 
> But you missed the obvious optimization of doing an f_ops->poll when
> the file is _added_. This means that you'll get an initial event when
> there is data ready. This means you still never do a scan (only check
> when an fd is added), but you don't have to do an empty read every time
> you add an fd.
> 

Why is it so diffucult to understand that /dev/epoll is an "state change" interface.
Even if you add an event at fd insert time this DOES NOT transform /dev/epoll
in a "state monitor" interface.
That means that you can't use code like this :

        if (readable(fd))
                read();

that is common to "state monitor" interfaces.
The code prototype for "state change" interfaces is like :

        while (read() == FAIL)
                wait(READ_EVENT);

Suppose you transform this in a code like this :

int my_smart_read() {
        if (wait(READ_EVENT))
                read();
}

and a packet with 1000 bytes lands onto the terminal.
If you call my_smart_read() and you read 666 bytes the next
time you're going to call my_smart_read() you get stuck.
This coz /dev/epoll catch terminal "state change" events by design.
You could say, "but i want the terminal state to be reported" and
i say "use select()/poll()//dev/poll".


> Before you argue that this does not save a system call, it will in
> the typical case of:
>    <add fd>
>    <fail read>
>    <wait on events>
>    <successful read>
> 
> Note that it has the additional advantage of making the dispatch code
> in the user application easier. You no longer have to do special code
> to handle the speculative read after adding the fd.

You've to use speculative read()/write() coz these are going to change
( rx buffer empty, tx buffer full ) the state of the terminal without
which you'll never receive nexts state change events.
Please look at the code the uses rt signals and if you don't like it
i guess you'll never love /dev/epoll.





- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-20 17:18                   ` Davide Libenzi
  2001-09-24  0:11                     ` Gordon Oliver
@ 2001-09-24 19:23                     ` Eric W. Biederman
  2001-09-24 20:04                       ` Davide Libenzi
  1 sibling, 1 reply; 51+ messages in thread
From: Eric W. Biederman @ 2001-09-24 19:23 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Christopher K. St. John, linux-kernel

Davide Libenzi <davidel@xmailserver.org> writes:

> On 20-Sep-2001 Christopher K. St. John wrote:
> > Davide Libenzi wrote:
> >> 
> >> Here are examples basic functions when used with
> >> coroutines.
> >>
> >  
> >  I think all might be made clear if you did a quick
> > test harness that didn't use coroutines. I'm guessing
> > the vast majority of potential users will not be using
> > a coroutine library.
> > 
> >  On "nio-improve" page, you've got:
> > 
> >         for (;;) {
> >           evp.ep_timeout = STD_SCHED_TIMEOUT;
> >           evp.ep_resoff = 0;
> >           nfds = ioctl(kdpfd, EP_POLL, &evp);
> >           pfds = (struct pollfd *) (map + evp.ep_resoff);
> >           for (ii = 0; ii < nfds; ii++, pfds++) {
> >              ...
> >           }
> >         }
> 
> Coroutines or not, this does not change the picture.
> All multiplexed servers have an IO driven scheduler that calls
> code sections based on the fd.
> Obviously if you've a one-thread-per-socket model, epoll is not your answer.

A couroutine is a thread, the two terms are synonyms.  Generally
coroutines refer to threads with a high volumne of commniucation
between them.  And the terms come from different programming groups.

However a fully cooperative thread (as is implemented in the current
coroutine library) can be quite cheap, and is a easy way to implement
a state machine.  A pure state machine will have a smaller data
footprint than the stack of a cooperative thread, but otherwise
the concepts are pretty much the same.  Language support for
cooperative threads, so you could verify you wouldn't overflow your
stack would be very nice. 

So epoll is a good solution if you have a one-thread-per-socket model,
and you are doing cooperative threads.  The thread being used here is
simply a shortcut to writing a state machine.

Eric

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 19:23                     ` Eric W. Biederman
@ 2001-09-24 20:04                       ` Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-24 20:04 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, linux-kernel, Christopher K. St. John


On 24-Sep-2001 Eric W. Biederman wrote:
>> Coroutines or not, this does not change the picture.
>> All multiplexed servers have an IO driven scheduler that calls
>> code sections based on the fd.
>> Obviously if you've a one-thread-per-socket model, epoll is not your answer.
> 
> A couroutine is a thread, the two terms are synonyms.  Generally
> coroutines refer to threads with a high volumne of commniucation
> between them.  And the terms come from different programming groups.
> 
> However a fully cooperative thread (as is implemented in the current
> coroutine library) can be quite cheap, and is a easy way to implement
> a state machine.  A pure state machine will have a smaller data
> footprint than the stack of a cooperative thread, but otherwise
> the concepts are pretty much the same.  Language support for
> cooperative threads, so you could verify you wouldn't overflow your
> stack would be very nice. 
> 
> So epoll is a good solution if you have a one-thread-per-socket model,
> and you are doing cooperative threads.  The thread being used here is
> simply a shortcut to writing a state machine.

If you'd be the os i guess you'd not say the same :)
It was pretty clear the model i meant was one real thread/process per fd.
The main difference with coroutines is that the /dev/epoll engine become
the scheduler of your app.
It's also clear that you can avoid the coroutines by writing a state machine.
There's a HUGE memory save with the stack removal that you pay with a more
complicated code.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch] /dev/epoll update ...
@ 2002-03-20  3:49 Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2002-03-20  3:49 UTC (permalink / raw)
  To: Linux Kernel Mailing List


*) Export correct symbols from fcblist.c so eventpoll can be used
	as a module ( thx to  Paul P Komkoff Jr  )
*) Added GPL modlicense to eventpoll.c ( thx to  Paul P Komkoff Jr  )
*) Added timeout unsigned long overflow check ( thx to  Ossama Othman  )


http://www.xmailserver.org/linux-patches/nio-improve.html#patches




- Davide







^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-25 17:36     ` [PATCH] " Jonathan Lemon
@ 2001-09-25 18:34       ` Dan Kegel
  0 siblings, 0 replies; 51+ messages in thread
From: Dan Kegel @ 2001-09-25 18:34 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: linux-kernel

Jonathan Lemon wrote:
> 
> In article <local.mail.linux-kernel/3BB03C6A.7D1DD7B3@kegel.com> you write:
> >Right, and kqueue() can't even represent the 'level triggered' style --
> >or at least it isn't clear from the paper that it can!  
>
> Yes it does - kqueue() is 'level-triggered' by default.  

Apologies for the line noise.  I've corrected 
http://www.kegel.com/c10k.html to show kqueue as both edge- and level- triggered.
Further corrections welcome...

> As Christopher pointed out, any event can be converted into an
> edge-triggered style notification simply by setting EV_CLEAR.  However,
> this is not usually a popular model from a programmer's point of view,
> as it increases the complexity of their app.  (This is what I've seen, YMMV)

Agreed; the poll()-like semantics of level-triggering are particularly
forgiving.

- Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
       [not found]   ` <local.mail.linux-kernel/3BAF83EF.C8018E45@distributopia.com>
@ 2001-09-25 17:36     ` Jonathan Lemon
  2001-09-25 18:34       ` Dan Kegel
  0 siblings, 1 reply; 51+ messages in thread
From: Jonathan Lemon @ 2001-09-25 17:36 UTC (permalink / raw)
  To: dank, linux-kernel

In article <local.mail.linux-kernel/3BB03C6A.7D1DD7B3@kegel.com> you write:
>"Christopher K. St. John" wrote:
>>  Ok, just to confirm. Using the language of BSD's
>> kqueue[1]. you've got:
>> 
>>   a) report the event only once when it occurs aka
>> "edge triggered" (EV_CLEAR, not EV_ONESHOT)
>> 
>>  b) continuously report the event as long as the
>> state is valid, aka "level triggered"
>
>Right, and kqueue() can't even represent the 'level triggered' style --
>or at least it isn't clear from the paper that it can!  True "level triggered"
>would require that the kernel track readiness of the affected file descriptors.

Yes it does - kqueue() is 'level-triggered' by default.  You may want
to check my latest USENIX paper, which explains this, as well as some
performance measurements, at:

    http://www.flugsvamp.com/~jlemon/fbsd/kqueue_usenix2001.pdf

The kernel validates the state (or "level") before returning the event
to the user, so the event is guaranteed to be valid at the time the 
syscall returns.

As Christopher pointed out, any event can be converted into an
edge-triggered style notification simply by setting EV_CLEAR.  However,
this is not usually a popular model from a programmer's point of view,
as it increases the complexity of their app.  (This is what I've seen, YMMV)
-- 
Jonathan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 22:09           ` Jamie Lokier
  2001-09-24 22:20             ` Davide Libenzi
@ 2001-09-25  9:25             ` Dan Kegel
  1 sibling, 0 replies; 51+ messages in thread
From: Dan Kegel @ 2001-09-25  9:25 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Davide Libenzi, Eric W. Biederman, linux-kernel, Gordon Oliver

Jamie Lokier wrote:
> > Anyway there's a pretty good patch ( http://www.luban.org/GPL/gpl.html ),
> > that has been tested here :
> >
> > http://www.xmailserver.org/linux-patches/nio-improve.html
> >
> > that implement the signal-per-fd mechanism and it achieves a very good
> > scalability too.
> 
> It has the bonus of requiring no userspace changes too.  Lovely!

Well, not quite *no* userspace changes, but not many.  You have to
use si_band rather than si_code (and with Luban's version, you also
need to set a new flag).

It has some locking problems that only show up under very heavy use,
so caveat emptor.  I put together a stress test 
(http://www.kegel.com/dkftpbench/ with the -sf option);
run that against betaftpd, and around 4500 ftp sessions, you might
see it crash because a signal comes in while the file table is expanding...

(By the way, I finally updated http://www.kegel.com/c10k.html to
distinguish properly between edge-triggered readiness notification
methods and level-triggered ones.  Hope that helps dispel some 
confusion in the future.)
- Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
       [not found] ` <3BAF83EF.C8018E45@distributopia.com>
@ 2001-09-25  8:12   ` Dan Kegel
  0 siblings, 0 replies; 51+ messages in thread
From: Dan Kegel @ 2001-09-25  8:12 UTC (permalink / raw)
  To: Christopher K. St. John, linux-kernel

"Christopher K. St. John" wrote:
>  Ok, just to confirm. Using the language of BSD's
> kqueue[1]. you've got:
> 
>   a) report the event only once when it occurs aka
> "edge triggered" (EV_CLEAR, not EV_ONESHOT)
> 
>  b) continuously report the event as long as the
> state is valid, aka "level triggered"

Right, and kqueue() can't even represent the 'level triggered' style --
or at least it isn't clear from the paper that it can!  True "level triggered"
would require that the kernel track readiness of the affected file descriptors.
 
>  The Banga99 paper certainly appears to describe an
> "edge triggered" interface:
> 
>  "Our new API follows the event-based approach. In
>   this model the kernel simply reports a stream of
>   events to the application. ... The kernel does
>   not track the readiness of any descriptor ... "
> 
>  Libenzi-/dev/epoll, being a partical implementation
> of the Banga99 mechanism, is also edge-triggered.
> 
>  OTOH, the Provos/Lever Linux /dev/poll paper describes
> what appears to be a "level triggered" interface.

Agreed.
 
>  Now for a question: My initial impression was that
> Solaris-/dev/poll, in contrast to Linux /dev/poll, was
> edge-triggered. That would explain why it might be
> more efficient that Linux-/dev/poll.
> 
>  But I don't have a copy of Solaris, handy, so I
> can't confirm. Do you know for sure? (Or is part of
> my analysis wrong?)

Solaris /dev/poll is definitely level-triggered; see Poller_test.cc in
http://www.kegel.com/dkftpbench/dkftpbench-0.33.tar.gz, which verifies this.
Poller_devpoll.cc is a thin wrapper around /dev/poll, and it definitely exhibits
level-triggered behavior with both Solaris and Linux /dev/poll.

(I later extended Poller to support edge-triggered notifications from the OS,
and translate them to level-triggered notification for the user app. 
Poller_sigio.cc and Poller_sigfd.cc are somewhat fatter wrappers around O_ASYNC,
and achieve level-triggered behavior only with cooperation from the application,
which has to call clearReadiness(fd) when the OS returns EWOULDBLOCK!
Surely the OS could do that internally, eh?)

Java's Selector in JDK 1.4 will have level-triggered behavior, not
edge-triggered behavior, btw.
- Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 22:21               ` Jamie Lokier
@ 2001-09-24 22:30                 ` Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-24 22:30 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel


On 24-Sep-2001 Jamie Lokier wrote:
> Davide Libenzi wrote:
>> Sure you can avoid the scan, if you pick up one event at a time.  To
>> be compared to /dev/epoll you need the signal-per-fd patch plus a
>> method to collect the whole event-set in a single system call ( see
>> perfs ).
> 
> Yes, I agree.  A variant of sigwaitinfo that will return multiple queued
> signals was mentioned ages ago, but because the siginfo structure is
> much larger than is needed, that isn't a very effective use of cache.
> 
> Something specialised for fd events is more appropriate IMO.  Large
> numbers of queued RT signals aren't used for anything else AFAIK anyway,
> not even timers.

The bottom line is, for what i saw in my tests, that both /dev/epoll and
RT signals ( with signal-per-fd ) offers good performance and scalability.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 22:20             ` Davide Libenzi
@ 2001-09-24 22:21               ` Jamie Lokier
  2001-09-24 22:30                 ` Davide Libenzi
  0 siblings, 1 reply; 51+ messages in thread
From: Jamie Lokier @ 2001-09-24 22:21 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Gordon Oliver, linux-kernel, Dan Kegel, Eric W. Biederman

Davide Libenzi wrote:
> Sure you can avoid the scan, if you pick up one event at a time.  To
> be compared to /dev/epoll you need the signal-per-fd patch plus a
> method to collect the whole event-set in a single system call ( see
> perfs ).

Yes, I agree.  A variant of sigwaitinfo that will return multiple queued
signals was mentioned ages ago, but because the siginfo structure is
much larger than is needed, that isn't a very effective use of cache.

Something specialised for fd events is more appropriate IMO.  Large
numbers of queued RT signals aren't used for anything else AFAIK anyway,
not even timers.

-- Jamie

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 22:09           ` Jamie Lokier
@ 2001-09-24 22:20             ` Davide Libenzi
  2001-09-24 22:21               ` Jamie Lokier
  2001-09-25  9:25             ` Dan Kegel
  1 sibling, 1 reply; 51+ messages in thread
From: Davide Libenzi @ 2001-09-24 22:20 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Gordon Oliver, Gordon Oliver, linux-kernel, Dan Kegel, Eric W. Biederman


On 24-Sep-2001 Jamie Lokier wrote:
> Davide Libenzi wrote:
>> > Well, memory move consists of 2 words: (a) file descriptor; (b) poll
>> > state/edge flags.
>> 
>> 2-words * number-of-ready-fds == pretty-high-cache-drain
> 
> Perhaps there is a cache issue, but note it is the number of _new_ ready
> fds (since the last sample), not the number currently ready.
> 
>> > That will be completely swamped by the system calls and so on needed to
>> > processes each of the file descriptors.  I.e. no scalability problem here.
>> 
>> The other issue is that by keeping infos in file* you'll have to scan each fd
>> to report the ready ones, that will make the method to fall back in O(n).
> 
> No, that would be silly.  You would queue signals exactly as they are
> queued now (but collapsing multiple signals per fd into one).
> 
>> Anyway there's a pretty good patch ( http://www.luban.org/GPL/gpl.html ),
>> that has been tested here :
>> 
>> http://www.xmailserver.org/linux-patches/nio-improve.html
>> 
>> that implement the signal-per-fd mechanism and it achieves a very good
>> scalability too.
> 
> It has the bonus of requiring no userspace changes too.  Lovely!

Sure you can avoid the scan, if you pick up one event at a time.
To be compared to /dev/epoll you need the signal-per-fd patch plus a method to
collect the whole event-set in a single system call ( see perfs ).




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 22:08         ` Davide Libenzi
@ 2001-09-24 22:09           ` Jamie Lokier
  2001-09-24 22:20             ` Davide Libenzi
  2001-09-25  9:25             ` Dan Kegel
  0 siblings, 2 replies; 51+ messages in thread
From: Jamie Lokier @ 2001-09-24 22:09 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Eric W. Biederman, Dan Kegel, linux-kernel, Gordon Oliver

Davide Libenzi wrote:
> > Well, memory move consists of 2 words: (a) file descriptor; (b) poll
> > state/edge flags.
> 
> 2-words * number-of-ready-fds == pretty-high-cache-drain

Perhaps there is a cache issue, but note it is the number of _new_ ready
fds (since the last sample), not the number currently ready.

> > That will be completely swamped by the system calls and so on needed to
> > processes each of the file descriptors.  I.e. no scalability problem here.
> 
> The other issue is that by keeping infos in file* you'll have to scan each fd
> to report the ready ones, that will make the method to fall back in O(n).

No, that would be silly.  You would queue signals exactly as they are
queued now (but collapsing multiple signals per fd into one).

> Anyway there's a pretty good patch ( http://www.luban.org/GPL/gpl.html ),
> that has been tested here :
> 
> http://www.xmailserver.org/linux-patches/nio-improve.html
> 
> that implement the signal-per-fd mechanism and it achieves a very good
> scalability too.

It has the bonus of requiring no userspace changes too.  Lovely!

-- Jamie


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 21:56       ` Jamie Lokier
@ 2001-09-24 22:08         ` Davide Libenzi
  2001-09-24 22:09           ` Jamie Lokier
  0 siblings, 1 reply; 51+ messages in thread
From: Davide Libenzi @ 2001-09-24 22:08 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Eric W. Biederman, Dan Kegel, linux-kernel, Gordon Oliver


On 24-Sep-2001 Jamie Lokier wrote:
> Davide Libenzi wrote:
>> > You could even keep the memory for the queued signal / event inside
> the file structure.
>> 
>> By keeping the event structure inside the file* require you to collect
>> these events ( read memory moves ) at peek time.
>> With /dev/epoll the event is directly dropped inside the mmaped area.
> 
> Well, memory move consists of 2 words: (a) file descriptor; (b) poll
> state/edge flags.

2-words * number-of-ready-fds == pretty-high-cache-drain


> That will be completely swamped by the system calls and so on needed to
> processes each of the file descriptors.  I.e. no scalability problem here.

The other issue is that by keeping infos in file* you'll have to scan each fd
to report the ready ones, that will make the method to fall back in O(n).
Anyway there's a pretty good patch ( http://www.luban.org/GPL/gpl.html ),
that has been tested here :

http://www.xmailserver.org/linux-patches/nio-improve.html

that implement the signal-per-fd mechanism and it achieves a very good scalability too.





- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 20:09     ` Davide Libenzi
@ 2001-09-24 21:56       ` Jamie Lokier
  2001-09-24 22:08         ` Davide Libenzi
  0 siblings, 1 reply; 51+ messages in thread
From: Jamie Lokier @ 2001-09-24 21:56 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Gordon Oliver, linux-kernel, Dan Kegel, Eric W. Biederman

Davide Libenzi wrote:
> > You could even keep the memory for the queued signal / event inside
the file structure.
> 
> By keeping the event structure inside the file* require you to collect
> these events ( read memory moves ) at peek time.
> With /dev/epoll the event is directly dropped inside the mmaped area.

Well, memory move consists of 2 words: (a) file descriptor; (b) poll
state/edge flags.

That will be completely swamped by the system calls and so on needed to
processes each of the file descriptors.  I.e. no scalability problem here.

-- Jamie

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 19:34   ` Jamie Lokier
@ 2001-09-24 20:09     ` Davide Libenzi
  2001-09-24 21:56       ` Jamie Lokier
  0 siblings, 1 reply; 51+ messages in thread
From: Davide Libenzi @ 2001-09-24 20:09 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Gordon Oliver, Gordon Oliver, linux-kernel, Dan Kegel, Eric W. Biederman


On 24-Sep-2001 Jamie Lokier wrote:
> Eric W. Biederman wrote:
>> > As Davide points out in his reply, /dev/epoll is an exact clone of
>> > the O_SETSIG/O_SETOWN/O_ASYNC realtime signal way of getting readiness
>> > change events, but using a memory-mapped buffer instead of signal delivery
>> > (and obeying an interest mask).  Unlike /dev/poll, it only provides
>> > information about *changes* in readiness.
>> 
>> Right.  But it does one additional thing that the rtsig method doesn't
>> it collapses multiple readiness *changes* into a single readiness change.
>> This allows the kernel to keep a fixed size buffer so you never need
>> to fallback to poll as you need to with the rtsig approach.
> 
> That could be added to rtsigs, with the same result: no need to fallback
> to poll.

There's already a patch that implement this.


> You could even keep the memory for the queued signal / event inside the file structure.

By keeping the event structure inside the file* require you to collect
these events ( read memory moves ) at peek time.
With /dev/epoll the event is directly dropped inside the mmaped area.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24 19:11 ` Eric W. Biederman
@ 2001-09-24 19:34   ` Jamie Lokier
  2001-09-24 20:09     ` Davide Libenzi
  0 siblings, 1 reply; 51+ messages in thread
From: Jamie Lokier @ 2001-09-24 19:34 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Dan Kegel, linux-kernel, Gordon Oliver

Eric W. Biederman wrote:
> > As Davide points out in his reply, /dev/epoll is an exact clone of
> > the O_SETSIG/O_SETOWN/O_ASYNC realtime signal way of getting readiness
> > change events, but using a memory-mapped buffer instead of signal delivery
> > (and obeying an interest mask).  Unlike /dev/poll, it only provides
> > information about *changes* in readiness.
> 
> Right.  But it does one additional thing that the rtsig method doesn't
> it collapses multiple readiness *changes* into a single readiness change.
> This allows the kernel to keep a fixed size buffer so you never need
> to fallback to poll as you need to with the rtsig approach.

That could be added to rtsigs, with the same result: no need to fallback
to poll.  You could even keep the memory for the queued signal / event
inside the file structure.

-- Jamie

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-24  4:16 Dan Kegel
@ 2001-09-24 19:11 ` Eric W. Biederman
  2001-09-24 19:34   ` Jamie Lokier
       [not found] ` <3BAF83EF.C8018E45@distributopia.com>
  1 sibling, 1 reply; 51+ messages in thread
From: Eric W. Biederman @ 2001-09-24 19:11 UTC (permalink / raw)
  To: Dan Kegel; +Cc: linux-kernel, Gordon Oliver

Dan Kegel <dank@kegel.com> writes:

> As Davide points out in his reply, /dev/epoll is an exact clone of
> the O_SETSIG/O_SETOWN/O_ASYNC realtime signal way of getting readiness
> change events, but using a memory-mapped buffer instead of signal delivery
> (and obeying an interest mask).  Unlike /dev/poll, it only provides
> information about *changes* in readiness.

Right.  But it does one additional thing that the rtsig method doesn't
it collapses multiple readiness *changes* into a single readiness change.
This allows the kernel to keep a fixed size buffer so you never need
to fallback to poll as you need to with the rtsig approach.

> I think there is still some confusion out there because of the name
> Davide chose; /dev/epoll is so close to /dev/poll that it lulls many
> people (myself included) into thinking it's a very similar thing.  It ain't.
> (I really have to fix my c10k page to reflect that correctly...)

Hmm.  /dev/epoll could and possibly should remove the readiness event 
if the fd becomes unready before someone gets to reading the
/dev/epoll buffer.  This is a natural extension of collapsing events.

But even with that it would still only give you the state as of the
last state change.  And it you have the state already it expects user
space to remember the state and not the kernel.  Which is both
different from /dev/poll and more efficient.

If the goal is to minimize system calls letting user space assume the
state is initially not ready.  And forcing a state query when
the fd is added should help.   I cannot think of a case where having
the kernel do the query would be necessary though.

If the goal is simply to provide a highly scalable event interface.
The current /dev/epoll sounds very good.  Though I'm not at all
thrilled with the user space interface.  As far as I can tell the case
of a fd becoming not ready is unlikely enough that it probably doesn't
need to be handled.

Eric


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
@ 2001-09-24  4:16 Dan Kegel
  2001-09-24 19:11 ` Eric W. Biederman
       [not found] ` <3BAF83EF.C8018E45@distributopia.com>
  0 siblings, 2 replies; 51+ messages in thread
From: Dan Kegel @ 2001-09-24  4:16 UTC (permalink / raw)
  To: linux-kernel, Gordon Oliver

Gordon Oliver <gordo@pincoya.com> wrote:
> But you missed the obvious optimization of doing an f_ops->poll when
> the file is _added_. This means that you'll get an initial event when
> there is data ready. ...

Note that you can do that in userspace by calling poll(), btw.  That
gets you down to a single extra system call initially.

> Note that it has the additional advantage of making the dispatch code
> in the user application easier. You no longer have to do special code
> to handle the speculative read after adding the fd.

As Davide points out in his reply, /dev/epoll is an exact clone of
the O_SETSIG/O_SETOWN/O_ASYNC realtime signal way of getting readiness
change events, but using a memory-mapped buffer instead of signal delivery
(and obeying an interest mask).  Unlike /dev/poll, it only provides
information about *changes* in readiness.

Everyone who has successfully written code using the O_SETSIG/O_SETOWN/O_ASYNC
code knows that it does not send an initial state event.  This has not
gotten in the way, as a rule.

If it does turn out to be Very Important for these single-shot readiness
notification schemes to generate synthetic initial readiness events,
it should be added both to /dev/epoll and to O_SETSIG/O_SETOWN/O_ASYNC.

I think there is still some confusion out there because of the name
Davide chose; /dev/epoll is so close to /dev/poll that it lulls many
people (myself included) into thinking it's a very similar thing.  It ain't.
(I really have to fix my c10k page to reflect that correctly...)
- Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
  2001-09-21  6:22 Dan Kegel
@ 2001-09-21 18:45 ` Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-21 18:45 UTC (permalink / raw)
  To: Dan Kegel; +Cc: linux-kernel


On 21-Sep-2001 Dan Kegel wrote:
> Davide wrote:
>> If you need to request the current status of 
>> a socket you've to f_ops->poll the fd.
>> The cost of the extra read, done only for fds that are not "ready", is nothing
>> compared to the cost of a linear scan with HUGE numbers of fds.
> 
> Hey, wait a sec, Davide... the whole point of the Solaris /dev/poll
> is that you *don't* need to f_ops->poll the fd, I think.
> And in fact, Solaris /dev/poll is insanely fast, way faster than O(N).

If the fd support hints, yes.


> Consider this: what if we added to your patch logic to clear
> the current read readiness bit for a fd whenever a read() on
> that fd returned EWOULDBLOCK?  Then we're real close to having
> the current readiness state for each fd, as the /dev/poll afficianados 
> want.  Now, there's a lot more work that'd be needed, but maybe you
> get the idea of where some of us are coming from.

Then you'll fall down to /dev/poll and /dev/epoll designed for "state change"
driven servers ( like rtsigs ).
Instead of requesting /dev/epoll changes to make it something that is not born for,
i think that the /dev/poll patch can be improved in a significant way.
The numbers i've got from my test left me quite a bit deluded.




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] /dev/epoll update ...
@ 2001-09-21  6:22 Dan Kegel
  2001-09-21 18:45 ` Davide Libenzi
  0 siblings, 1 reply; 51+ messages in thread
From: Dan Kegel @ 2001-09-21  6:22 UTC (permalink / raw)
  To: linux-kernel, Davide Libenzi

Davide wrote:
> If you need to request the current status of 
> a socket you've to f_ops->poll the fd.
> The cost of the extra read, done only for fds that are not "ready", is nothing
> compared to the cost of a linear scan with HUGE numbers of fds.

Hey, wait a sec, Davide... the whole point of the Solaris /dev/poll
is that you *don't* need to f_ops->poll the fd, I think.
And in fact, Solaris /dev/poll is insanely fast, way faster than O(N).

Consider this: what if we added to your patch logic to clear
the current read readiness bit for a fd whenever a read() on
that fd returned EWOULDBLOCK?  Then we're real close to having
the current readiness state for each fd, as the /dev/poll afficianados 
want.  Now, there's a lot more work that'd be needed, but maybe you
get the idea of where some of us are coming from.

Christopher K. St. John is requesting example code using /dev/epoll
that does not use coroutines.  Fair enough.  Christopher, take a look
at any program that uses the F_SETSIG/F_SETOWN/O_ASYNC/sigio stuff in the
2.4 kernel (for example, my Poller_sigio.cc at http://www.kegel.com/dkftpbench/dkftpbench-0.31.tar.gz )
and mentally replace the sigtimedwait() with Davide's ioctl, kinda.
The overhead of not knowing the initial poll state is at most one
or two system calls per fd over the life of the program, I think,
so it's not too bad.

- Dan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH] /dev/epoll update ...
@ 2001-09-07 19:27 Davide Libenzi
  0 siblings, 0 replies; 51+ messages in thread
From: Davide Libenzi @ 2001-09-07 19:27 UTC (permalink / raw)
  To: lkml


The /dev/epoll patch has been updated :

*) Stale events removal
*) Help in Configure.help ( thanks to David E. Weekly )
*) Fit 2.4.9

This is the link :

http://www.xmailserver.org/linux-patches/nio-improve.html




- Davide


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2002-03-20  3:44 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-09-19  2:20 [PATCH] /dev/epoll update Dan Kegel
2001-09-19  6:25 ` Dan Kegel
2001-09-19  7:04 ` Christopher K. St. John
2001-09-19 15:37   ` Dan Kegel
2001-09-19 15:59     ` Zach Brown
2001-09-19 17:12     ` Christopher K. St. John
2001-09-19 17:39     ` Davide Libenzi
2001-09-19 18:26     ` Alan Cox
2001-09-19 17:25   ` Davide Libenzi
2001-09-19 19:03     ` Christopher K. St. John
2001-09-19 19:30       ` Davide Libenzi
2001-09-19 21:49         ` Christopher K. St. John
2001-09-19 22:11           ` Davide Libenzi
2001-09-19 23:24             ` Christopher K. St. John
2001-09-19 23:52               ` Davide Libenzi
2001-09-20  2:13             ` Dan Kegel
2001-09-20  2:28               ` Davide Libenzi
2001-09-20  3:03                 ` Dan Kegel
2001-09-20 16:58                   ` Davide Libenzi
2001-09-20  4:32                 ` Christopher K. St. John
2001-09-20  4:43                   ` Christopher K. St. John
2001-09-20  5:05                     ` Benjamin LaHaise
2001-09-20 18:25                       ` Davide Libenzi
2001-09-20 19:33                         ` Benjamin LaHaise
2001-09-20 19:58                           ` Davide Libenzi
2001-09-20 17:18                   ` Davide Libenzi
2001-09-24  0:11                     ` Gordon Oliver
2001-09-24  0:33                       ` Davide Libenzi
2001-09-24 19:23                     ` Eric W. Biederman
2001-09-24 20:04                       ` Davide Libenzi
2001-09-21  5:59             ` Ton Hospel
2001-09-21 16:48               ` Davide Libenzi
2001-09-19 17:21 ` Davide Libenzi
  -- strict thread matches above, loose matches on Subject: below --
2002-03-20  3:49 [patch] " Davide Libenzi
     [not found] <local.mail.linux-kernel/3BB03C6A.7D1DD7B3@kegel.com>
     [not found] ` <local.mail.linux-kernel/3BAEB39B.DE7932CF@kegel.com>
     [not found]   ` <local.mail.linux-kernel/3BAF83EF.C8018E45@distributopia.com>
2001-09-25 17:36     ` [PATCH] " Jonathan Lemon
2001-09-25 18:34       ` Dan Kegel
2001-09-24  4:16 Dan Kegel
2001-09-24 19:11 ` Eric W. Biederman
2001-09-24 19:34   ` Jamie Lokier
2001-09-24 20:09     ` Davide Libenzi
2001-09-24 21:56       ` Jamie Lokier
2001-09-24 22:08         ` Davide Libenzi
2001-09-24 22:09           ` Jamie Lokier
2001-09-24 22:20             ` Davide Libenzi
2001-09-24 22:21               ` Jamie Lokier
2001-09-24 22:30                 ` Davide Libenzi
2001-09-25  9:25             ` Dan Kegel
     [not found] ` <3BAF83EF.C8018E45@distributopia.com>
2001-09-25  8:12   ` Dan Kegel
2001-09-21  6:22 Dan Kegel
2001-09-21 18:45 ` Davide Libenzi
2001-09-07 19:27 Davide Libenzi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).