linux-kernel.vger.kernel.org archive mirror
* [RFC] adding aio_readv/writev
@ 2002-09-20 20:39 Shailabh Nagar
       [not found] ` <1032555981.2082.10.camel@dell_ss3.pdx.osdl.net>
  0 siblings, 1 reply; 9+ messages in thread
From: Shailabh Nagar @ 2002-09-20 20:39 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Andrew Morton, Alexander Viro, linux-aio, linux-kernel, lse-tech

Ben,

Currently there is no way to initiate an aio readv/writev in 2.5. There 
were no aio_readv/writev calls in 2.4 either - I'm wondering if there 
was any particular reason for excluding readv/writev operations from aio ?

The read/readv paths have anyway been merged for raw/O_DIRECT and 
regular file read/writes. So why not expose the vector read/write to the 
user by adding the IOCB_CMD_PREADV/IOCB_CMD_READV and 
IOCB_CMD_PWRITEV/IOCB_CMD_WRITEV commands to the aio set. Without that, 
raw/O_DIRECT readv users would need to unnecessarily cycle through their 
iovecs at a library level submitting them individually.
For larger iovecs, user/library code would needlessly deal with multiple 
completions. While I'm not sure of the performance impact of the absence 
of aio_readv/writev, it seems easy enough to provide.
Most of the functions are already in place. We would only
need a way to pass the iovec through the iocb.

I was thinking of something like this:

struct iocb {
	/* ... other fields unchanged ... */
+	union {
		__u64	aio_buf;
+		__u64	aio_iovp;
+	};
+	union {
		__u64	aio_nbytes;
+		__u64	aio_nsegs;
+	};
	/* ... other fields unchanged ... */
};

allowing the iovec pointer and nsegs to be passed into sys_io_submit.
Some code would be added (in the IOCB_CMD_READV case within
io_submit_one) to copy and verify the iovec pointers and then call
aio_readv/aio_writev (if it's defined for the fs).

What do you think ? I wanted to get some feedback before trying to code 
this up.
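
A minimal sketch of how a caller might fill in such a request, assuming the union layout proposed above. The struct `iocb_sketch`, the opcode value `SKETCH_CMD_PREADV`, and the helper `sketch_prep_preadv` are hypothetical names used only for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical opcode value; the proposal only names the command. */
enum { SKETCH_CMD_PREADV = 100 };

/* Simplified stand-in for struct iocb: only the fields discussed here. */
struct iocb_sketch {
	uint16_t aio_lio_opcode;
	uint32_t aio_fildes;
	union {
		uint64_t aio_buf;    /* single-buffer commands */
		uint64_t aio_iovp;   /* vectored commands: address of iovec[] */
	};
	union {
		uint64_t aio_nbytes; /* single-buffer commands */
		uint64_t aio_nsegs;  /* vectored commands: iovec entry count */
	};
	int64_t aio_offset;
};

/* Fill one request describing a vectored read at the given file offset. */
static void sketch_prep_preadv(struct iocb_sketch *cb, int fd,
			       const struct iovec *iov, unsigned nsegs,
			       int64_t offset)
{
	assert(cb != NULL && iov != NULL && nsegs > 0);
	memset(cb, 0, sizeof(*cb));
	cb->aio_lio_opcode = SKETCH_CMD_PREADV;
	cb->aio_fildes = (uint32_t)fd;
	cb->aio_iovp = (uint64_t)(uintptr_t)iov;
	cb->aio_nsegs = nsegs;
	cb->aio_offset = offset;
}
```

Because each pair of fields shares storage, existing single-buffer commands would keep their layout, and only the new opcodes would interpret the same 64-bit slots as an iovec pointer and a segment count.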

While we are on the topic of expanding aio operations, what about 
providing IOCB_CMD_READ/WRITE, distinct from their pread/pwrite 
counterparts? Do you think that's needed?

- Shailabh
	



* Re: [RFC] adding aio_readv/writev
       [not found] ` <1032555981.2082.10.camel@dell_ss3.pdx.osdl.net>
@ 2002-09-23 14:30   ` Shailabh Nagar
  2002-09-23 18:53     ` Clement T. Cole
       [not found]     ` <20020923114104.A11680@redhat.com>
  0 siblings, 2 replies; 9+ messages in thread
From: Shailabh Nagar @ 2002-09-23 14:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Benjamin LaHaise, Andrew Morton, Alexander Viro, linux-aio, linux-kernel

Stephen Hemminger wrote:

>Why not batch up multiple requests with one io_submit? It has the same
>effect, except there would be multiple responses.
>
Even though the multiple iocb's enter the kernel together, they still 
get processed individually so a fair amount of unnecessary data 
transmission and function invocation are still occurring in the submit 
code path.
Depending on how long it takes for io_submit_one to return, there might 
be a reduced probability for merging of io requests at the i/o scheduler.
Finally, the multiple responses need to be handled, as you mentioned. I 
suppose the application could wait for the last request (in the 
io_submit list) and that would most probably ensure that the preceding 
ones were complete as well, but it's not a guarantee offered by the aio 
API, right?
Besides, the application needs the data (represented by multiple 
requests) in one go, so partial completion isn't likely to be useful and 
will only be an overhead.
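
To make the batching alternative concrete, here is a rough sketch of what a library would have to do without a vectored command: turn one iovec into one single-buffer request per segment, each with its own file offset, and later collect one completion per segment. The type `seg_request` and the function `split_iovec` are hypothetical stand-ins for illustration, not part of any aio API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical single-buffer request, standing in for one iocb. */
struct seg_request {
	void    *buf;
	size_t   nbytes;
	int64_t  offset;
};

/*
 * Split one vectored transfer starting at 'offset' into per-segment
 * requests. Returns the number of requests written to out[], or -1
 * if out[] cannot hold them all.
 */
static int split_iovec(const struct iovec *iov, int nsegs, int64_t offset,
		       struct seg_request *out, int max_out)
{
	assert(iov != NULL && out != NULL);
	if (nsegs > max_out)
		return -1;
	for (int i = 0; i < nsegs; i++) {
		out[i].buf = iov[i].iov_base;
		out[i].nbytes = iov[i].iov_len;
		out[i].offset = offset;
		/* The next segment continues where this one ends. */
		offset += (int64_t)iov[i].iov_len;
	}
	return nsegs;
}
```

A transfer with N segments therefore costs N requests and N completions in this scheme, where a vectored command would need one of each.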

While a quantitative assessment of the above tradeoffs is possible, it 
will be difficult to make a good comparison before "true" aio 
functionality is in place for 2.5. Such an assessment is unlikely to 
happen before the feature freeze takes effect. So I'm making a case for 
putting in async vector I/O interfaces in for the following three reasons:
- the synchronous API does provide separate entry points for vector I/O. 
Extending the same to the async interfaces, especially when it doesn't 
even involve creating new syscalls, seems natural for completeness.
- underlying in-kernel infrastructure already supports it, so no major 
changes are needed.
- there exists at least one major application class (databases) that uses 
vectored I/O heavily and benefits from async I/O. Hence async vectored 
I/O is also likely to be useful. Can anyone else with experience on 
other OS's comment on this?

Comments, reasons for not doing async readv/writev directly welcome.

- Shailabh




* Re: [RFC] adding aio_readv/writev
  2002-09-23 14:30   ` Shailabh Nagar
@ 2002-09-23 18:53     ` Clement T. Cole
       [not found]     ` <20020923114104.A11680@redhat.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Clement T. Cole @ 2002-09-23 18:53 UTC (permalink / raw)
  To: Shailabh Nagar
  Cc: Stephen Hemminger, Benjamin LaHaise, Andrew Morton,
	Alexander Viro, linux-aio, linux-kernel

>>Comments, reasons for not doing async readv/writev directly welcome.

How about the case for it...  See Pages 404-406 [Section 12.7] of
Richard Stevens's ``Advanced Programming in the Unix Environment''
[aka APUE].  Richard measures almost a factor of 2 difference
in system time between using vectored I/O and not using it on
a Sun and on an x86.

>>- there exists at least one major application class (databases)
>>  that uses vectored I/O heavily and benefits from async I/O.
>>  Hence async vectored I/O is also likely to be useful. Can anyone
>>  else with experience on other OS's comment on this ?

			....  a number of other comments/arguments from other
			....  responses removed to get to the meat of the discussion.

Be careful out there.....

Large commercial applications such as Oracle DB, IBM's DB2 or
Netscape Enterprise server for that matter - are very modular in
their interface to the OS because they have to be: they have been
ported and tuned to run on a number of different OS's and HW
architectures. When I see someone say something about ``Oracle''
doing X or Y - I get a little worried.

Which version, which port etc...  e.g. Oracle's DB running on VMS
has an I/O system interface that is different from any of
its Unix implementations...... oh yes - was it clustered or
not... did it have the X package etc...

The point is that the UNIX implementations of Oracle DB vary
widely.....  This is also true of every >>major<< application package
I have worked on, consulted on, or seen some of the insides of (SAP,
Informix, Netscape, etc...).

Solaris and Tru64 [and I would expect AIX, HP-UX
etc... but I only know these two personally] each offer
a highly parallel I/O, asynchronous (but proprietary) interface.
Oracle's Sun group (or the old DEC group) exploit the >>private<<
interfaces -- to make the code work better - they do.

That may or may not be what you have seen on some ``simple Un*x''
port - which is a starting point for them - that's not the code
they ship on the high end revenue systems.

Oracle/IBM/Netscape etc... do this because they want to grab customers
from their competitors (DB2, Informix, etc.)... they invest in
using the best interfaces available.... if they are available
AND if they can help them sell more copies of their product.


So... let's get back to the basic issue....

We know that vectored/scatter gather I/O can help a number of real
applications ... Richard demonstrated that.  We have some examples
[like DB2] that have used vectored I/O successfully.  We also
know asynchronous I/O has been demonstrated to be useful and
know that some commercial folks have used that.

I gather from some of the comments that adding async/vectored
will make an already complex subsystem even more so [i.e. not
a resounding endorsement that this is easy].

So the question is: can async vectored I/O be implemented
to have a positive gain, as vectored I/O did in the traditional case?
If the complexity is too high and it does not help much...then
maybe this is a Chimera to leave alone.   But.... if it can be
done with some level of elegance... well.... the past history is
that the commercial folks have used those features.

I know this sounds a little bit like:
	``if you build it - they will come.'' 

But I would say it's more to this point:
	 ``if you build it and this new feature shows some real value
	   AND the application can exploit it ...  in time, they will
	   because if they don't their competitors will.''

Clem Cole


* Re: [RFC] adding aio_readv/writev
       [not found]     ` <20020923114104.A11680@redhat.com>
@ 2002-09-24 13:20       ` John Gardiner Myers
  2002-09-24 13:52         ` Stephen C. Tweedie
  0 siblings, 1 reply; 9+ messages in thread
From: John Gardiner Myers @ 2002-09-24 13:20 UTC (permalink / raw)
  To: linux-aio; +Cc: linux-kernel




Benjamin LaHaise wrote:

>Only db2 uses vectored io heavily.  Oracle does not, and none of the open 
>source databases do.  Vectored io is pretty useless for most people.
>  
>
writev is extremely important for networking as it avoids small packets.

Why do people have such tunnel vision around aio to disk?  Aio to 
network is far more important, as networks are much slower than disks.



* Re: [RFC] adding aio_readv/writev
  2002-09-24 13:20       ` John Gardiner Myers
@ 2002-09-24 13:52         ` Stephen C. Tweedie
  2002-09-24 14:13           ` John Gardiner Myers
  0 siblings, 1 reply; 9+ messages in thread
From: Stephen C. Tweedie @ 2002-09-24 13:52 UTC (permalink / raw)
  To: John Gardiner Myers; +Cc: linux-aio, linux-kernel

Hi,

On Tue, Sep 24, 2002 at 06:20:45AM -0700, John Gardiner Myers wrote:
 
> Benjamin LaHaise wrote:
> 
> >Only db2 uses vectored io heavily.  Oracle does not, and none of the open 
> >source databases do.  Vectored io is pretty useless for most people.
> >  
> >
> writev is extremely important for networking as it avoids small packets.

No, all you can infer from that is that "some method for avoiding
small packets is important for networking."  TCP_CORK already does
that in Linux, for tcp at least, without requiring writev.  (Of
course, normal nonblocking writev is still there if you want it.)
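
For reference, a minimal sketch of toggling TCP_CORK, assuming a Linux system. The helper names `set_cork` and `get_cork` are made up for illustration; the `setsockopt`/`getsockopt` calls and the `TCP_CORK` option are the standard Linux interface.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable (on = 1) or disable (on = 0) corking on a TCP socket. */
static int set_cork(int fd, int on)
{
	assert(fd >= 0);
	return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Return the current TCP_CORK setting, or -1 on error. */
static int get_cork(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	assert(fd >= 0);
	if (getsockopt(fd, IPPROTO_TCP, TCP_CORK, &val, &len) != 0)
		return -1;
	return val;
}
```

An application would typically call set_cork(fd, 1), issue several small write() calls, and then call set_cork(fd, 0) so that the queued data goes out in full-sized segments.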

--Stephen


* Re: [RFC] adding aio_readv/writev
  2002-09-24 13:52         ` Stephen C. Tweedie
@ 2002-09-24 14:13           ` John Gardiner Myers
  0 siblings, 0 replies; 9+ messages in thread
From: John Gardiner Myers @ 2002-09-24 14:13 UTC (permalink / raw)
  To: linux-aio; +Cc: linux-kernel




Stephen C. Tweedie wrote:

>No, all you can infer from that is that "some method for avoiding
>small packets is important for networking."  TCP_CORK already does
>that in Linux, for tcp at least, without requiring writev.  (Of
>course, normal nonblocking writev is still there if you want it.)
>
TCP_CORK is indeed effective for avoiding small packets.  Be that as it 
may, the source data for network writes are frequently in discontiguous 
buffers and writev is nonetheless still important for networking.  The 
alternative in the aio model is to waste a lot of resources delivering 
io completions the application doesn't care about.
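
As an illustration of the discontiguous-buffer case, here is a small sketch using the synchronous writev() call: a header and a body that live in separate buffers go out in one call with one result. The helper name `send_gathered` is hypothetical; the async equivalent is what this thread is asking for.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write a header and a body held in separate buffers with a single
 * writev() call. Returns the number of bytes written, or -1 on error.
 */
static ssize_t send_gathered(int fd, const char *hdr, const char *body)
{
	struct iovec iov[2];

	assert(hdr != NULL && body != NULL);
	/* writev() does not modify the buffers; the cast drops const only
	 * because iov_base is declared as void *. */
	iov[0].iov_base = (void *)hdr;
	iov[0].iov_len = strlen(hdr);
	iov[1].iov_base = (void *)body;
	iov[1].iov_len = strlen(body);
	return writev(fd, iov, 2);
}
```

With per-buffer aio requests, the same transfer would produce two completions, of which the application only cares about the last.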



* Re: [RFC] adding aio_readv/writev
  2002-09-23 19:52 ` Shailabh Nagar
@ 2002-09-23 20:39   ` Clement T. Cole
  0 siblings, 0 replies; 9+ messages in thread
From: Clement T. Cole @ 2002-09-23 20:39 UTC (permalink / raw)
  To: Shailabh Nagar
  Cc: Stephen Hemminger, Benjamin LaHaise, Andrew Morton,
	Alexander Viro, linux-aio, linux-kernel

>>It would have been nice to have corresponding data for the async path.
Agreed... I'll let you know if I learn anything.  When Richard wrote
APUE, aio was not defined by Posix.  There were only the select/poll
and O_*SYNC flag hacks from BSD and SVR4.  I don't think you will
learn much from that.

As I said, many/most of the commercial Un*x folks added their own
proprietary (and slightly different) version of aio years ago.   Then
they agreed on the Posix interface and most [if not all] have offered
those.  Most of the major ISV's that used their proprietary ones
have switched to or are in the process of switching to the Posix
interface for simplicity [if they could - there are sometimes reasons
why they can not - not always technical reasons BTW].

I personally started to monitor this mailing list because I was
interested in Ben's work on aio for Linux and what I'm researching
needs to follow what Linux is doing in this area.

For whatever it's worth to this list: I have local implementations
of the Posix async I/O for a Sun and *BSD.  I am trying to get my hands
on an Alpha and SVR5 [<-- bits secured for the latter but no HW at
the moment to try it].  If you have any aio test cases, let me know.
As I do my research, if I can learn anything useful I'll be willing
to pass it on if you think it will help.

I'm currently thinking up/trying some examples and there are some
worrisome issues with the Posix spec IMHO.  I know that you
folks are not trying to be Posix compliant - which is both
a blessing and curse.

In my case, I need to follow Posix, since that's
what the ISVs really use as their guide.  My assumption is that
there will be a mapping layer between your final interface and
the Posix interface.  I can offer any extensions as needed/appropriate
if I can show that it helps [which in this case it might].

Clem


* Re: [RFC] adding aio_readv/writev
       [not found] <200209231851.g8NIpea12782@igw2.watson.ibm.com>
@ 2002-09-23 19:52 ` Shailabh Nagar
  2002-09-23 20:39   ` Clement T. Cole
  0 siblings, 1 reply; 9+ messages in thread
From: Shailabh Nagar @ 2002-09-23 19:52 UTC (permalink / raw)
  To: clemc
  Cc: Stephen Hemminger, Benjamin LaHaise, Andrew Morton,
	Alexander Viro, linux-aio, linux-kernel

Clement T. Cole wrote:

>>>Comments, reasons for not doing async readv/writev directly welcome.
>>>
>
>How about the case for it...  See Pages 404-406 [Section 12.7] of
>Richard Stevens's ``Advanced Programming in the Unix Environment''
>[aka APUE].  Richard measures almost a factor of 2 difference
>in system time between using vectored I/O and not using it on
>a Sun and on an x86.
>
It would have been nice to have corresponding data for the async path.

><snip>
>
>So... let's get back to the basic issue....
>
>We know that vectored/scatter gather I/O can help a number of real
>applications ... Richard demonstrated that.  We have some examples
>[like DB2] that have used vectored I/O successfully.  We also
>know asynchronous I/O has been demonstrated to be useful and
>know that some commercial folks have used that.
>
>I gather from some of the comments that adding async/vectored
>will make an already complex subsystem even more so [i.e. not
>a resounding endorsement that this is easy].
>

I wouldn't say so. Adding async vectored I/O to the 2.5 code won't make 
it more complex since the underlying functions
do handle iovec's anyway.

>
>
>So the question is: can async vectored I/O be implemented
>to have a positive gain, as vectored I/O did in the traditional case?
>If the complexity is too high and it does not help much...then
>maybe this is a Chimera to leave alone.   But.... if it can be
>done with some level of elegance... well.... the past history is
>that the commercial folks have used those features.
>

It seems to be a case of "complexity is low, benefits are unknown". I 
guess the best thing is to develop a patch and see what people think 
about the complexity part. The benefits part will become clear only when 
the async interfaces are reasonably functional and we can compare the 
following:

- call async readv directly
vs
- multiple calls to io_submit using one iocb (each call corresponds to 
one element of user's vector)
vs
- single call to io_submit using multiple iocb's (each iocb corresponds 
to one element of user's vector)

Since the raw/O_DIRECT interfaces offer asynchrony (through Badari 
Pulavarty & Mingming Cao's patches), it should be possible to test this 
out.

More on this shortly,
- Shailabh



* RE: [RFC] adding aio_readv/writev
@ 2002-09-23 17:59 Chen, Kenneth W
  0 siblings, 0 replies; 9+ messages in thread
From: Chen, Kenneth W @ 2002-09-23 17:59 UTC (permalink / raw)
  To: Benjamin LaHaise, Shailabh Nagar
  Cc: Stephen Hemminger, Andrew Morton, Alexander Viro, linux-aio,
	linux-kernel

ben> Only db2 uses vectored io heavily.  Oracle does not, and none of the
ben> open source databases do.  Vectored io is pretty useless for most
ben> people.

That's not necessarily true. As far as I know, the reason Oracle doesn't use
vectored io is because the real implementation is not there.

- Ken
