* [RFC] adding aio_readv/writev
@ 2002-09-20 20:39 Shailabh Nagar
       [not found] ` <1032555981.2082.10.camel@dell_ss3.pdx.osdl.net>
  0 siblings, 1 reply; 9+ messages in thread

From: Shailabh Nagar @ 2002-09-20 20:39 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Andrew Morton, Alexander Viro, linux-aio, linux-kernel, lse-tech

Ben,

Currently there is no way to initiate an aio readv/writev in 2.5. There were no aio_readv/writev calls in 2.4 either - I'm wondering if there was any particular reason for excluding readv/writev operations from aio?

The read/readv paths have already been merged for raw/O_DIRECT and regular file reads/writes. So why not expose the vectored read/write to the user by adding the IOCB_CMD_PREADV/IOCB_CMD_READV and IOCB_CMD_PWRITEV/IOCB_CMD_WRITEV commands to the aio set? Without that, raw/O_DIRECT readv users would need to unnecessarily cycle through their iovecs at the library level, submitting them individually. For larger iovecs, user/library code would needlessly deal with multiple completions. While I'm not sure of the performance impact of the absence of aio_readv/writev, it seems easy enough to provide. Most of the functions are already in place. We would only need a way to pass the iovec through the iocb.

I was thinking of something like this:

struct iocb {
	...
+	union {
		__u64	aio_buf;
+		__u64	aio_iovp;
+	};
+	union {
		__u64	aio_nbytes;
+		__u64	aio_nsegs;
+	};
	...
};

allowing the iovec pointer & nsegs to be passed into sys_io_submit. Some code would be added (within the case handling of IOCB_CMD_READV in io_submit_one) to copy & verify the iovec pointers and then call aio_readv/aio_writev (if it's defined for the fs).

What do you think? I wanted to get some feedback before trying to code this up.

While we are on the topic of expanding aio operations, what about providing IOCB_CMD_READ/WRITE, distinct from their pread/pwrite counterparts? Do you think that's needed?

- Shailabh

^ permalink raw reply	[flat|nested] 9+ messages in thread
[parent not found: <1032555981.2082.10.camel@dell_ss3.pdx.osdl.net>]
* Re: [RFC] adding aio_readv/writev
       [not found] ` <1032555981.2082.10.camel@dell_ss3.pdx.osdl.net>
@ 2002-09-23 14:30   ` Shailabh Nagar
  2002-09-23 18:53     ` Clement T. Cole
       [not found]      ` <20020923114104.A11680@redhat.com>
  0 siblings, 2 replies; 9+ messages in thread

From: Shailabh Nagar @ 2002-09-23 14:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Benjamin LaHaise, Andrew Morton, Alexander Viro, linux-aio, linux-kernel

Stephen Hemminger wrote:

>Why not batch up multiple requests with one io_submit? It has the same
>effect, except there would be multiple responses.
>

Even though the multiple iocbs enter the kernel together, they still get processed individually, so a fair amount of unnecessary data transmission and function invocation still occurs in the submit code path. Depending on how long io_submit_one takes to return, there may also be a reduced probability of the I/O requests being merged at the I/O scheduler.

Finally, the multiple responses need to be handled, as you mentioned. I suppose the application could wait for the last request (in the io_submit list), and that would most probably ensure that the preceding ones were complete as well, but that's not a guarantee offered by the aio API, right? Besides, the application needs the data (represented by the multiple requests) in one go, so partial completion isn't likely to be useful and will only be overhead.

While a quantitative assessment of the above tradeoffs is possible, it will be difficult to make a good comparison before "true" aio functionality is in place for 2.5. Such an assessment is unlikely to happen before the feature freeze takes effect. So I'm making the case for putting async vectored I/O interfaces in, for the following three reasons:

- the synchronous API does provide separate entry points for vectored I/O. Extending the same to the async interfaces, especially when it doesn't even involve creating new syscalls, seems natural for completeness.
- underlying in-kernel infrastructure already supports it, so no major changes are needed.
- there exists at least one major application class (databases) that uses vectored I/O heavily and benefits from async I/O. Hence async vectored I/O is also likely to be useful. Can anyone else with experience on other OSes comment on this?

Comments, and reasons for not doing async readv/writev directly, welcome.

- Shailabh

>
>On Fri, 2002-09-20 at 13:39, Shailabh Nagar wrote:
>
>>Ben,
>>
>>Currently there is no way to initiate an aio readv/writev in 2.5. There
>>were no aio_readv/writev calls in 2.4 either - I'm wondering if there
>>was any particular reason for excluding readv/writev operations from aio ?
>>
>>The read/readv paths have anyway been merged for raw/O_DIRECT and
>>regular file read/writes. So why not expose the vector read/write to the
>>user by adding the IOCB_CMD_PREADV/IOCB_CMD_READV and
>>IOCB_CMD_PWRITEV/IOCB_CMD_WRITEV commands to the aio set. Without that,
>>raw/O_DIRECT readv users would need to unnecessarily cycle through their
>>iovecs at a library level submitting them individually.
>>For larger iovecs, user/library code would needlessly deal with multiple
>>completions. While I'm not sure of the performance impact of the absence
>>of aio_readv/writev, it seems easy enough to provide.
>>Most of the functions are already in place. We would only
>>need a way to pass the iovec through the iocb.
>>
>>I was thinking of something like this:
>>
>>struct iocb {
>>
>>+union {
>> __u64 aio_buf
>>+ __u64 aio_iovp
>>+}
>>+union {
>> __u64 aio_nbytes
>>+ __u64 aio_nsegs
>>+}
>>
>>allowing the iovec * & nsegs to be passed into sys_io_submit. Some code
>>would be added (within case handling of IOCB_CMD_READV within
>>io_submit_one) to copy & verify the iovec pointers and then call
>>aio_readv/aio_writev (if its defined for the fs).
>>
>>What do you think ? I wanted to get some feedback before trying to code
>>this up.
>>
>>While we are on the topic of expanding aio operations, what about
>>providing IOCB_CMD_READ/WRITE, distinct from their pread/pwrite
>>counterparts ? Do you think thats needed ?
>>
>>- Shailabh
>>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC] adding aio_readv/writev
  2002-09-23 14:30 ` Shailabh Nagar
@ 2002-09-23 18:53   ` Clement T. Cole
       [not found]    ` <20020923114104.A11680@redhat.com>
  1 sibling, 0 replies; 9+ messages in thread

From: Clement T. Cole @ 2002-09-23 18:53 UTC (permalink / raw)
  To: Shailabh Nagar
  Cc: Stephen Hemminger, Benjamin LaHaise, Andrew Morton, Alexander Viro, linux-aio, linux-kernel

>>Comments, reasons for not doing async readv/writev directly welcome.

How about the case for it... See pages 404-406 [Section 12.7] of Richard Stevens' ``Advanced Programming in the Unix Environment'' [aka APUE]. Richard measures almost a factor of 2 difference in system time between using vectored I/O and not using it, on a Sun and on an x86.

>>- there exists at least one major application class (databases)
>>  that uses vectored I/O heavily and benefits from async I/O.
>>  Hence async vectored I/O is also likely to be useful. Can anyone
>>  else with experience on other OS's comment on this ?

.... a number of other comments/arguments from other responses removed to get to the meat of the discussion.

Be careful out there..... Large commercial applications such as Oracle DB, IBM's DB2, or Netscape Enterprise Server for that matter, are very modular in their interface to the OS, because they have to be - they have been ported and tuned to run on a number of different OSes and HW architectures. When I see someone say something about ``Oracle'' doing X or Y, I get a little worried. Which version, which port, etc.? E.g. Oracle's DB running on VMS has an I/O system interface that is different from any of its Unix implementations... oh yes - was it clustered or not, did it have the X package, etc. The point is that the Unix implementations of Oracle DB vary widely. This is also true of every >>major<< application package whose insides I have worked on, consulted on, or seen (SAP, Informix, Netscape, etc.).

Solaris and Tru64 [and I would expect AIX, HP-UX, etc.,
but I only know these two personally] each offer a highly parallel, asynchronous (but proprietary) I/O interface. Oracle's Sun group (or the old DEC group) exploit the >>private<< interfaces to make the code work better - they do. That may or may not be what you have seen on some ``simple Un*x'' port - which is a starting point for them - that's not the code they ship on the high-end revenue systems. Oracle/IBM/Netscape etc. do this because they want to grab customers from their competitors (DB2, Informix, etc.) - they invest in using the best interfaces available... if they are available AND if they can help them sell more copies of their product.

So... let's get back to the basic issue. We know that vectored/scatter-gather I/O can help a number of real applications - Richard demonstrated that. We have some examples [like DB2] that have used vectored I/O successfully. We also know asynchronous I/O has been demonstrated to be useful, and know that some commercial folks have used that. I gather from some of the comments that adding async/vectored I/O will make an already complex subsystem even more so [i.e. not a resounding endorsement that this is easy].

So the question is: can async vectored I/O be implemented to have a positive gain, as it did within the traditional, synchronous interface? If the complexity is too high and it does not help much... then maybe this is a chimera to leave alone. But... if it can be done with some level of elegance... well... the past history is that the commercial folks have used those features.

I know this sounds a little bit like: ``if you build it - they will come.'' But I would say it's more to this point: ``if you build it, and this new feature shows some real value AND the application can exploit it... in time, they will - because if they don't, their competitors will.''

Clem Cole

^ permalink raw reply	[flat|nested] 9+ messages in thread
[parent not found: <20020923114104.A11680@redhat.com>]
* Re: [RFC] adding aio_readv/writev [not found] ` <20020923114104.A11680@redhat.com> @ 2002-09-24 13:20 ` John Gardiner Myers 2002-09-24 13:52 ` Stephen C. Tweedie 0 siblings, 1 reply; 9+ messages in thread From: John Gardiner Myers @ 2002-09-24 13:20 UTC (permalink / raw) To: linux-aio; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 392 bytes --] Benjamin LaHaise wrote: >Only db2 uses vectored io heavily. Oracle does not, and none of the open >source databases do. Vectored io is pretty useless for most people. > > writev is extremely important for networking as it avoids small packets. Why do people have such tunnel vision around aio to disk? Aio to network is far more important, as networks are much slower than disks. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/x-pkcs7-signature, Size: 3537 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] adding aio_readv/writev 2002-09-24 13:20 ` John Gardiner Myers @ 2002-09-24 13:52 ` Stephen C. Tweedie 2002-09-24 14:13 ` John Gardiner Myers 0 siblings, 1 reply; 9+ messages in thread From: Stephen C. Tweedie @ 2002-09-24 13:52 UTC (permalink / raw) To: John Gardiner Myers; +Cc: linux-aio, linux-kernel Hi, On Tue, Sep 24, 2002 at 06:20:45AM -0700, John Gardiner Myers wrote: > Benjamin LaHaise wrote: > > >Only db2 uses vectored io heavily. Oracle does not, and none of the open > >source databases do. Vectored io is pretty useless for most people. > > > > > writev is extremely important for networking as it avoids small packets. No, all you can infer from that is that "some method for avoiding small packets is important for networking." TCP_CORK already does that in Linux, for tcp at least, without requiring writev. (Of course, normal nonblocking writev is still there if you want it.) --Stephen ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] adding aio_readv/writev 2002-09-24 13:52 ` Stephen C. Tweedie @ 2002-09-24 14:13 ` John Gardiner Myers 0 siblings, 0 replies; 9+ messages in thread From: John Gardiner Myers @ 2002-09-24 14:13 UTC (permalink / raw) To: linux-aio; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 640 bytes --] Stephen C. Tweedie wrote: >No, all you can infer from that is that "some method for avoiding >small packets is important for networking." TCP_CORK already does >that in Linux, for tcp at least, without requiring writev. (Of >course, normal nonblocking writev is still there if you want it.) > TCP_CORK is indeed effective for avoiding small packets. Be that as it may, the source data for network writes are frequently in discontiguous buffers and writev is nonetheless still important for networking. The alternative in the aio model is to waste a lot of resources delivering io completions the application doesn't care about. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/x-pkcs7-signature, Size: 3537 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [RFC] adding aio_readv/writev
@ 2002-09-23 17:59 Chen, Kenneth W
  0 siblings, 0 replies; 9+ messages in thread

From: Chen, Kenneth W @ 2002-09-23 17:59 UTC (permalink / raw)
  To: Benjamin LaHaise, Shailabh Nagar
  Cc: Stephen Hemminger, Andrew Morton, Alexander Viro, linux-aio, linux-kernel

ben> Only db2 uses vectored io heavily. Oracle does not, and none of the open
ben> source databases do. Vectored io is pretty useless for most people.

That's not necessarily true. As far as I know, the reason Oracle doesn't use vectored I/O is because the real implementation is not there.

- Ken

^ permalink raw reply	[flat|nested] 9+ messages in thread
[parent not found: <200209231851.g8NIpea12782@igw2.watson.ibm.com>]
* Re: [RFC] adding aio_readv/writev
       [not found] <200209231851.g8NIpea12782@igw2.watson.ibm.com>
@ 2002-09-23 19:52 ` Shailabh Nagar
  2002-09-23 20:39   ` Clement T. Cole
  0 siblings, 1 reply; 9+ messages in thread

From: Shailabh Nagar @ 2002-09-23 19:52 UTC (permalink / raw)
  To: clemc
  Cc: Stephen Hemminger, Benjamin LaHaise, Andrew Morton, Alexander Viro, linux-aio, linux-kernel

Clement T. Cole wrote:

>>>Comments, reasons for not doing async readv/writev directly welcome.
>>>
>
>How about the case for it... See Pages 404-406 [Section 12.7] of
>Richard Steven's ``Advanced Programming in the Unix Environment''
>[aka APUE]. Richard measures almost a factor of 2 difference
>in system time between using vectored I/O and not using it on
>a Sun and on a x86.
>

It would have been nice to have corresponding data for the async path.

><snip>
>
>So... let's get back to the basic issue....
>
>We know that vectored/scatter gather I/O can help a number of real
>applications ... Richard demonstrated that. We have some examples
>[like DB2] that have use vectored I/O successfully. We also
>know asynchronous I/O has been demonstrated to be useful and
>know that some commerical folks have used that.
>
>I'm gather from some of the comments, adding async/vectored
>will make an already complex subsystem, even more so [i.e. not
>a resounding endorsement for sure this is easy].
>

I wouldn't say so. Adding async vectored I/O to the 2.5 code won't make it more complex, since the underlying functions do handle iovec's anyway.

>So the question is can async vectored I/O be implemented
>to have a positive gain, such as it did within the traditonal one.
>If the complexity is too high and it does not help much...then
>maybe this is a Chimera to leave alone. But.... if it can be
>done with some level of elegance... well.... the past history is
>that the commerical folks have used those features.
>

It seems to be a case of "complexity is low, benefits are unknown".
I guess the best thing is to develop a patch and see what people think about the complexity part. The benefits part will become clear only when the async interfaces are reasonably functional and we can compare the following:

- calling async readv directly, vs.
- multiple calls to io_submit using one iocb each (each call corresponds to one element of the user's vector), vs.
- a single call to io_submit using multiple iocbs (each iocb corresponds to one element of the user's vector)

Since the raw/O_DIRECT interfaces offer asynchrony (through Badari Pulavarty & Mingming Cao's patches), it should be possible to test this out.

More on this shortly,

- Shailabh

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [RFC] adding aio_readv/writev
  2002-09-23 19:52 ` Shailabh Nagar
@ 2002-09-23 20:39   ` Clement T. Cole
  0 siblings, 0 replies; 9+ messages in thread

From: Clement T. Cole @ 2002-09-23 20:39 UTC (permalink / raw)
  To: Shailabh Nagar
  Cc: Stephen Hemminger, Benjamin LaHaise, Andrew Morton, Alexander Viro, linux-aio, linux-kernel

>>It would have been nice to have corresponding data for the async path.

Agreed... I'll let you know if I learn anything. When Richard wrote APUE, aio was not defined by POSIX - only select/poll and the turning-on-the-O_*SYNC-flags hacks from BSD and SVR4. I don't think you will learn much from that. As I said, many/most of the commercial Un*x folks added their own proprietary (and slightly different) versions of aio years ago. Then they agreed on the POSIX interface, and most [if not all] have offered those. Most of the major ISVs that used the proprietary ones have switched, or are in the process of switching, to the POSIX interface for simplicity [if they could - there are sometimes reasons why they cannot - not always technical reasons, BTW].

I personally started to monitor this mailing list because I was interested in Ben's work on aio for Linux, and what I'm researching needs to follow what Linux is doing in this area. For whatever it's worth to this list: I have local implementations of the POSIX async I/O for a Sun and *BSD. I'm trying to get my hands on an Alpha and SVR5 [<-- bits secured for the latter but no HW at the moment to try it]. If you have any aio test cases, let me know. As I do my research, if I learn anything useful I'll be willing to pass it on if you think it will help.

I'm currently thinking up/trying some examples, and there are some worrisome issues with the POSIX spec IMHO. I know that you folks are not trying to be POSIX compliant - which is both a blessing and a curse. In my case, I need to follow POSIX, since that's what the ISVs really use as their guide.
My assumption is that there will be a mapping layer between your final interface and the POSIX interface. I can offer any extensions as needed/appropriate if I can show that they help [which in this case they might].

Clem

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2002-09-24 14:08 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-20 20:39 [RFC] adding aio_readv/writev Shailabh Nagar
     [not found] ` <1032555981.2082.10.camel@dell_ss3.pdx.osdl.net>
2002-09-23 14:30   ` Shailabh Nagar
2002-09-23 18:53     ` Clement T. Cole
     [not found]       ` <20020923114104.A11680@redhat.com>
2002-09-24 13:20         ` John Gardiner Myers
2002-09-24 13:52           ` Stephen C. Tweedie
2002-09-24 14:13             ` John Gardiner Myers
2002-09-23 17:59 Chen, Kenneth W
     [not found] <200209231851.g8NIpea12782@igw2.watson.ibm.com>
2002-09-23 19:52 ` Shailabh Nagar
2002-09-23 20:39   ` Clement T. Cole