* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
@ 2001-02-01 14:44 bsuparna
  2001-02-01 15:09 ` Christoph Hellwig
  0 siblings, 1 reply; 186+ messages in thread

From: bsuparna @ 2001-02-01 14:44 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: linux-kernel, kiobuf-io-devel

>Hi,
>
>On Thu, Feb 01, 2001 at 10:25:22AM +0530, bsuparna@in.ibm.com wrote:
>>
>> >We _do_ need the ability to stack completion events, but as far as the
>> >kiobuf work goes, my current thoughts are to do that by stacking
>> >lightweight "clone" kiobufs.
>>
>> Would that work with stackable filesystems?
>
>Only if the filesystems were using VFS interfaces which used kiobufs.
>Right now, the only filesystem using kiobufs is XFS, and it only
>passes them down to the block device layer, not to other filesystems.

That would require the VFS interfaces themselves (the address space
readpage/writepage ops) to take kiobufs as arguments, instead of
struct page *. That's not the case right now, is it? To take this
example, a filter filesystem would be layered over XFS; right now the
filter filesystem only sees the struct page * and passes this along,
so any completion event stacking has to be applied with reference to
that.

>> Being able to track the children of a kiobuf would help with I/O
>> cancellation (e.g. to pull sub-ios off their request queues if I/O
>> cancellation for the parent kiobuf was issued). Not essential, I
>> guess, in general, but useful in some situations.
>
>What exactly is the justification for IO cancellation? It really
>upsets the normal flow of control through the IO stack to have
>voluntary cancellation semantics.

One reason that I saw is that if the results of an i/o are no longer
required due to some condition (e.g. aio cancellation situations, or if
the process that issued the i/o gets killed), then this avoids the
unnecessary disk i/o, if the request hadn't been scheduled as yet. Too
remote a requirement? If the capability/support doesn't exist at the
driver level, I guess it's difficult.

--Stephen

_______________________________________________
Kiobuf-io-devel mailing list
Kiobuf-io-devel@lists.sourceforge.net
http://lists.sourceforge.net/lists/listinfo/kiobuf-io-devel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Christoph Hellwig @ 2001-02-01 15:09 UTC (permalink / raw)
To: bsuparna; +Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> >Hi,
> >
> >On Thu, Feb 01, 2001 at 10:25:22AM +0530, bsuparna@in.ibm.com wrote:
> >>
> >> >We _do_ need the ability to stack completion events, but as far as the
> >> >kiobuf work goes, my current thoughts are to do that by stacking
> >> >lightweight "clone" kiobufs.
> >>
> >> Would that work with stackable filesystems?
> >
> >Only if the filesystems were using VFS interfaces which used kiobufs.
> >Right now, the only filesystem using kiobufs is XFS, and it only
> >passes them down to the block device layer, not to other filesystems.
>
> That would require the vfs interfaces themselves (address space
> readpage/writepage ops) to take kiobufs as arguments, instead of struct
> page *. That's not the case right now, is it?

No, and with the current kiobufs it would not make sense, because they
are too heavy-weight. With (page, length, offset) iobufs this makes
sense and is IMHO the way to go.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Steve Lord @ 2001-02-01 16:08 UTC (permalink / raw)
To: hch, linux-kernel, kiobuf-io-devel

Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> >
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page *. That's not the case right now, is it?
>
> No, and with the current kiobufs it would not make sense, because they
> are too heavy-weight. With (page, length, offset) iobufs this makes
> sense and is IMHO the way to go.
>
> Christoph

Enquiring minds would like to know if you are working towards this
revamp of the kiobuf structure at the moment; you have been very quiet
recently.

  Steve
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-01 16:49 UTC (permalink / raw)
To: Steve Lord; +Cc: hch, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 10:08:45AM -0600, Steve Lord wrote:
> Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> > >
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of
> > > struct page *. That's not the case right now, is it?
> >
> > No, and with the current kiobufs it would not make sense, because they
> > are too heavy-weight. With (page, length, offset) iobufs this makes
> > sense and is IMHO the way to go.
>
> Enquiring minds would like to know if you are working towards this
> revamp of the kiobuf structure at the moment, you have been very quiet
> recently.

I'm in the middle of some parts of it, and am actively soliciting
feedback on what cleanups are required.

I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and have
started doing some of the cleanups needed: removing the embedded page
vector, and adding support for lightweight stacking of kiobufs for
completion callback chains.

However, filesystem IO is almost *always* page aligned: O_DIRECT IO
comes from VM pages, and internal filesystem IO comes from page cache
pages. Buffer cache IOs are the only exception, and kiobufs only fail
for such IOs once you have multiple buffer_heads being merged into
single requests.

So, what are the benefits in the disk IO stack of adding length/offset
pairs to each page of the kiobuf? Basically, the only advantage right
now is that it would allow us to merge requests together without having
to chain separate kiobufs.

However, chaining kiobufs in this case is actually much better than
merging them if the original IOs came in as kiobufs: merging kiobufs
requires us to reallocate a new, longer (page/offset/len) vector,
whereas chaining kiobufs is just a list operation.

Having true scatter-gather lists in the kiobuf would let us represent
arbitrary lists of buffer_heads as a single kiobuf, though, and that
_is_ a big win if we can avoid using buffer_heads below the ll_rw_block
layer at all. (It's not clear that this is really possible, though,
since we still need to propagate completion information back up into
each individual buffer head's status and wait queue.)

Cheers,
 Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Christoph Hellwig @ 2001-02-01 17:02 UTC (permalink / raw)
To: Stephen C. Tweedie; +Cc: Steve Lord, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 04:49:58PM +0000, Stephen C. Tweedie wrote:
> > Enquiring minds would like to know if you are working towards this
> > revamp of the kiobuf structure at the moment, you have been very quiet
> > recently.
>
> I'm in the middle of some parts of it, and am actively soliciting
> feedback on what cleanups are required.

The real issue is that Linus dislikes the current kiobuf scheme. I do
not like everything he proposes, but lots of things make sense.

> I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and
> have started doing some of the cleanups needed: removing the
> embedded page vector, and adding support for lightweight stacking of
> kiobufs for completion callback chains.

Ok, great.

> However, filesystem IO is almost *always* page aligned: O_DIRECT IO
> comes from VM pages, and internal filesystem IO comes from page cache
> pages. Buffer cache IOs are the only exception, and kiobufs only fail
> for such IOs once you have multiple buffer_heads being merged into
> single requests.
>
> So, what are the benefits in the disk IO stack of adding length/offset
> pairs to each page of the kiobuf?

I don't see any real advantage for disk IO. The real advantage is that
we can have a generic structure that is also useful in e.g. networking
and can lead to a unified IO buffering scheme (a little like IO-Lite).

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Alan Cox @ 2001-02-01 17:34 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel

> > I'm in the middle of some parts of it, and am actively soliciting
> > feedback on what cleanups are required.
>
> The real issue is that Linus dislikes the current kiobuf scheme.
> I do not like everything he proposes, but lots of things make sense.

Linus basically designed the original kiobuf scheme, of course, so I
guess he's allowed to dislike it. Linus disliking something, however,
doesn't mean it's wrong; it's not a technically valid basis for
argument. Linus's list of reasons, like the amount of state, is more
interesting.

> > So, what are the benefits in the disk IO stack of adding length/offset
> > pairs to each page of the kiobuf?
>
> I don't see any real advantage for disk IO. The real advantage is that
> we can have a generic structure that is also useful in e.g. networking
> and can lead to a unified IO buffering scheme (a little like IO-Lite).

Networking wants something lighter rather than heavier. Adding tons of
base/limit pairs to kiobufs makes it worse, not better.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-01 17:49 UTC (permalink / raw)
To: Alan Cox
Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> >
> > I don't see any real advantage for disk IO. The real advantage is that
> > we can have a generic structure that is also useful in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
>
> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse, not better.

Networking has fundamentally different requirements. In a network
stack, you want the ability to add fragments to unaligned chunks of
data to represent headers at any point in the stack.

In the disk IO case, you basically don't get that (the only thing which
comes close is raid5 parity blocks). The data which the user started
with is the data sent out on the wire. You do get some interesting
cases such as soft raid and LVM, or even in the scsi stack if you run
out of mailbox space, where you need to send only a sub-chunk of the
input buffer.

In that case, having offset/len as the kiobuf limit markers is ideal:
you can clone a kiobuf header using the same page vector as the parent,
narrow down the start/end points, and continue down the stack without
having to copy any part of the page list. If you had the offset/len
data encoded implicitly into each entry in the sglist, you would not be
able to do that.

--Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Chaitanya Tumuluri @ 2001-02-01 17:09 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Alan Cox, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > >
> > > I don't see any real advantage for disk IO. The real advantage is that
> > > we can have a generic structure that is also useful in e.g. networking
> > > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> >
> > Networking wants something lighter rather than heavier. Adding tons of
> > base/limit pairs to kiobufs makes it worse, not better.
>
> Networking has fundamentally different requirements. In a network
> stack, you want the ability to add fragments to unaligned chunks of
> data to represent headers at any point in the stack.
>
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks). The data which the user
> started with is the data sent out on the wire. You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer.

Or the case of BSD-style UIO implementing the readv() and writev()
calls. This may or may not align perfectly, so address-length lists per
page could be helpful. I did try an implementation of this for rawio
and found that I had to restrict the address-length lists coming in via
the user iovecs to be aligned.

> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list. If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

This would solve the issue with UIO, yes.

Also, I think Martin Petersen (mkp) had taken a stab at doing
"clone-kiobufs" for LVM at some point. Martin?

Cheers,
-Chait.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Christoph Hellwig @ 2001-02-01 20:33 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

In article <20010201174946.B11607@redhat.com> you wrote:
> Hi,
>
> On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
>
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks). The data which the user
> started with is the data sent out on the wire. You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer.

Though your description is right, I don't think the case is very
common: sometimes in LVM on a pv boundary, and maybe sometimes in the
scsi code. In raid1 you need some kind of clone iobuf, which should
work with both cases. In raid0 you need a complete new pagelist anyway,
same for raid5.

> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list. If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

Sure you could: you embed that information in a higher-level structure.

I think you want the whole kio concept only for disk-like IO. Then many
of the things you do are completely right and I don't see many problems
(besides thinking that some things may go away, but that's no major
point). With a generic object that is used across subsystem boundaries,
things are different.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Steve Lord @ 2001-02-01 20:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

> In article <20010201174946.B11607@redhat.com> you wrote:
> > Hi,
> >
> > On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks). The data which the user
> > started with is the data sent out on the wire. You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer.
>
> Though your description is right, I don't think the case is very
> common: sometimes in LVM on a pv boundary, and maybe sometimes in the
> scsi code.

And if you are writing to a striped volume via a filesystem which can
do its own I/O clustering, e.g. I throw 500 pages at LVM in one go and
LVM is striped on 64K boundaries.

  Steve
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Christoph Hellwig @ 2001-02-01 20:59 UTC (permalink / raw)
To: Steve Lord; +Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox

On Thu, Feb 01, 2001 at 02:56:47PM -0600, Steve Lord wrote:
> And if you are writing to a striped volume via a filesystem which can
> do its own I/O clustering, e.g. I throw 500 pages at LVM in one go and
> LVM is striped on 64K boundaries.

But usually I want to have pages 0-63, 128-191, etc. together, because
they are contiguous on disk, no?

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Steve Lord @ 2001-02-01 21:17 UTC (permalink / raw)
To: Steve Lord, Stephen C. Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox

> On Thu, Feb 01, 2001 at 02:56:47PM -0600, Steve Lord wrote:
> > And if you are writing to a striped volume via a filesystem which can
> > do its own I/O clustering, e.g. I throw 500 pages at LVM in one go
> > and LVM is striped on 64K boundaries.
>
> But usually I want to have pages 0-63, 128-191, etc. together, because
> they are contiguous on disk, no?

I was just giving an example of how kiobufs might need splitting up
more often than you think; crossing a stripe boundary is one obvious
case. Yes, you do want to keep the pages which are contiguous on disk
together, but you will often get requests which cover multiple stripes,
otherwise you don't really get much out of stripes and may as well just
concatenate drives.

Ideally the file is striped across the various disks in the volume, and
one large write (direct or from the cache) gets scattered across the
disks. All the I/Os run in parallel (and on different controllers if
you have the budget).

  Steve
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-01 21:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
>
> > On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks). The data which the user
> > started with is the data sent out on the wire. You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer.
>
> Though your description is right, I don't think the case is very
> common: sometimes in LVM on a pv boundary, and maybe sometimes in the
> scsi code.

On raid0 stripes, it's common to have stripes of between 16k and 64k,
so it's rather more common there than you'd like. In any case, you need
the code to handle it, and I don't want to make the code paths any more
complex than necessary.

> In raid1 you need some kind of clone iobuf, which should work with
> both cases. In raid0 you need a complete new pagelist anyway.

No you don't. You take the existing one, specify which region of it is
going to the current stripe, and send it off. Nothing more.

> > In that case, having offset/len as the kiobuf limit markers is ideal:
> > you can clone a kiobuf header using the same page vector as the
> > parent, narrow down the start/end points, and continue down the stack
> > without having to copy any part of the page list. If you had the
> > offset/len data encoded implicitly into each entry in the sglist, you
> > would not be able to do that.
>
> Sure you could: you embed that information in a higher-level structure.

What's the point in a common data container structure if you need
higher-level information to make any sense out of it?

--Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-01 22:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox, Linus Torvalds

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
> I think you want the whole kio concept only for disk-like IO.

No. I want something good for zero-copy IO in general, but a lot of
that concerns the problem of interacting with the user, and the basic
center of that interaction in 99% of the interesting cases is either a
user VM buffer or the page cache, all of which are page-aligned.

If you look at the sorts of models being proposed (even by Linus) for
splice, you get

	len = prepare_read();
	prepare_write();
	pull_fd();
	commit_write();

in which the read is being pulled into a known location in the page
cache; it's page-aligned, again. I'm perfectly willing to accept that
there may be a need for scatter-gather boundaries including
non-page-aligned fragments in this model, but I can't see one if you're
using the page cache as a mediator, nor if you're doing it through a
user-mmapped buffer.

The only reason you need finer scatter-gather boundaries, and it may be
a compelling reason, is if you are merging multiple IOs together into a
single device-level IO. That makes perfect sense for the zerocopy tcp
case where you're doing MSG_MORE-type coalescing. It doesn't help the
existing SGI kiobuf block device code, because that performs its
merging in the filesystem layers and the block device code just squirts
the IOs to the wire as-is, but if we want to start merging those
kiobuf-based IOs within make_request() then the block device layer may
want it too.

And Linus is right, the old way of using a *kiobuf[] for that was
painful, but the solution of adding start/length to every entry in the
page vector just doesn't sit right with many components of the block
device environment either.

I may still be persuaded that we need the full scatter-gather list
fields throughout, but for now I tend to think that, at least in the
disk layers, we may get cleaner results by allowing linked lists of
page-aligned kiobufs instead. That allows for merging of kiobufs
without having to copy all of the vector information each time.

The killer, however, is what happens if you want to split such a merged
kiobuf. Right now, that's something that I can only imagine happening
in the block layers if we start encoding buffer_head chains as kiobufs,
but if we do that in the future, or if we start merging genuine kiobuf
requests, then doing that split later on (for raid0 etc.) may require
duplicating whole chains of kiobufs. At that point, just doing
scatter-gather lists is cleaner.

But for now, the way to picture what I'm trying to achieve is that
kiobufs are a bit like buffer_heads: they represent the physical pages
of some VM object that a higher layer has constructed, such as the page
cache or a user VM buffer. You can chain these objects together for IO,
but that doesn't stop the individual objects from being separate
entities with independent IO completion callbacks to be honoured.

Cheers,
 Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Christoph Hellwig @ 2001-02-02 12:02 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox, Linus Torvalds

On Thu, Feb 01, 2001 at 10:07:44PM +0000, Stephen C. Tweedie wrote:
> No. I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache, all of which are page-aligned.

Yes.

> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
>
>	len = prepare_read();
>	prepare_write();
>	pull_fd();
>	commit_write();

Yepp.

> in which the read is being pulled into a known location in the page
> cache; it's page-aligned, again. I'm perfectly willing to accept
> that there may be a need for scatter-gather boundaries including
> non-page-aligned fragments in this model, but I can't see one if
> you're using the page cache as a mediator, nor if you're doing it
> through a user-mmapped buffer.

True.

> The only reason you need finer scatter-gather boundaries, and it may
> be a compelling reason, is if you are merging multiple IOs together
> into a single device-level IO. That makes perfect sense for the
> zerocopy tcp case where you're doing MSG_MORE-type coalescing. It
> doesn't help the existing SGI kiobuf block device code, because that
> performs its merging in the filesystem layers and the block device
> code just squirts the IOs to the wire as-is,

Yes, but that is no solution for a generic model. AFAICS even XFS falls
back to buffer_heads for small requests.

> but if we want to start
> merging those kiobuf-based IOs within make_request() then the block
> device layer may want it too.

Yes.

> And Linus is right, the old way of using a *kiobuf[] for that was
> painful, but the solution of adding start/length to every entry in
> the page vector just doesn't sit right with many components of the
> block device environment either.

What do you think is the alternative?

> I may still be persuaded that we need the full scatter-gather list
> fields throughout, but for now I tend to think that, at least in the
> disk layers, we may get cleaner results by allowing linked lists of
> page-aligned kiobufs instead. That allows for merging of kiobufs
> without having to copy all of the vector information each time.

But it will have the same problems as the array solution: there will be
one complete kio structure for each kiobuf, with its own end_io
callback, etc.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-05 12:19 UTC
To: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox, Linus Torvalds

Hi,

On Fri, Feb 02, 2001 at 01:02:28PM +0100, Christoph Hellwig wrote:
>
> > I may still be persuaded that we need the full scatter-gather list
> > fields throughout, but for now I tend to think that, at least in the
> > disk layers, we may get cleaner results by allowing linked lists of
> > page-aligned kiobufs instead.  That allows for merging of kiobufs
> > without having to copy all of the vector information each time.
>
> But it will have the same problems as the array solution: there will
> be one complete kio structure for each kiobuf, with its own end_io
> callback, etc.

And what's the problem with that?  You *need* this.  You have to have
that multiple-completion concept in the disk layers.  Think about
chains of buffer_heads being sent to disk as a single IO --- you need
to know which buffers make it to disk successfully and which had IO
errors.

And no, the IO success is *not* necessarily sequential from the start
of the IO: if you are doing IO to raid0, for example, and the IO gets
striped across two disks, you might find that the first disk gets an
error so the start of the IO fails but the rest completes.  It's the
completion code which notifies the caller of what worked and what did
not.

And for readahead, you want to notify the caller as early as possible
about completion for the first part of the IO, even if the device
driver is still processing the rest.

Multiple completions are a necessary feature of the current block
device interface.  Removing that would be a step backwards.

Cheers,
 Stephen
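Stephen's raid0 example is easy to model.  The sketch below is plain
userspace C with hypothetical names — not the 2.4 kiobuf API — showing
the multiple-completion idea he is defending: a parent IO fans out into
per-disk children, each child reports its own status in any order, and
the parent's callback fires only once the last child finishes, with
per-range results intact.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative userspace model of per-sub-IO completion.
 * All names here are hypothetical, not kernel interfaces. */

enum io_status { IO_PENDING = -1, IO_OK = 0, IO_ERROR = 1 };

struct child_io {
	size_t offset, len;		/* byte range within the parent IO */
	enum io_status status;
};

struct parent_io {
	struct child_io child[2];	/* e.g. a raid0 stripe pair */
	int remaining;			/* children still in flight */
	void (*end_io)(struct parent_io *);
};

/* Called by each lower-level driver as its piece finishes; order is
 * arbitrary, so the second disk may complete before the first fails. */
static void child_complete(struct parent_io *p, int i, enum io_status s)
{
	p->child[i].status = s;
	if (--p->remaining == 0)
		p->end_io(p);	/* caller now sees which ranges made it */
}
```

The point of the sketch is that the caller's callback receives the full
per-range status vector, not just a single success/failure bit.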
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Ingo Molnar @ 2001-02-05 21:28 UTC
To: Stephen C. Tweedie
Cc: Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox, Linus Torvalds

On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> And no, the IO success is *not* necessarily sequential from the start
> of the IO: if you are doing IO to raid0, for example, and the IO gets
> striped across two disks, you might find that the first disk gets an
> error so the start of the IO fails but the rest completes.  It's the
> completion code which notifies the caller of what worked and what did
> not.

it's exactly these 'compound' structures i'm vehemently against.  I do
think it's a design nightmare.  I can picture these monster kiobufs
complicating the whole code for no good reason - we couldn't even get
the bh-list code in block_device.c right - why do you think kiobufs
*all across the kernel* will be any better?

RAID0 is not an issue.  Split it up, use separate kiobufs for every
different disk.

We need simple constructs - i do not understand why nobody sees that
these big fat monster-trucks of IO workload are *trouble*.  They keep
things localized, instead of putting workload components into the
system immediately.  We'll have performance bugs nobody has seen
before.

bhs have one very nice property: they are simple, modularized.  I think
this is like CISC vs. RISC: CISC designs ended up splitting 'fat
instructions' up into RISC-like instructions.

fragmented skbs are a different matter: they are simply a bit more
generic abstractions of 'memory buffer'.  Clear goal, clear solution.
I do not think kiobufs have clear goals.

and i do not buy the performance arguments.  In 2.4.1 we improved
block-IO performance dramatically by fixing high-load IO scheduling.
Write performance suddenly improved dramatically, there is a 30-40%
improvement in dbench performance.  To put it another way: *we needed
5 years to fix a serious IO-subsystem performance bug*.  Block IO was
already too complex - and Alex & Andrea have done a nice job
streamlining and cleaning it up for 2.4.  We should simplify it further
- and optimize the components, instead of bringing in yet another *big*
complication into the API.

and what is the goal of having multi-page kiobufs?  To avoid having to
do multiple function calls via a simpler interface?  Shouldn't we
optimize that codepath instead?

	Ingo
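Ingo's "split it up, use separate kiobufs for every different disk"
amounts to stripe arithmetic.  A minimal sketch, assuming a two-disk
raid0 with a fixed chunk size; the names, constants, and the pure
userspace form are all illustrative, not raid0 driver code:

```c
#include <assert.h>
#include <stddef.h>

#define NDISKS 2
#define CHUNK  4096u		/* stripe chunk size in bytes (assumed) */

struct seg { unsigned disk, disk_off, len; };

/* Split the byte range [off, off+len) into chunk-bounded segments,
 * each targeting exactly one disk.  Returns the segment count. */
static int raid0_split(unsigned off, unsigned len, struct seg *out)
{
	int n = 0;
	while (len) {
		unsigned chunk_no = off / CHUNK;
		unsigned in_chunk = off % CHUNK;
		unsigned run = CHUNK - in_chunk;	/* stay inside one chunk */
		if (run > len)
			run = len;
		out[n].disk     = chunk_no % NDISKS;
		out[n].disk_off = (chunk_no / NDISKS) * CHUNK + in_chunk;
		out[n].len      = run;
		n++;
		off += run;
		len -= run;
	}
	return n;
}
```

Each segment would then become one per-disk sub-IO; what the split does
not answer, and what Stephen keeps pressing on, is how the completions
of those segments get reported back to the original caller.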
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-05 22:58 UTC
To: Ingo Molnar
Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox, Linus Torvalds

Hi,

On Mon, Feb 05, 2001 at 10:28:37PM +0100, Ingo Molnar wrote:
>
> it's exactly these 'compound' structures i'm vehemently against.  I do
> think it's a design nightmare.  I can picture these monster kiobufs
> complicating the whole code for no good reason - we couldn't even get
> the bh-list code in block_device.c right - why do you think kiobufs
> *all across the kernel* will be any better?
>
> RAID0 is not an issue.  Split it up, use separate kiobufs for every
> different disk.

Umm, that's not the point --- of course you can use separate kiobufs
for the communication between raid0 and the underlying disks, but what
do you then tell the application _above_ raid0 if one of the underlying
IOs succeeds and the other fails halfway through?

And what about raid1?  Are you really saying that raid1 doesn't need to
know which blocks succeeded and which failed?  That's the level of
completion information I'm worrying about at the moment.

> fragmented skbs are a different matter: they are simply a bit more
> generic abstractions of 'memory buffer'.  Clear goal, clear solution.
> I do not think kiobufs have clear goals.

The goal: allow arbitrary IOs to be pushed down through the stack in
such a way that the callers can get meaningful information back about
what worked and what did not.  If the write was a 128kB raw IO, then
you obviously get coarse granularity of completion callback.  If the
write was a series of independent pages which happened to be contiguous
on disk, you actually get told which pages hit disk and which did not.

> and what is the goal of having multi-page kiobufs?  To avoid having to
> do multiple function calls via a simpler interface?  Shouldn't we
> optimize that codepath instead?

The original multi-page buffers came from the map_user_kiobuf
interface: they represented a user data buffer.  I'm not wedded to that
format --- we can happily replace it with a fine-grained sg list ---
but the reason they have been pushed so far down the IO stack is the
need for accurate completion information on the originally requested
IOs.

In other words, even if we expand the kiobuf into a sg vector list,
when it comes to merging requests in ll_rw_blk.c we still need to track
the callbacks on each independent source kiobuf.

--Stephen
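The requirement in Stephen's last paragraph — merged requests that
still complete per source — can be sketched in miniature.  This is
illustrative userspace C with made-up names, not ll_rw_blk code; it
assumes the device reports how many leading bytes of the merged IO were
transferred successfully, so each merged source learns individually
whether its data made it.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per original caller merged into the device request. */
struct source {
	size_t len;				/* contribution to the merged IO */
	void (*end_io)(struct source *, int uptodate);
};

struct request {
	struct source *src;
	int nr_src;
};

/* On completion, walk the merged sources in order and fire each one's
 * own callback, marking it up-to-date only if its whole range lies
 * within the successfully transferred prefix. */
static void request_complete(struct request *rq, size_t ok_bytes)
{
	size_t pos = 0;
	for (int i = 0; i < rq->nr_src; i++) {
		struct source *s = &rq->src[i];
		s->end_io(s, pos + s->len <= ok_bytes);
		pos += s->len;
	}
}
```

A prefix-only error model is itself a simplification — Stephen's raid0
example shows failures need not be sequential — but the per-source
callback walk is the part the merged request cannot do without.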
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Alan Cox @ 2001-02-05 23:06 UTC
To: Stephen C. Tweedie
Cc: Ingo Molnar, Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox, Linus Torvalds

> do you then tell the application _above_ raid0 if one of the
> underlying IOs succeeds and the other fails halfway through?

struct
{
	u32 flags;		/* because everything needs flags */
	struct io_completion *completions;
	kiovec_t sglist[0];
} thingy;

Now kmalloc one object containing the header, an sglist of the right
size, and the completion list.  Shove the completion list on the end of
it as another array of objects and what is the problem?

> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobuf.

But that can be two arrays
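The single-allocation trick Alan describes can be modelled in
userspace, with malloc standing in for kmalloc and stand-in definitions
for kiovec_t and struct io_completion.  The `[0]`-length array idiom he
uses is written here as a C99 flexible array member; everything else
follows his layout: header, then the sg list, then a matching array of
completion slots, all in one block.

```c
#include <assert.h>
#include <stdlib.h>
#include <stdint.h>

/* Stand-in types; the real kiovec_t and io_completion differ. */
typedef struct { void *page; uint16_t offset, length; } kiovec_t;
struct io_completion { int status; };

struct thingy {
	uint32_t flags;
	struct io_completion *completions;	/* points into the same block */
	kiovec_t sglist[];			/* C99 form of sglist[0] */
};

/* One allocation for header + nr sg entries + nr completion slots. */
static struct thingy *thingy_alloc(int nr)
{
	struct thingy *t = malloc(sizeof(*t)
				  + nr * sizeof(kiovec_t)
				  + nr * sizeof(struct io_completion));
	if (t)
		t->completions = (struct io_completion *)&t->sglist[nr];
	return t;
}
```

One malloc, one free, and the completions array sits immediately after
the sg list — which is exactly the "shove it on the end" Alan means.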
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-05 23:16 UTC
To: Alan Cox
Cc: Stephen C. Tweedie, Ingo Molnar, Steve Lord, linux-kernel, kiobuf-io-devel, Linus Torvalds

Hi,

On Mon, Feb 05, 2001 at 11:06:48PM +0000, Alan Cox wrote:
> > do you then tell the application _above_ raid0 if one of the
> > underlying IOs succeeds and the other fails halfway through?
>
> struct
> {
>	u32 flags;		/* because everything needs flags */
>	struct io_completion *completions;
>	kiovec_t sglist[0];
> } thingy;
>
> now kmalloc one object of the header the sglist of the right size and
> the completion list.  Shove the completion list on the end of it as
> another array of objects and what is the problem.

XFS uses both small metadata items in the buffer cache and large
pagebufs.  You may have merged a 512-byte read with a large pagebuf
read: one completion callback is associated with a single sg fragment,
the next callback belongs to a dozen different fragments.  Associating
the two lists becomes non-trivial, although it could be done.

--Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Manfred Spraul @ 2001-02-06 0:19 UTC
To: Stephen C. Tweedie
Cc: Ingo Molnar, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox, Linus Torvalds

"Stephen C. Tweedie" wrote:
>
> The original multi-page buffers came from the map_user_kiobuf
> interface: they represented a user data buffer.  I'm not wedded to
> that format --- we can happily replace it with a fine-grained sg list

Could you change that interface?

<<< from Linus' mail:
struct buffer {
	struct page *page;
	u16 offset, length;
};
>>>

/* returns the number of used buffers, or <0 on error */
int map_user_buffer(struct buffer *ba, int max_bcount,
		void *addr, int len);
void unmap_buffer(struct buffer *ba, int bcount);

That's enough for the zero copy pipe code ;-)

Real hw drivers probably need a replacement for pci_map_single()
(pci_map_and_align_and_bounce_buffer_array())

The kiobuf structure could contain these 'struct buffer' instead of the
current 'struct page' pointers.

> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobuf.

Probably.

--
	Manfred
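Manfred's proposed signature can be fleshed out with the page-splitting
arithmetic it implies.  The body below is an illustrative userspace
sketch only — a real kernel implementation would pin the user pages and
fill in struct page pointers, while here the "page" field just holds
the page-aligned base address and nothing is ever dereferenced.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u			/* assumed, as on i386 */

struct buffer { void *page; uint16_t offset, length; };

/* Returns the number of used buffers, or <0 on error: split the
 * arbitrary byte range [addr, addr+len) into per-page entries. */
static int map_user_buffer(struct buffer *ba, int max_bcount,
			   void *addr, unsigned len)
{
	uintptr_t a = (uintptr_t)addr;
	int n = 0;

	while (len) {
		unsigned off = a % PAGE_SIZE;
		unsigned run = PAGE_SIZE - off;	/* stay within one page */
		if (run > len)
			run = len;
		if (n == max_bcount)
			return -1;		/* vector too small */
		ba[n].page   = (void *)(a - off);
		ba[n].offset = (uint16_t)off;
		ba[n].length = (uint16_t)run;
		n++;
		a += run;
		len -= run;
	}
	return n;
}
```

Note how an unaligned buffer naturally produces a short first fragment
and a short last one — the {page, offset, length} triple absorbs the
misalignment without any copying.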
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Linus Torvalds @ 2001-02-03 20:28 UTC
To: Stephen C. Tweedie
Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
>
> On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
>
> > I think you want the whole kio concept only for disk-like IO.
>
> No.  I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache --- all of which are page-aligned.
>
> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
>
>	len = prepare_read();
>	prepare_write();
>	pull_fd();
>	commit_write();
>
> in which the read is being pulled into a known location in the page
> cache -- it's page-aligned, again.

Wrong.

Neither the read nor the write are page-aligned.  I don't know where
you got that idea.  It's obviously not true even in the common case: it
depends _entirely_ on what the file offsets are, and expecting the
offset to be zero is just being stupid.  It's often _not_ zero.  With
networking it is in fact seldom zero, because the network packets are
seldom aligned either in size or in location.

Also, there are many reasons why "page" may have different meaning.  We
will eventually have a page cache where the page-cache granularity is
not the same as the user-level visible one.  User-level may do mmap at
4kB boundaries, even if the page cache itself uses 8kB or 16kB pages.

THERE IS NO PAGE-ALIGNMENT.

And anything that even _mentions_ the word page-aligned is going into
my trash-can faster than you can say "bug".

		Linus
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-05 11:03 UTC
To: Linus Torvalds
Cc: Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

Hi,

On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
>
> Neither the read nor the write are page-aligned.  I don't know where
> you got that idea.  It's obviously not true even in the common case:
> it depends _entirely_ on what the file offsets are, and expecting the
> offset to be zero is just being stupid.  It's often _not_ zero.  With
> networking it is in fact seldom zero, because the network packets are
> seldom aligned either in size or in location.

The underlying buffer is.  The VFS (and the current kiobuf code) is
already happy about IO happening at odd offsets within a page.
However, the more general case --- doing zero-copy IO on arbitrary
unaligned buffers --- simply won't work if you expect to be able to
push those buffers to disk without a copy.

The splice case you talked about is fine because it's doing the normal
prepare/commit logic where the underlying buffer is page aligned, even
if the splice IO is not to a page aligned location.

That's _exactly_ what kiobufs were intended to support.  The
prepare_read/prepare_write/pull/push cycle lets the caller tell the
pull() function where to store its data, because there are alignment
constraints which just can't be ignored: you simply cannot do physical
disk IO on non-sector-aligned memory or in chunks which aren't a
multiple of sector size.

(The buffer address alignment can sometimes be relaxed --- obviously if
you're doing PIO then it doesn't matter --- but the length granularity
is rigidly enforced.)

Cheers,
 Stephen
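The constraint Stephen states is easy to express as a check.  The
helper below is entirely hypothetical (name, 512-byte sector size, and
userspace form all assumed); it encodes his distinction between the
rigid length-granularity rule and the usually-required but occasionally
relaxable address alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define SECTOR_SIZE 512u	/* assumed sector size */

struct frag { void *buf; size_t len; };

/* Hypothetical validation: every fragment of a disk IO must be a whole
 * number of sectors (rigid), and for DMA the memory should be
 * sector-aligned as well (often required, per Stephen's caveat about
 * PIO it can sometimes be relaxed).  Returns 1 if acceptable. */
static int disk_io_ok(const struct frag *v, int n)
{
	for (int i = 0; i < n; i++) {
		if (v[i].len == 0 || v[i].len % SECTOR_SIZE != 0)
			return 0;	/* length granularity violated */
		if ((uintptr_t)v[i].buf % SECTOR_SIZE != 0)
			return 0;	/* buffer not aligned for DMA */
	}
	return 1;
}
```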
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Manfred Spraul @ 2001-02-05 12:00 UTC
To: Stephen C. Tweedie
Cc: Linus Torvalds, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

"Stephen C. Tweedie" wrote:
>
> You simply cannot do physical disk IO on non-sector-aligned memory or
> in chunks which aren't a multiple of sector size.

Why not?

Obviously the disk access itself must be sector aligned and the total
length must be a multiple of the sector length, but there shouldn't be
any restrictions on the data buffers.

I remember that even Windoze 95 has scatter-gather support for physical
disk IO with arbitrary buffer chunks.  (If the hardware supports it,
otherwise the io subsystem will copy the data into a contiguous
temporary buffer.)

--
	Manfred
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-05 15:03 UTC
To: Manfred Spraul
Cc: Stephen C. Tweedie, Linus Torvalds, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

Hi,

On Mon, Feb 05, 2001 at 01:00:51PM +0100, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> >
> > You simply cannot do physical disk IO on non-sector-aligned memory
> > or in chunks which aren't a multiple of sector size.
>
> Why not?
>
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

But there are.  Many controllers just break down and corrupt things
silently if you don't align the data buffers (Jeff Merkey found this by
accident when he started generating unaligned IOs within page
boundaries in his NWFS code).  And a lot of controllers simply cannot
break a sector DMA over a page boundary (at least not without some form
of IOMMU remapping).

Yes, it's the sort of thing that you would hope should work, but in
practice it's not reliable.

Cheers,
 Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Alan Cox @ 2001-02-05 15:19 UTC
To: Stephen C. Tweedie
Cc: Manfred Spraul, Stephen C. Tweedie, Linus Torvalds, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

> Yes, it's the sort of thing that you would hope should work, but in
> practice it's not reliable.

So the less smart devices need to call something like

	kiovec_align(kiovec, 512);

and have it do the bounce buffers ?
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-05 17:20 UTC
To: Alan Cox
Cc: Stephen C. Tweedie, Manfred Spraul, Linus Torvalds, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, Feb 05, 2001 at 03:19:09PM +0000, Alan Cox wrote:
> > Yes, it's the sort of thing that you would hope should work, but in
> > practice it's not reliable.
>
> So the less smart devices need to call something like
>
>	kiovec_align(kiovec, 512);
>
> and have it do the bounce buffers ?

_All_ drivers would have to do that in the degenerate case, because
none of our drivers can deal with a DMA boundary in the middle of a
sector, and even in those places where the hardware supports it in
theory, you are still often limited to word-alignment.

--Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Alan Cox @ 2001-02-05 17:29 UTC
To: Stephen C. Tweedie
Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Linus Torvalds, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

> > kiovec_align(kiovec, 512);
> > and have it do the bounce buffers ?
>
> _All_ drivers would have to do that in the degenerate case, because
> none of our drivers can deal with a DMA boundary in the middle of a
> sector, and even in those places where the hardware supports it in
> theory, you are still often limited to word-alignment.

That's true for _block_ disk devices, but if we want a generic kiovec
then if I am going from video capture to network I don't need to force
anything more than 4-byte alignment
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Stephen C. Tweedie @ 2001-02-05 18:49 UTC
To: Alan Cox
Cc: Stephen C. Tweedie, Manfred Spraul, Linus Torvalds, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, Feb 05, 2001 at 05:29:47PM +0000, Alan Cox wrote:
>
> That's true for _block_ disk devices, but if we want a generic kiovec
> then if I am going from video capture to network I don't need to
> force anything more than 4-byte alignment

Kiobufs have never, ever required the IO to be aligned on any
particular boundary.  They simply make the assumption that the
underlying buffered object can be described in terms of pages with some
arbitrary (non-aligned) start/offset.  Every video framebuffer I've
ever seen satisfies that, so you can easily map an arbitrary contiguous
region of the framebuffer with a kiobuf already.

--Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Alan Cox @ 2001-02-05 19:04 UTC
To: Stephen C. Tweedie
Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Linus Torvalds, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary.  They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset.  Every video framebuffer

start/length per page ?

> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Video is non-contiguous ranges.  In fact if you are blitting to a card
with tiled memory it gets very interesting in its video lists
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

From: Linus Torvalds @ 2001-02-05 19:09 UTC
To: Stephen C. Tweedie
Cc: Alan Cox, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> > That's true for _block_ disk devices, but if we want a generic
> > kiovec then if I am going from video capture to network I don't
> > need to force anything more than 4-byte alignment
>
> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary.  They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset.  Every video framebuffer
> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Stop this idiocy, Stephen.  You're _this_ close to being the first
person I ever blacklist from my mailbox.

Network.  Packets.  Fragmentation.  Or just non-page-sized MTUs.

It is _not_ a "series of contiguous pages".  Never has been.  Never
will be.  So stop making excuses.

Also, think of protocols that may want to gather stuff from multiple
places, where the boundaries have little to do with pages but are
specified some other way.  Imagine doing "writev()" style operations to
disk, gathering stuff from multiple sources into one operation.

Think of GART remappings - you can have multiple pages that show up as
one "linear" chunk to the graphics device behind the AGP bridge, but
that are _not_ contiguous in real memory.

There just is NO excuse for the "linear series of pages" view.  And if
you cannot realize that, then I don't know what's wrong with you.

Your arguments are obviously crap, and the stuff you seem unable to
argue against (like networking) you decide to just ignore.  Get your
act together.

		Linus
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

From: Alan Cox @ 2001-02-05 19:16 UTC
To: Linus Torvalds
Cc: Stephen C. Tweedie, Alan Cox, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

> Stop this idiocy, Stephen.  You're _this_ close to being the first
> person I ever blacklist from my mailbox.

I think I've just figured out what the miscommunication is around here.
kiovecs can describe arbitrary scatter-gather; it's just that they can
also cleanly describe the common case of contiguous pages in one entry.
After all, a subpage block is simply a contiguous set of 1 page.

Alan
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

From: Linus Torvalds @ 2001-02-05 19:28 UTC
To: Alan Cox
Cc: Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

On Mon, 5 Feb 2001, Alan Cox wrote:
> > Stop this idiocy, Stephen.  You're _this_ close to being the first
> > person I ever blacklist from my mailbox.
>
> I think I've just figured out what the miscommunication is around here
>
> kiovecs can describe arbitrary scatter-gather

I know.  But they are entirely useless for anything that requires low
latency handling.  They are big, bloated, and slow.  It is also an
example of layering gone horribly horribly wrong.

The _vectors_ are needed at the very lowest levels: the levels that do
not necessarily have to worry at all about completion notification etc.
You want the arbitrary scatter-gather vectors passed down to the stuff
that sets up the SG arrays etc, the stuff that doesn't care AT ALL
about the high-level semantics.

This all proves that the lowest level of layering should be pretty much
nothing but the vectors.  No callbacks, no crap like that.  That's
already a level of abstraction away, and should not get tacked on.
Your lowest level of abstraction should be just the "area".  Something
like

	struct buffer {
		struct page *page;
		u16 offset, length;
	};

	int nr_buffers;
	struct buffer *array;

should be the low-level abstraction.  And on top of _that_ you build a
more complex entity (so a "kiobuf" would be defined not just by the
memory area, but by the operation you want to do on it, and the
callback on completion etc).

Currently kiobufs do it the other way around: you can build up an
array, but only by having the overhead of passing kiovecs around - i.e.
you have to pass the _highest_ level of abstraction around just to get
the lowest level of details.  That's wrong.  And that wrongness comes
_exactly_ from Stephen's opinion that the fundamental IO entity is an
array of contiguous pages.

And, btw, this is why the networking layer will never be able to use
kiobufs.  Which makes kiobufs as they stand now basically useless for
anything but some direct disk stuff.  And I'd rather work on making the
low-level disk drivers use something saner.

		Linus
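The layering Linus sketches — a bare vector at the bottom, semantics
stacked on top — can be made concrete.  This is an illustrative
userspace reading of his struct, not a proposed kernel API: the io_vec
and io_request names are invented, and `void *` stands in for
`struct page *` so the sketch compiles outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

struct buffer {			/* lowest level: just an "area" */
	void *page;		/* struct page * in Linus's version */
	uint16_t offset, length;
};

struct io_vec {			/* what SG setup code would consume */
	int nr_buffers;
	struct buffer *array;
};

struct io_request {		/* higher level: adds the semantics */
	struct io_vec vec;
	int write;		/* the operation */
	void (*complete)(struct io_request *, int status);
};

/* The low level touches only the vector, never the callback... */
static unsigned io_vec_bytes(const struct io_vec *v)
{
	unsigned total = 0;
	for (int i = 0; i < v->nr_buffers; i++)
		total += v->array[i].length;
	return total;
}

/* ...while completion notification lives entirely in the upper layer. */
static void io_done(struct io_request *rq, int status)
{
	rq->complete(rq, status);
}
```

The design point being illustrated: code like io_vec_bytes() can be
handed the vector alone, so drivers never carry the callback baggage
that Linus objects to in the current kiobuf.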
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-05 19:28 ` Linus Torvalds @ 2001-02-05 20:54 ` Stephen C. Tweedie 2001-02-05 21:08 ` David Lang ` (2 more replies) 2001-02-06 0:31 ` Roman Zippel 1 sibling, 3 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-05 20:54 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel Hi, On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote: > The _vectors_ are needed at the very lowest levels: the levels that do not > necessarily have to worry at all about completion notification etc. You > want the arbitrary scatter-gather vectors passed down to the stuff that > sets up the SG arrays etc, the stuff that doesn't care AT ALL about the > high-level semantics. OK, this is exactly where we have a problem: I can see too many cases where we *do* need to know about completion stuff at a fine granularity when it comes to disk IO (unlike network IO, where we can usually rely on a caller doing retransmit at some point in the stack). If we are doing readahead, we want completion callbacks raised as soon as possible on IO completions, no matter how many other IOs have been merged with the current one. More importantly though, when we are merging multiple page or buffer_head IOs in a request, we want to know exactly which buffer/page contents are valid and which are not once the IO completes. The current request struct's buffer_head list provides that quite naturally, but is a hugely heavyweight way of performing large IOs. What I'm really after is a way of sending IOs to make_request in such a way that if the caller provides an array of buffer_heads, it gets back completion information on each one, but if the IO is requested in large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then the request code can deal with it in those large chunks. 
What worries me is things like the soft raid1/5 code: pretending that we can skimp on the return information about which blocks were transferred successfully and which were not sounds like a really bad idea when you've got a driver which relies on that completion information in order to do intelligent error recovery. Cheers, Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-05 20:54 ` Stephen C. Tweedie @ 2001-02-05 21:08 ` David Lang 2001-02-05 21:51 ` Alan Cox 2001-02-06 0:07 ` Stephen C. Tweedie 2 siblings, 0 replies; 186+ messages in thread From: David Lang @ 2001-02-05 21:08 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel so you have two concepts in one here 1. SG items that can be more than a single page 2. a container for #1 that includes details for completion callbacks, etc it looks like Linus is objecting to having both in the same structure and then using that structure as your generic low-level bucket. define these as two separate structures; the #1 structure may now be lightweight enough to be used for networking and other functions, and when you go to use it with disk IO you then wrap it in the #2 structure. this still lets you have the completion callbacks at as low a level as you want, you just have to explicitly add this layer when it makes sense. David Lang On Mon, 5 Feb 2001, Stephen C. Tweedie wrote: > Date: Mon, 5 Feb 2001 20:54:29 +0000 > From: Stephen C. Tweedie <sct@redhat.com> > To: Linus Torvalds <torvalds@transmeta.com> > Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>, Stephen C. Tweedie <sct@redhat.com>, > Manfred Spraul <manfred@colorfullife.com>, > Christoph Hellwig <hch@caldera.de>, Steve Lord <lord@sgi.com>, > linux-kernel@vger.kernel.org, kiobuf-io-devel@lists.sourceforge.net > Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait > > Hi, > > On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote: > > > The _vectors_ are needed at the very lowest levels: the levels that do not > > necessarily have to worry at all about completion notification etc.
You > > want the arbitrary scatter-gather vectors passed down to the stuff that > > sets up the SG arrays etc, the stuff that doesn't care AT ALL about the > > high-level semantics. > > OK, this is exactly where we have a problem: I can see too many cases > where we *do* need to know about completion stuff at a fine > granularity when it comes to disk IO (unlike network IO, where we can > usually rely on a caller doing retransmit at some point in the stack). > > If we are doing readahead, we want completion callbacks raised as soon > as possible on IO completions, no matter how many other IOs have been > merged with the current one. More importantly though, when we are > merging multiple page or buffer_head IOs in a request, we want to know > exactly which buffer/page contents are valid and which are not once > the IO completes. > > The current request struct's buffer_head list provides that quite > naturally, but is a hugely heavyweight way of performing large IOs. > What I'm really after is a way of sending IOs to make_request in such > a way that if the caller provides an array of buffer_heads, it gets > back completion information on each one, but if the IO is requested in > large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then > the request code can deal with it in those large chunks. > > What worries me is things like the soft raid1/5 code: pretending that > we can skimp on the return information about which blocks were > transferred successfully and which were not sounds like a really bad > idea when you've got a driver which relies on that completion > information in order to do intelligent error recovery. 
> > Cheers, > Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-05 20:54 ` Stephen C. Tweedie 2001-02-05 21:08 ` David Lang @ 2001-02-05 21:51 ` Alan Cox 2001-02-06 0:07 ` Stephen C. Tweedie 2 siblings, 0 replies; 186+ messages in thread From: Alan Cox @ 2001-02-05 21:51 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Linus Torvalds, Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel > OK, this is exactly where we have a problem: I can see too many cases > where we *do* need to know about completion stuff at a fine > granularity when it comes to disk IO (unlike network IO, where we can > usually rely on a caller doing retransmit at some point in the stack). Ok, so what's wrong with embedding kiovecs into something bigger? One kmalloc can allocate two arrays: one of buffers (shared with networking etc) followed by a second of block io completion data. Now you can also kind of cast from the bigger to the smaller object and get the right result, if the kiovec array is the start of the combined allocation.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-05 20:54 ` Stephen C. Tweedie 2001-02-05 21:08 ` David Lang 2001-02-05 21:51 ` Alan Cox @ 2001-02-06 0:07 ` Stephen C. Tweedie 2001-02-06 17:00 ` Christoph Hellwig 2 siblings, 1 reply; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-06 0:07 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar Hi, OK, if we take a step back what does this look like: On Mon, Feb 05, 2001 at 08:54:29PM +0000, Stephen C. Tweedie wrote: > > If we are doing readahead, we want completion callbacks raised as soon > as possible on IO completions, no matter how many other IOs have been > merged with the current one. More importantly though, when we are > merging multiple page or buffer_head IOs in a request, we want to know > exactly which buffer/page contents are valid and which are not once > the IO completes. This is the current situation. If the page cache submits a 64K IO to the block layer, it does so in pieces, and then expects to be told on return exactly which pages succeeded and which failed. That's where the mess of having multiple completion objects in a single IO request comes from. Can we just forbid this case? That's the short cut that SGI's kiobuf block dev patches do when they get kiobufs: they currently deal with either buffer_heads or kiobufs in struct requests, but they don't merge kiobuf requests. (XFS already clusters the IOs for them in that case.) Is that a realistic basis for a cleaned-up ll_rw_blk.c? It implies that the caller has to do IO merging. For read, that's not much pain, as the most important case --- readahead --- is already done in a generic way which could submit larger IOs relatively easily. It would be harder for writes, but high-level write clustering code has already been started. 
It implies that for any IO, on IO failure you don't get told which part of the IO failed. That adds code to the caller: the page cache would have to retry per-page to work out which pages are readable and which are not. It means that for soft raid, you don't get told which blocks are bad if a stripe has an error anywhere. Ingo, is that a potential problem? But it gives very, very simple semantics to the request layer: single IOs go in (with a completion callback and a single scatter-gather list), and results go back with success or failure. With that change, it becomes _much_ more natural to push a simple sg list down through the disk layers. --Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 0:07 ` Stephen C. Tweedie @ 2001-02-06 17:00 ` Christoph Hellwig 2001-02-06 17:05 ` Stephen C. Tweedie 0 siblings, 1 reply; 186+ messages in thread From: Christoph Hellwig @ 2001-02-06 17:00 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote: > This is the current situation. If the page cache submits a 64K IO to > the block layer, it does so in pieces, and then expects to be told on > return exactly which pages succeeded and which failed. > > That's where the mess of having multiple completion objects in a > single IO request comes from. Can we just forbid this case? > > That's the short cut that SGI's kiobuf block dev patches do when they > get kiobufs: they currently deal with either buffer_heads or kiobufs > in struct requests, but they don't merge kiobuf requests. IIRC Jens Axboe has done some work on merging kiobuf-based requests. > (XFS already clusters the IOs for them in that case.) > > Is that a realistic basis for a cleaned-up ll_rw_blk.c? I don't think so. If we minimize the state in the IO container object, the lower levels could split them at their guess and the IO completion function just has to handle the case that it might be called for a smaller object. Christoph -- Of course it doesn't work. We've performed a software upgrade. Whip me. Beat me. Make me maintain AIX.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 17:00 ` Christoph Hellwig @ 2001-02-06 17:05 ` Stephen C. Tweedie 2001-02-06 17:14 ` Jens Axboe ` (2 more replies) 0 siblings, 3 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-06 17:05 UTC (permalink / raw) To: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar Hi, On Tue, Feb 06, 2001 at 06:00:58PM +0100, Christoph Hellwig wrote: > On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote: > > > > Is that a realistic basis for a cleaned-up ll_rw_blk.c? > > I don't think so. If we minimize the state in the IO container object, > the lower levels could split them at their guess and the IO completion > function just has to handle the case that it might be called for a smaller > object. The whole point of the post was that it is merging, not splitting, which is troublesome. How are you going to merge requests without having chains of scatter-gather entities each with their own completion callbacks? --Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 17:05 ` Stephen C. Tweedie @ 2001-02-06 17:14 ` Jens Axboe 2001-02-06 17:22 ` Christoph Hellwig 2001-02-06 17:37 ` Ben LaHaise 2 siblings, 0 replies; 186+ messages in thread From: Jens Axboe @ 2001-02-06 17:14 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar On Tue, Feb 06 2001, Stephen C. Tweedie wrote: > > I don't think so. If we minimize the state in the IO container object, > > the lower levels could split them at their guess and the IO completion > > function just has to handle the case that it might be called for a smaller > > object. > > The whole point of the post was that it is merging, not splitting, > which is troublesome. How are you going to merge requests without > having chains of scatter-gather entities each with their own > completion callbacks? You can't; the stuff I played with turned out to be horrible. At least with the current kiobuf I/O stuff, merging will have to be done before it's submitted. And IMO we don't want to lose the ability to cluster buffers and requests in ll_rw_blk. -- Jens Axboe
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 17:05 ` Stephen C. Tweedie 2001-02-06 17:14 ` Jens Axboe @ 2001-02-06 17:22 ` Christoph Hellwig 2001-02-06 18:26 ` Stephen C. Tweedie 2001-02-06 17:37 ` Ben LaHaise 2 siblings, 1 reply; 186+ messages in thread From: Christoph Hellwig @ 2001-02-06 17:22 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote: > The whole point of the post was that it is merging, not splitting, > which is troublesome. How are you going to merge requests without > having chains of scatter-gather entities each with their own > completion callbacks? The object passed down to the low-level driver just needs to be able to contain multiple end-io callbacks. The decision what to call when some of the scatter-gather entities fail is of course not so easy to handle and needs further discussion. Christoph -- Whip me. Beat me. Make me maintain AIX.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 17:22 ` Christoph Hellwig @ 2001-02-06 18:26 ` Stephen C. Tweedie 0 siblings, 0 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-06 18:26 UTC (permalink / raw) To: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar Hi, On Tue, Feb 06, 2001 at 06:22:58PM +0100, Christoph Hellwig wrote: > On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote: > > The whole point of the post was that it is merging, not splitting, > > which is troublesome. How are you going to merge requests without > > having chains of scatter-gather entities each with their own > > completion callbacks? > > The object passed down to the low-level driver just needs to be able > to contain multiple end-io callbacks. The decision what to call when > some of the scatter-gather entities fail is of course not so easy to > handle and needs further discussion. Umm, and if you want the separate higher-level IOs to be told which IOs succeeded and which ones failed on error, you need to associate each of the multiple completion callbacks with its particular scatter-gather fragment or fragments. So you end up with the same sort of kiobuf/kiovec concept where you have chains of sg chunks, each chunk with its own completion information. This is *precisely* what I've been trying to get people to address. Forget whether the individual sg fragments are based on pages or not: if you want to have IO merging and accurate completion callbacks, you need not just one sg list but multiple lists each with a separate callback header. Abandon the merging of sg-list requests (by moving that functionality into the higher-level layers) and that problem disappears: flat sg-lists will then work quite happily at the request layer.
Cheers, Stephen
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 17:05 ` Stephen C. Tweedie 2001-02-06 17:14 ` Jens Axboe 2001-02-06 17:22 ` Christoph Hellwig @ 2001-02-06 17:37 ` Ben LaHaise 2001-02-06 18:00 ` Jens Axboe ` (2 more replies) 2 siblings, 3 replies; 186+ messages in thread From: Ben LaHaise @ 2001-02-06 17:37 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar Hey folks, On Tue, 6 Feb 2001, Stephen C. Tweedie wrote: > The whole point of the post was that it is merging, not splitting, > which is troublesome. How are you going to merge requests without > having chains of scatter-gather entities each with their own > completion callbacks? Let me just emphasize what Stephen is pointing out: if requests are properly merged at higher layers, then merging is neither required nor desired. Traditionally, ext2 has not done merging because the underlying system doesn't support it. This leads to rather convoluted code for readahead which doesn't result in appropriately merged requests on indirect block boundaries, and in fact leads to suboptimal performance. The only case I see where merging of requests can improve things is when dealing with lots of small files. But we already know that small files need to be treated differently (e.g. tail merging). Besides, most of the benefit of merging can be had by doing readaround for these small files. As for io completion, can't we just issue separate requests for the critical data and the readahead? That way for SCSI disks, the important io should be finished while the readahead can continue. Thoughts? -ben
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 17:37 ` Ben LaHaise @ 2001-02-06 18:00 ` Jens Axboe 2001-02-06 18:09 ` Ben LaHaise 2001-02-06 18:14 ` Linus Torvalds 2001-02-06 18:18 ` Ingo Molnar 2 siblings, 1 reply; 186+ messages in thread From: Jens Axboe @ 2001-02-06 18:00 UTC (permalink / raw) To: Ben LaHaise Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar On Tue, Feb 06 2001, Ben LaHaise wrote: > > The whole point of the post was that it is merging, not splitting, > > which is troublesome. How are you going to merge requests without > > having chains of scatter-gather entities each with their own > > completion callbacks? > > Let me just emphasize what Stephen is pointing out: if requests are > properly merged at higher layers, then merging is neither required nor > desired. Traditionally, ext2 has not done merging because the underlying > system doesn't support it. This leads to rather convoluted code for > readahead which doesn't result in appropriately merged requests on > indirect block boundaries, and in fact leads to suboptimal performance. > The only case I see where merging of requests can improve things is when > dealing with lots of small files. But we already know that small files > need to be treated differently (e.g. tail merging). Besides, most of the > benefit of merging can be had by doing readaround for these small files. Stephen already covered this point; the merging is not a problem to deal with for read-ahead. The underlying system can easily queue that in nice big chunks. Delayed allocation makes it easier to flush big chunks as well. I seem to recall the xfs people having problems with the lack of merging causing a performance hit on smaller I/O. Of course merging doesn't have to happen in ll_rw_blk. > As for io completion, can't we just issue separate requests for the > critical data and the readahead?
That way for SCSI disks, the important > io should be finished while the readahead can continue. Thoughts? Priorities? -- Jens Axboe
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 18:00 ` Jens Axboe @ 2001-02-06 18:09 ` Ben LaHaise 2001-02-06 19:35 ` Jens Axboe 0 siblings, 1 reply; 186+ messages in thread From: Ben LaHaise @ 2001-02-06 18:09 UTC (permalink / raw) To: Jens Axboe Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Jens Axboe wrote: > Stephen already covered this point, the merging is not a problem > to deal with for read-ahead. The underlying system can easily I just wanted to make sure that was clear =) > queue that in nice big chunks. Delayed allocation makes it > easier to flush big chunks as well. I seem to recall the xfs people > having problems with the lack of merging causing a performance hit > on smaller I/O. That's where readaround buffers come into play. If we have a fixed number of readaround buffers that are used when small ios are issued, they should provide a low overhead means of substantially improving things like find (which reads many nearby inodes out of order but sequentially). I need to implement this and get cache hit rates for various workloads. ;-) > Of course merging doesn't have to happen in ll_rw_blk. > > > As for io completion, can't we just issue separate requests for the > > critical data and the readahead? That way for SCSI disks, the important > > io should be finished while the readahead can continue. Thoughts? > > Priorities? Definitely. I'd like to be able to issue readaheads with a "don't bother executing this request unless the cost is low" bit set. It might also be helpful for heavy multiuser loads (or even a single user with multiple processes) to ensure progress is made for others. -ben
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 18:09 ` Ben LaHaise @ 2001-02-06 19:35 ` Jens Axboe 0 siblings, 0 replies; 186+ messages in thread From: Jens Axboe @ 2001-02-06 19:35 UTC (permalink / raw) To: Ben LaHaise Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar On Tue, Feb 06 2001, Ben LaHaise wrote: > > > As for io completion, can't we just issue separate requests for the > > > critical data and the readahead? That way for SCSI disks, the important > > > io should be finished while the readahead can continue. Thoughts? > > > > Priorities? > > Definitely. I'd like to be able to issue readaheads with a "don't bother > executing this request unless the cost is low" bit set. It might also > be helpful for heavy multiuser loads (or even a single user with multiple > processes) to ensure progress is made for others. And in other contexts too it might be handy to assign priorities to requests as well. I don't know how sgi plan on handling grio (or already handle it in irix), maybe Steve can fill us in on that :) -- Jens Axboe
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 17:37 ` Ben LaHaise 2001-02-06 18:00 ` Jens Axboe @ 2001-02-06 18:14 ` Linus Torvalds 2001-02-08 11:21 ` Andi Kleen 2001-02-08 14:11 ` Martin Dalecki 2 siblings, 2 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-06 18:14 UTC (permalink / raw) To: Ben LaHaise Cc: Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Ben LaHaise wrote: > > On Tue, 6 Feb 2001, Stephen C. Tweedie wrote: > > > The whole point of the post was that it is merging, not splitting, > > which is troublesome. How are you going to merge requests without > > having chains of scatter-gather entities each with their own > > completion callbacks? > > Let me just emphasize what Stephen is pointing out: if requests are > properly merged at higher layers, then merging is neither required nor > desired. I will claim that you CANNOT merge at higher levels and get good performance. Sure, you can do read-ahead, and try to get big merges that way at a high level. Good for you. But you'll have a bitch of a time trying to merge multiple threads/processes reading from the same area on disk at roughly the same time. Your higher levels won't even _know_ that there is merging to be done until the IO requests hit the wall in waiting for the disk. Quite frankly, this whole discussion sounds worthless. We have solved this problem already: it's called a "buffer head". Deceptively simple at higher levels, and lower levels can easily merge them together into chains and do fancy scatter-gather structures of them that can be dynamically extended at any time. The buffer heads together with "struct request" do a hell of a lot more than just a simple scatter-gather: it's able to create ordered lists of independent sg-events, together with full call-backs etc.
They are low-cost, fairly efficient, and they have worked beautifully for years. The fact that kiobufs can't be made to do the same thing is somebody else's problem. I _know_ that merging has to happen late, and if others are hitting their heads against this issue until they turn silly, then that's their problem. You'll eventually learn, or you'll hit your heads into a pulp. Linus
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 18:14 ` Linus Torvalds @ 2001-02-08 11:21 ` Andi Kleen 2001-02-08 14:11 ` Martin Dalecki 1 sibling, 0 replies; 186+ messages in thread From: Andi Kleen @ 2001-02-08 11:21 UTC (permalink / raw) To: Linus Torvalds Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar On Tue, Feb 06, 2001 at 10:14:21AM -0800, Linus Torvalds wrote: > I will claim that you CANNOT merge at higher levels and get good > performance. > > Sure, you can do read-ahead, and try to get big merges that way at a high > level. Good for you. > > But you'll have a bitch of a time trying to merge multiple > threads/processes reading from the same area on disk at roughly the same > time. Your higher levels won't even _know_ that there is merging to be > done until the IO requests hit the wall in waiting for the disk. Hi, I've tried to experimentally check this statement. I instrumented a kernel with the following patch. It keeps a counter for every merge between unrelated requests. An unrelated request is defined as one allocated from a different current. I did various tests and surprisingly I was not able to trigger a single unrelated merge on my IDE system with various IO loads (dbench, news expire, news sort, kernel compile, swapping ...) So either my patch is wrong (if yes, what is wrong?), or they simply do not happen in usual IO loads. I know that it has a few holes (like it doesn't count unrelated merges that happen from the same process, or if a process quits and another one gets its kernel stack and IO of both is merged it'll be counted as a related merge), but if unrelated merges were relevant, more should still show up, no? My pet theory is that the page and buffer cache filter most unrelated merges out.
I haven't tried to use raw IO to avoid this problem, but I expect that anything that does raw IO will do some intelligent IO scheduling on its own anyways. If anyone is interested: it would be interesting if other people are able to trigger unrelated merges in real loads. Here is the patch. Display statistics using:

	(echo print unrelated_merge ; echo print related_merge) | gdb vmlinux /proc/kcore

--- linux/drivers/block/ll_rw_blk.c-REQSTAT	Tue Jan 30 13:33:25 2001
+++ linux/drivers/block/ll_rw_blk.c	Thu Feb  8 01:13:57 2001
@@ -31,6 +31,9 @@
 
 #include <linux/module.h>
 
+int unrelated_merge;
+int related_merge;
+
 /*
  * MAC Floppy IWM hooks
  */
@@ -478,6 +481,7 @@
 		rq->rq_status = RQ_ACTIVE;
 		rq->special = NULL;
 		rq->q = q;
+		rq->originator = current;
 	}
 
 	return rq;
@@ -668,6 +672,11 @@
 	if (!q->merge_requests_fn(q, req, next, max_segments))
 		return;
 
+	if (next->originator != req->originator)
+		unrelated_merge++;
+	else
+		related_merge++;
+
 	q->elevator.elevator_merge_req_fn(req, next);
 	req->bhtail->b_reqnext = next->bh;
 	req->bhtail = next->bhtail;
--- linux/include/linux/blkdev.h-REQSTAT	Tue Jan 30 17:17:01 2001
+++ linux/include/linux/blkdev.h	Wed Feb  7 23:33:35 2001
@@ -45,6 +45,8 @@
 	struct buffer_head * bh;
 	struct buffer_head * bhtail;
 	request_queue_t *q;
+
+	struct task_struct *originator;
 };
 
 #include <linux/elevator.h>

-Andi
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:14 ` Linus Torvalds
  2001-02-08 11:21 ` Andi Kleen
@ 2001-02-08 14:11 ` Martin Dalecki
  2001-02-08 17:59 ` Linus Torvalds
  1 sibling, 1 reply; 186+ messages in thread
From: Martin Dalecki @ 2001-02-08 14:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
      Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

Linus Torvalds wrote:
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> > >
> > > The whole point of the post was that it is merging, not splitting,
> > > which is troublesome. How are you going to merge requests without
> > > having chains of scatter-gather entities each with their own
> > > completion callbacks?
> >
> > Let me just emphasize what Stephen is pointing out: if requests are
> > properly merged at higher layers, then merging is neither required nor
> > desired.
>
> I will claim that you CANNOT merge at higher levels and get good
> performance.
>
> Sure, you can do read-ahead, and try to get big merges that way at a high
> level. Good for you.
>
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Merging is a hardware-tied optimization, so it should happen where you
really have full "knowledge" and control of the hardware -> namely the
device driver.

> Quite frankly, this whole discussion sounds worthless. We have solved
> this problem already: it's called a "buffer head". Deceptively simple at
> higher levels, and lower levels can easily merge them together into
> chains and do fancy scatter-gather structures of them that can be
> dynamically extended at any time.
>
> The buffer heads together with "struct request" do a hell of a lot more
> than just a simple scatter-gather: it's able to create ordered lists of
> independent sg-events, together with full call-backs etc. They are
> low-cost, fairly efficient, and they have worked beautifully for years.
>
> The fact that kiobufs can't be made to do the same thing is somebody
> else's problem. I _know_ that merging has to happen late, and if others
> are hitting their heads against this issue until they turn silly, then
> that's their problem. You'll eventually learn, or you'll hit your heads
> into a pulp.

Amen.

--
- phone: +49 214 8656 283
- job:   STOCK-WORLD Media AG, LEV .de (MY OPPINNIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort: ru_RU.KOI8-R
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 14:11 ` Martin Dalecki
@ 2001-02-08 17:59 ` Linus Torvalds
  0 siblings, 0 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-08 17:59 UTC (permalink / raw)
  To: Martin Dalecki
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
      Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

On Thu, 8 Feb 2001, Martin Dalecki wrote:
> >
> > But you'll have a bitch of a time trying to merge multiple
> > threads/processes reading from the same area on disk at roughly the same
> > time. Your higher levels won't even _know_ that there is merging to be
> > done until the IO requests hit the wall in waiting for the disk.
>
> Merging is a hardware-tied optimization, so it should happen where you
> really have full "knowledge" and control of the hardware -> namely the
> device driver.

Or, in many cases, the device itself.

There are valid reasons for not doing merging in the driver, but they all
tend to boil down to "even lower layers can do a better job of it". They
basically _never_ boil down to "upper layers already did it for us".

That said, there tend to be advantages to doing "appropriate" clustering
at each level. Upper layers can (and do) use read-ahead to help the lower
levels. The write-out can (and currently does not) try to sort the
requests for better elevator behaviour. The driver level can (and does)
further cluster the requests - even if the low-level device does a
perfect job of ordering and merging on its own, it's usually advantageous
to have fewer (and bigger) commands in-flight in order to have fewer
completion interrupts and less command traffic on the bus.

So it's obviously not entirely black-and-white. Upper layers can help,
but it's a mistake to think that they should "do the work".

(Note: a lot of people seem to think that "layering" means that the
complexity is in upper layers, and that lower layers should be simple and
"stupid". This is not true. A well-balanced layering would have all
layers doing potentially equally complex things - but the complexity
should be _independent_. Complex interactions are bad. But it's also bad
to think that lower levels shouldn't be allowed to optimize because they
should be "simple".)

		Linus
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:37 ` Ben LaHaise
  2001-02-06 18:00 ` Jens Axboe
  2001-02-06 18:14 ` Linus Torvalds
@ 2001-02-06 18:18 ` Ingo Molnar
  2001-02-06 18:25 ` Ben LaHaise
  2 siblings, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 18:18 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ben LaHaise wrote:

> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired. [...]

this is just so incorrect that it's not funny anymore.

- higher levels just do not have the kind of knowledge lower levels have.

- merging decisions are often not even *deterministic*.

- higher levels do not have the kind of state to eg. merge requests done
  by different users. The only chance for merging is often the lowest
  level, where we already know what disk, which sector.

- merging is not even *required* for some devices - and chances are high
  that we'll get away from this inefficient and unreliable 'rotating
  array of disks' business of storing bulk data in this century. (solid
  state disks, holographic storage, whatever.)

i'm truly shocked that you and Stephen are both saying this.

	Ingo
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:18 ` Ingo Molnar
@ 2001-02-06 18:25 ` Ben LaHaise
  2001-02-06 18:35 ` Ingo Molnar
  0 siblings, 1 reply; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 18:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> - higher levels do not have the kind of state to eg. merge requests done
>   by different users. The only chance for merging is often the lowest
>   level, where we already know what disk, which sector.

That's what a readaround buffer is for, and I suspect that readaround
will give us a big performance boost.

> - merging is not even *required* for some devices - and chances are high
>   that we'll get away from this inefficient and unreliable 'rotating
>   array of disks' business of storing bulk data in this century. (solid
>   state disks, holographic storage, whatever.)

Interesting that you've brought up this point, as it's an example

> i'm truly shocked that you and Stephen are both saying this.

Merging != sorting. Sorting of requests has to be carried out at the
lower layers, and the specific block device should be able to choose the
Right Thing To Do for the next item in a chain of sequential requests.

		-ben
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:25 ` Ben LaHaise
@ 2001-02-06 18:35 ` Ingo Molnar
  2001-02-06 18:54 ` Ben LaHaise
  0 siblings, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 18:35 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > - higher levels do not have the kind of state to eg. merge requests done
> >   by different users. The only chance for merging is often the lowest
> >   level, where we already know what disk, which sector.
>
> That's what a readaround buffer is for, [...]

If you are merging based on (device, offset) values, then that's
low-level - and this is what we have been doing for years.

If you are merging based on (inode, offset), then it has flaws like not
being able to merge through a loopback or stacked filesystem.

	Ingo
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:35 ` Ingo Molnar
@ 2001-02-06 18:54 ` Ben LaHaise
  2001-02-06 18:58 ` Ingo Molnar
  2001-02-06 19:20 ` Linus Torvalds
  0 siblings, 2 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 18:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> If you are merging based on (device, offset) values, then that's
> low-level - and this is what we have been doing for years.
>
> If you are merging based on (inode, offset), then it has flaws like not
> being able to merge through a loopback or stacked filesystem.

I disagree. Loopback filesystems typically have their data contiguously
on disk and won't split up incoming requests any further.

Here are the points I'm trying to address:

- reduce the overhead in submitting block ios, especially for large ios.
  Look at the %CPU usage differences between 512 byte blocks and 4KB
  blocks; this can be better.

- make asynchronous io possible in the block layer. This is impossible
  with the current ll_rw_block scheme and io request plugging.

- provide a generic mechanism for reordering io requests for devices
  which will benefit from this. Make it a library for drivers to call
  into. IDE for example will probably make use of it, but some high end
  devices do this on the controller. This is the important point: make
  it OPTIONAL.

You mentioned non-spindle-based io devices in your last message. Take
something like a big RAM disk. Now compare kiobuf-based io to
buffer-head-based io. Tell me which one is going to perform better.

		-ben
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:54 ` Ben LaHaise
@ 2001-02-06 18:58 ` Ingo Molnar
  2001-02-06 19:11 ` Ben LaHaise
  2001-02-06 19:20 ` Linus Torvalds
  1 sibling, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 18:58 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ben LaHaise wrote:

> - reduce the overhead in submitting block ios, especially for large
>   ios. Look at the %CPU usage differences between 512 byte blocks and
>   4KB blocks; this can be better.

my system is already submitting 4KB bhs. If anyone's raw-IO setup
submits 512 byte bhs, that's a problem of the raw IO code ...

> - make asynchronous io possible in the block layer. This is impossible
>   with the current ll_rw_block scheme and io request plugging.

why is it impossible?

> You mentioned non-spindle-based io devices in your last message. Take
> something like a big RAM disk. Now compare kiobuf-based io to
> buffer-head-based io. Tell me which one is going to perform better.

roughly equal performance when using 4K bhs. And a hell of a lot more
complex and volatile code in the kiobuf case.

	Ingo
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:58 ` Ingo Molnar
@ 2001-02-06 19:11 ` Ben LaHaise
  2001-02-06 19:32 ` Jens Axboe
  ` (3 more replies)
  0 siblings, 4 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 19:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > - reduce the overhead in submitting block ios, especially for large
> >   ios. Look at the %CPU usage differences between 512 byte blocks and
> >   4KB blocks; this can be better.
>
> my system is already submitting 4KB bhs. If anyone's raw-IO setup
> submits 512 byte bhs, that's a problem of the raw IO code ...
>
> > - make asynchronous io possible in the block layer. This is impossible
> >   with the current ll_rw_block scheme and io request plugging.
>
> why is it impossible?

s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to
have a non blocking variant that does all of the setup in the caller's
context. Yes, I know that we can do it with a kernel thread, but that
isn't as clean and it significantly penalises small ios (hint: databases
issue *lots* of small random ios and a good chunk of large ios).

> > You mentioned non-spindle-based io devices in your last message. Take
> > something like a big RAM disk. Now compare kiobuf-based io to
> > buffer-head-based io. Tell me which one is going to perform better.
>
> roughly equal performance when using 4K bhs. And a hell of a lot more
> complex and volatile code in the kiobuf case.

I'm willing to benchmark you on this.

		-ben
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11 ` Ben LaHaise
@ 2001-02-06 19:32 ` Jens Axboe
  2001-02-06 19:32 ` Ingo Molnar
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 19:32 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Linus Torvalds, Alan Cox,
      Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
      Ingo Molnar

On Tue, Feb 06 2001, Ben LaHaise wrote:

> > > - make asynchronous io possible in the block layer. This is
> > >   impossible with the current ll_rw_block scheme and io request
> > >   plugging.
> >
> > why is it impossible?
>
> s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to
> have a non blocking variant that does all of the setup in the caller's
> context. Yes, I know that we can do it with a kernel thread, but that
> isn't as clean and it significantly penalises small ios (hint:
> databases issue *lots* of small random ios and a good chunk of large
> ios).

So make a non-blocking variant, not a big deal. Users of async I/O know
how to deal with resource limits anyway.

--
Jens Axboe
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11 ` Ben LaHaise
  2001-02-06 19:32 ` Jens Axboe
@ 2001-02-06 19:32 ` Ingo Molnar
  2001-02-06 19:32 ` Linus Torvalds
  2001-02-06 19:46 ` Ingo Molnar
  3 siblings, 0 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:32 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > > - make asynchronous io possible in the block layer. This is
> > >   impossible with the current ll_rw_block scheme and io request
> > >   plugging.
> >
> > why is it impossible?
>
> s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to
> have a non blocking variant that does all of the setup in the caller's
> context. [...]

sorry, but exactly what code are you comparing this to? The aio code you
sent a few days ago does not do this either. (And you did not answer my
questions regarding this issue.) What i saw is some scheme that at one
point relies on keventd (a kernel thread) to do the blocking stuff. [or,
unless i have misread the code, does the ->bmap() synchronously.]

indeed an asynchronous ll_rw_block() is possible and desirable (and not
hard at all - all structures are interrupt-safe already, as opposed to
the kiovec code), but this is only half of the story. What is the big
issue for me is an async ->bmap(). And we won't access ext2fs data
structures from IRQ handlers anytime soon - so true async IO right now
is damn near impossible. No matter what the IO-submission interface is:
kiobufs/kiovecs or bhs/requests.

	Ingo
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11 ` Ben LaHaise
  2001-02-06 19:32 ` Jens Axboe
  2001-02-06 19:32 ` Ingo Molnar
@ 2001-02-06 19:32 ` Linus Torvalds
  2001-02-06 19:44 ` Ingo Molnar
  ` (2 more replies)
  2001-02-06 19:46 ` Ingo Molnar
  3 siblings, 3 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 19:32 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to
> have a non blocking variant that does all of the setup in the caller's
> context. Yes, I know that we can do it with a kernel thread, but that
> isn't as clean and it significantly penalises small ios (hint:
> databases issue *lots* of small random ios and a good chunk of large
> ios).

Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block() does
NOT block. Never has. Never will.

(Small correction: it doesn't block on anything else than allocating a
request structure if needed, and quite frankly, you have to block
SOMETIME. You can't just try to throw stuff at the device faster than it
can take it. Think of it as a "there can only be this many IO's in
flight")

If you want to use kiobuf's because you think they are asynchronous and
bh's aren't, then somebody has been feeding you a lot of crap. The
kiobuf PR department seems to have been working overtime on some FUD
strategy.

The fact is that bh's can do MORE than kiobuf's. They have all the
callbacks in place etc. They merge and sort correctly. Oh, they have
limitations: one "bh" always describes just one memory area with a
"start,len" kind of thing. That's fine - scatter-gather is pushed
downwards, and the upper layers do not even need to know about it. Which
is what layering is all about, after all.

Traditionally, a "bh" is only _used_ for small areas, but that's not a
"bh" issue, that's a memory management issue. The code should pretty
much handle the issue of a single 64kB bh as-is, but nothing creates
them: the VM layer only creates bh's in sizes ranging from 512 bytes to
a single page. The IO layer could do more, but there has yet to be
anybody who needed more (because once you hit a page size, you tend to
get into scatter-gather, so you want to have one bh per area - and let
the low-level IO layer handle the actual merging etc).

Right now, on many normal setups, the thing that limits our ability to
do big IO requests is actually the fact that IDE cannot do more than
128kB per request, for example (256 sectors). It's not the bh's or the
VM layer.

If you want to make a "raw disk device", you can do so TODAY with bh's.
How? Don't use "bread()" (which allocates the backing store and creates
the cache). Allocate a separate anonymous bh (or multiple), and set them
up to point to whatever data source/sink you have, and let it rip. All
asynchronous. All with nice completion callbacks. All with existing
code, no kiobuf's in sight.

What more do you think your kiobuf's should be able to do?

		Linus
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:32 ` Linus Torvalds
@ 2001-02-06 19:44 ` Ingo Molnar
  2001-02-06 19:49 ` Ben LaHaise
  2001-02-06 20:25 ` Christoph Hellwig
  2 siblings, 0 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Linus Torvalds wrote:

> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than
> it can take it. Think of it as a "there can only be this many IO's in
> flight")

yep. The/my goal would be to get some sort of async IO capability that
is able to read the pagecache without holding up the process. And just
because i've already implemented the helper-kernel-thread async IO
variant [in fact what TUX does is that there are per-CPU async IO helper
threads, and we always pick the 'localized' thread, to avoid unnecessary
cross-CPU traffic], i'd like to explore the possibility of getting this
done via a pure, IRQ-driven state-machine - which arguably has the
lowest overhead.

but i just cannot find any robust way to do this with ext2fs (or any
other disk-based FS for that matter). The horror scenario: the inode
block is not cached yet, and the block resides in a triple-indirected
block, which triggers 3 other block reads before the actual data block
can be read. And i definitely do not see why kiobufs would help make
this any easier.

	Ingo
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:32 ` Linus Torvalds
  2001-02-06 19:44 ` Ingo Molnar
@ 2001-02-06 19:49 ` Ben LaHaise
  2001-02-06 19:57 ` Ingo Molnar
  2001-02-06 20:26 ` Linus Torvalds
  2001-02-06 20:25 ` Christoph Hellwig
  2 siblings, 2 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Linus Torvalds wrote:
>
> Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block()
> does NOT block. Never has. Never will.
>
> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than
> it can take it. Think of it as a "there can only be this many IO's in
> flight")

This small correction is the crux of the problem: if it blocks, it takes
away from the ability of the process to continue doing useful work. If
it returns -EAGAIN, then that's okay; the io will be resubmitted later
when other disk io has completed. But it should be possible to continue
servicing network requests or user io while disk io is underway.

> If you want to use kiobuf's because you think they are asynchronous and
> bh's aren't, then somebody has been feeding you a lot of crap. The
> kiobuf PR department seems to have been working overtime on some FUD
> strategy.

I'm using bh's to refer to what is currently being done, and kiobuf when
talking about what could be done. It's probably the wrong thing to do,
and if bh's are extended to operate on arbitrary sized blocks then there
is no difference between the two.

> If you want to make a "raw disk device", you can do so TODAY with bh's.
> How? Don't use "bread()" (which allocates the backing store and creates
> the cache). Allocate a separate anonymous bh (or multiple), and set
> them up to point to whatever data source/sink you have, and let it rip.
> All asynchronous. All with nice completion callbacks. All with existing
> code, no kiobuf's in sight.
>
> What more do you think your kiobuf's should be able to do?

That's what my code is doing today. There are a ton of bh's set up for a
single kiobuf request that is issued. For something like a single 256kb
io, this is the difference between the batched io requests being passed
into submit_bh fitting in L1 cache and overflowing it. Resizable bh's
would certainly improve this.

		-ben
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:49 ` Ben LaHaise
@ 2001-02-06 19:57 ` Ingo Molnar
  2001-02-06 20:07 ` Jens Axboe
  ` (2 more replies)
  1 sibling, 3 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:57 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Linus Torvalds, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ben LaHaise wrote:

> This small correction is the crux of the problem: if it blocks, it
> takes away from the ability of the process to continue doing useful
> work. If it returns -EAGAIN, then that's okay; the io will be
> resubmitted later when other disk io has completed. But it should be
> possible to continue servicing network requests or user io while disk
> io is underway.

typical blocking point is waiting for page completion, not
__wait_request(). But this is really not an issue, NR_REQUESTS can be
increased anytime. If NR_REQUESTS is large enough, then think of it as
the 'absolute upper limit of doing IO', and think of the blocking as
'the kernel pulling the brakes'.

[overhead of 512-byte bhs in the raw IO code is an artificial problem of
the raw IO code.]

	Ingo
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:57 ` Ingo Molnar
@ 2001-02-06 20:07 ` Jens Axboe
  2001-02-06 20:25 ` Ben LaHaise
  2001-02-07  0:21 ` Stephen C. Tweedie
  2 siblings, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 20:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ben LaHaise, Linus Torvalds, Stephen C. Tweedie, Alan Cox,
      Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
      Ingo Molnar

On Tue, Feb 06 2001, Ingo Molnar wrote:

> > This small correction is the crux of the problem: if it blocks, it
> > takes away from the ability of the process to continue doing useful
> > work. If it returns -EAGAIN, then that's okay; the io will be
> > resubmitted later when other disk io has completed. But it should be
> > possible to continue servicing network requests or user io while disk
> > io is underway.
>
> typical blocking point is waiting for page completion, not
> __wait_request(). But this is really not an issue, NR_REQUESTS can be
> increased anytime. If NR_REQUESTS is large enough, then think of it as
> the 'absolute upper limit of doing IO', and think of the blocking as
> 'the kernel pulling the brakes'.

Not just __get_request_wait, but also the limit on max locked buffers in
ll_rw_block. Serves the same purpose though, brake effect.

--
Jens Axboe
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:57 ` Ingo Molnar
  2001-02-06 20:07 ` Jens Axboe
@ 2001-02-06 20:25 ` Ben LaHaise
  2001-02-06 20:41 ` Manfred Spraul
  2001-02-06 20:49 ` Jens Axboe
  2 siblings, 2 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 20:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
      Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > This small correction is the crux of the problem: if it blocks, it
> > takes away from the ability of the process to continue doing useful
> > work. If it returns -EAGAIN, then that's okay; the io will be
> > resubmitted later when other disk io has completed. But it should be
> > possible to continue servicing network requests or user io while
> > disk io is underway.
>
> typical blocking point is waiting for page completion, not
> __wait_request(). But this is really not an issue, NR_REQUESTS can be
> increased anytime. If NR_REQUESTS is large enough, then think of it as
> the 'absolute upper limit of doing IO', and think of the blocking as
> 'the kernel pulling the brakes'.

=) This is what I'm seeing: lots of processes waiting with wchan ==
__get_request_wait. With async io and a database flushing lots of io
asynchronously spread out across the disk, the NR_REQUESTS limit is hit
very quickly.

> [overhead of 512-byte bhs in the raw IO code is an artificial problem
> of the raw IO code.]

True, and in the tests I've run, raw io is using 2KB blocks (same as the
database).

		-ben
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:25 ` Ben LaHaise @ 2001-02-06 20:41 ` Manfred Spraul 2001-02-06 20:50 ` Jens Axboe 2001-02-06 20:49 ` Jens Axboe 1 sibling, 1 reply; 186+ messages in thread From: Manfred Spraul @ 2001-02-06 20:41 UTC (permalink / raw) To: Ben LaHaise Cc: Ingo Molnar, Linus Torvalds, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Ben LaHaise wrote: > > On Tue, 6 Feb 2001, Ingo Molnar wrote: > > > > > On Tue, 6 Feb 2001, Ben LaHaise wrote: > > > > > This small correction is the crux of the problem: if it blocks, it > > > takes away from the ability of the process to continue doing useful > > > work. If it returns -EAGAIN, then that's okay, the io will be > > > resubmitted later when other disk io has completed. But, it should be > > > possible to continue servicing network requests or user io while disk > > > io is underway. > > > > typical blocking point is waiting for page completion, not > > __wait_request(). But, this is really not an issue, NR_REQUESTS can be > > increased anytime. If NR_REQUESTS is large enough then think of it as the > > 'absolute upper limit of doing IO', and think of the blocking as 'the > > kernel pulling the brakes'. > > =) This is what I'm seeing: lots of processes waiting with wchan == > __get_request_wait. With async io and a database flushing lots of io > asynchronously spread out across the disk, the NR_REQUESTS limit is hit > very quickly. > Has that anything to do with kiobuf or buffer head? Several kernel functions need a "dontblock" parameter (or a callback, or a waitqueue address, or a tq_struct pointer). -- Manfred - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:41 ` Manfred Spraul @ 2001-02-06 20:50 ` Jens Axboe 2001-02-06 21:26 ` Manfred Spraul 0 siblings, 1 reply; 186+ messages in thread From: Jens Axboe @ 2001-02-06 20:50 UTC (permalink / raw) To: Manfred Spraul Cc: Ben LaHaise, Ingo Molnar, Linus Torvalds, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, Feb 06 2001, Manfred Spraul wrote: > > =) This is what I'm seeing: lots of processes waiting with wchan == > > __get_request_wait. With async io and a database flushing lots of io > > asynchronously spread out across the disk, the NR_REQUESTS limit is hit > > very quickly. > > > Has that anything to do with kiobuf or buffer head? Nothing > Several kernel functions need a "dontblock" parameter (or a callback, or > a waitqueue address, or a tq_struct pointer). We don't even need that, non-blocking is implicitly applied with READA. -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:50 ` Jens Axboe @ 2001-02-06 21:26 ` Manfred Spraul 2001-02-06 21:42 ` Linus Torvalds 0 siblings, 1 reply; 186+ messages in thread From: Manfred Spraul @ 2001-02-06 21:26 UTC (permalink / raw) To: Jens Axboe Cc: Ben LaHaise, Ingo Molnar, Linus Torvalds, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Jens Axboe wrote: > > > Several kernel functions need a "dontblock" parameter (or a callback, or > > a waitqueue address, or a tq_struct pointer). > > We don't even need that, non-blocking is implicitly applied with READA. > READA just returns - I doubt that the aio functions should poll until there are free entries in the request queue. The pending aio requests should be "included" into the wait_for_requests waitqueue (ok, they don't have a process context, thus a wait queue entry doesn't help, but these requests belong into that wait queue) -- Manfred - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 21:26 ` Manfred Spraul @ 2001-02-06 21:42 ` Linus Torvalds 2001-02-06 20:16 ` Marcelo Tosatti 2001-02-06 21:57 ` Manfred Spraul 0 siblings, 2 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-06 21:42 UTC (permalink / raw) To: Manfred Spraul Cc: Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Manfred Spraul wrote: > Jens Axboe wrote: > > > > > Several kernel functions need a "dontblock" parameter (or a callback, or > > > a waitqueue address, or a tq_struct pointer). > > > > We don't even need that, non-blocking is implicitly applied with READA. > > > READA just returns - I doubt that the aio functions should poll until > there are free entries in the request queue. The aio functions should NOT use READA/WRITEA. They should just use the normal operations, waiting for requests. The thing that makes them asynchronous is not waiting for the requests to _complete_. Which you can already do, trivially enough. The case for using READA/WRITEA is not that you want to do asynchronous IO (all Linux IO is asynchronous unless you do extra work), but because you have a case where you _might_ want to start IO, but if you don't have a free request slot (ie there's already tons of pending IO happening), you want the option of doing something else. This is not about aio - with aio you _need_ to start the IO, you're just not willing to wait for it. An example of READA/WRITEA is if you want to do opportunistic dirty page cleaning - you might not _have_ to clean it up, but you say "Hmm.. if you can do this simply without having to wait for other requests, start doing the writeout in the background. If not, I'll come back to you later after I've done more real work.." And the Linux block device layer supports both of these kinds of "delayed IO" already. It's all there. Today. 
Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
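The opportunistic cleaning pattern Linus sketches might look roughly like this from a caller's point of view. This is a hedged, non-compilable C-style pseudocode sketch, not actual 2.4 kernel code: `defer_to_normal_writeback()` is a hypothetical helper, and the exact way a caller detects a rejected WRITEA request is an assumption, not the documented interface.

```
/* Pseudocode sketch: opportunistic background writeout with WRITEA.
 * If a request slot is free, the write is queued and proceeds in the
 * background; if the queue is full, nothing happens and the buffer
 * stays dirty, to be written later by normal (blocking) writeback. */
if (buffer_dirty(bh)) {
        ll_rw_block(WRITEA, 1, &bh);    /* may silently do nothing */
        if (buffer_dirty(bh))           /* assumed: request not taken */
                defer_to_normal_writeback(bh);  /* hypothetical helper */
}
```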
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 21:42 ` Linus Torvalds @ 2001-02-06 20:16 ` Marcelo Tosatti 2001-02-06 22:09 ` Jens Axboe 2001-02-06 21:57 ` Manfred Spraul 1 sibling, 1 reply; 186+ messages in thread From: Marcelo Tosatti @ 2001-02-06 20:16 UTC (permalink / raw) To: Linus Torvalds Cc: Manfred Spraul, Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Linus Torvalds wrote: > > > On Tue, 6 Feb 2001, Manfred Spraul wrote: > > Jens Axboe wrote: > > > > > > > Several kernel functions need a "dontblock" parameter (or a callback, or > > > > a waitqueue address, or a tq_struct pointer). > > > > > > We don't even need that, non-blocking is implicitly applied with READA. > > > > > READA just returns - I doubt that the aio functions should poll until > > there are free entries in the request queue. > > The aio functions should NOT use READA/WRITEA. They should just use the > normal operations, waiting for requests. The things that makes them > asycnhronous is not waiting for the requests to _complete_. Which you can > already do, trivially enough. Reading write(2): EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was no room in the pipe or socket connected to fd to write the data immediately. I see no reason why "aio function have to block waiting for requests". _Why_ they do ? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:16 ` Marcelo Tosatti @ 2001-02-06 22:09 ` Jens Axboe 2001-02-06 22:26 ` Linus Torvalds 0 siblings, 1 reply; 186+ messages in thread From: Jens Axboe @ 2001-02-06 22:09 UTC (permalink / raw) To: Marcelo Tosatti Cc: Linus Torvalds, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, Feb 06 2001, Marcelo Tosatti wrote: > > > > We don't even need that, non-blocking is implicitly applied with READA. > > > > > > > READA just returns - I doubt that the aio functions should poll until > > > there are free entries in the request queue. > > > > The aio functions should NOT use READA/WRITEA. They should just use the > > normal operations, waiting for requests. The things that makes them > > asycnhronous is not waiting for the requests to _complete_. Which you can > > already do, trivially enough. > > Reading write(2): > > EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was > no room in the pipe or socket connected to fd to write the data > immediately. > > I see no reason why "aio function have to block waiting for requests". That was my reasoning too with READA etc, but Linus seems to want that we can block while submitting the I/O (as throttling, Linus?) just not until completion. -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 22:09 ` Jens Axboe @ 2001-02-06 22:26 ` Linus Torvalds 2001-02-06 21:13 ` Marcelo Tosatti 2001-02-07 23:15 ` Pavel Machek 0 siblings, 2 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-06 22:26 UTC (permalink / raw) To: Jens Axboe Cc: Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Jens Axboe wrote: > On Tue, Feb 06 2001, Marcelo Tosatti wrote: > > > > Reading write(2): > > > > EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was > > no room in the pipe or socket connected to fd to write the data > > immediately. > > > > I see no reason why "aio function have to block waiting for requests". > > That was my reasoning too with READA etc, but Linus seems to want that we > can block while submitting the I/O (as throttling, Linus?) just not > until completion. Note the "in the pipe or socket" part. ^^^^ ^^^^^^ EAGAIN is _not_ a valid return value for block devices or for regular files. And in fact it _cannot_ be, because select() is defined to always return 1 on them - so if a write() were to return EAGAIN, user space would have nothing to wait on. Busy waiting is evil. So READA/WRITEA are only useful inside the kernel, and when the caller has some data structures of its own that it can use to gracefully handle the case of a failure - it will try to do the IO later for some reason, maybe deciding to do it with blocking because it has nothing better to do at a later date, or because it decides that it can have only so many outstanding requests. Remember: in the end you HAVE to wait somewhere. You're always going to be able to generate data faster than the disk can take it. SOMETHING has to throttle - if you don't allow generic_make_request() to throttle, you have to do it on your own at some point. 
It is stupid and counter-productive to argue against throttling. The only argument can be _where_ that throttling is done, and READA/WRITEA leaves the possibility open of doing it somewhere else (or just delaying it and letting a future call with READ/WRITE do the throttling). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 22:26 ` Linus Torvalds @ 2001-02-06 21:13 ` Marcelo Tosatti 2001-02-06 23:26 ` Linus Torvalds 2001-02-07 23:15 ` Pavel Machek 1 sibling, 1 reply; 186+ messages in thread From: Marcelo Tosatti @ 2001-02-06 21:13 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Linus Torvalds wrote: > Remember: in the end you HAVE to wait somewhere. You're always going to be > able to generate data faster than the disk can take it. SOMETHING has to > throttle - if you don't allow generic_make_request() to throttle, you have > to do it on your own at some point. It is stupid and counter-productive to > argue against throttling. The only argument can be _where_ that throttling > is done, and READA/WRITEA leaves the possibility open of doing it > somewhere else (or just delaying it and letting a future call with > READ/WRITE do the throttling). It's not "arguing against throttling". It's arguing against making a smart application block on the disk while it's able to use the CPU for other work. An application which sets non-blocking behavior and busy waits for a request (which seems to be your argument) is just stupid, of course. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 21:13 ` Marcelo Tosatti @ 2001-02-06 23:26 ` Linus Torvalds 2001-02-07 23:17 ` select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] Pavel Machek 2001-02-08 15:06 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Ben LaHaise 0 siblings, 2 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-06 23:26 UTC (permalink / raw) To: Marcelo Tosatti Cc: Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Marcelo Tosatti wrote: > > Its arguing against making a smart application block on the disk while its > able to use the CPU for other work. There are currently no other alternatives in user space. You'd have to create whole new interfaces for aio_read/write, and ways for the kernel to inform user space that "now you can re-try submitting your IO". Could be done. But that's a big thing. > An application which sets non blocking behavior and busy waits for a > request (which seems to be your argument) is just stupid, of course. Tell me what else it could do at some point? You need something like select() to wait on it. There are no such interfaces right now... (besides, latency would suck. I bet you're better off waiting for the requests if they are all used up. It takes too long to get deep into the kernel from user space, and you cannot use the exclusive waiters with its anti-herd behaviour etc). Simple rule: if you want to optimize concurrency and avoid waiting - use several processes or threads instead. At which point you can get real work done on multiple CPU's, instead of worrying about what happens when you have to wait on the disk. 
Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] 2001-02-06 23:26 ` Linus Torvalds @ 2001-02-07 23:17 ` Pavel Machek 2001-02-08 13:57 ` Ben LaHaise 2001-02-08 17:52 ` Linus Torvalds 2001-02-08 15:06 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Ben LaHaise 1 sibling, 2 replies; 186+ messages in thread From: Pavel Machek @ 2001-02-07 23:17 UTC (permalink / raw) To: Linus Torvalds, Marcelo Tosatti Cc: Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Hi! > > Its arguing against making a smart application block on the disk while its > > able to use the CPU for other work. > > There are currently no other alternatives in user space. You'd have to > create whole new interfaces for aio_read/write, and ways for the kernel to > inform user space that "now you can re-try submitting your IO". Why is current select() interface not good enough? Defining that select may say regular file is not ready should be enough. Okay, maybe you'd want new fcntl() flag saying "I _really_ want this regular file to be non-blocking". No need for new interfaces. Pavel -- I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] 2001-02-07 23:17 ` select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] Pavel Machek @ 2001-02-08 13:57 ` Ben LaHaise 2001-02-08 17:52 ` Linus Torvalds 1 sibling, 0 replies; 186+ messages in thread From: Ben LaHaise @ 2001-02-08 13:57 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Marcelo Tosatti, Jens Axboe, Manfred Spraul, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Pavel Machek wrote: > Hi! > > > > Its arguing against making a smart application block on the disk while its > > > able to use the CPU for other work. > > > > There are currently no other alternatives in user space. You'd have to > > create whole new interfaces for aio_read/write, and ways for the kernel to > > inform user space that "now you can re-try submitting your IO". > > Why is current select() interface not good enough? Think of random disk io scattered across the disk. Think about aio_write providing a means to perform zero copy io without needing to resort to playing mm tricks write protecting pages in the user's page tables. It's also a means for dealing efficiently with thousands of outstanding requests for network io. Using a select based interface is going to be an ugly kludge that still has all the overhead of select/poll. -ben - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] 2001-02-07 23:17 ` select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] Pavel Machek 2001-02-08 13:57 ` Ben LaHaise @ 2001-02-08 17:52 ` Linus Torvalds 1 sibling, 0 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-08 17:52 UTC (permalink / raw) To: Pavel Machek Cc: Marcelo Tosatti, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Pavel Machek wrote: > > > > There are currently no other alternatives in user space. You'd have to > > create whole new interfaces for aio_read/write, and ways for the kernel to > > inform user space that "now you can re-try submitting your IO". > > Why is current select() interface not good enough? Ehh.. One major reason is rather simple: disk request wait times tend to be on the order of sub-millisecond (remember: if we run out of requests, that means that we have 256 of them already queued, which means that it's very likely that several of them will be freed up in the very near future due to completion). The fact is, that if you start doing write/select loops, you're going to waste a _large_ portion of your CPU speed on it. Especially considering that the select() call would have to go all the way down to the ll_rw_blk layer to figure out whether there are more requests etc. So there is (a) historical reasons that say that regular files can never wait and EAGAIN is not an acceptable return value and (b) practical reasons for why such an interface would be a bad one. There are better ways to do it. Either using threads, or just having a better aio-like interface. 
Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 23:26 ` Linus Torvalds 2001-02-07 23:17 ` select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] Pavel Machek @ 2001-02-08 15:06 ` Ben LaHaise 2001-02-08 13:44 ` Marcelo Tosatti 1 sibling, 1 reply; 186+ messages in thread From: Ben LaHaise @ 2001-02-08 15:06 UTC (permalink / raw) To: Linus Torvalds Cc: Marcelo Tosatti, Jens Axboe, Manfred Spraul, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Linus Torvalds wrote: > There are currently no other alternatives in user space. You'd have to > create whole new interfaces for aio_read/write, and ways for the kernel to > inform user space that "now you can re-try submitting your IO". > > Could be done. But that's a big thing. Has been done. Still needs some work, but it works pretty well. As for throttling io, having ios submitted does not have to correspond to them being queued in the lower layers. The main issue with async io is limiting the amount of pinned memory for ios; if that's taken care of, I don't think it matters how many ios are in flight. > > An application which sets non blocking behavior and busy waits for a > > request (which seems to be your argument) is just stupid, of course. > > Tell me what else it could do at some point? You need something like > select() to wait on it. There are no such interfaces right now... > > (besides, latency would suck. I bet you're better off waiting for the > requests if they are all used up. It takes too long to get deep into the > kernel from user space, and you cannot use the exclusive waiters with its > anti-herd behaviour etc). Ah, but no. In fact for some things, the wait queue extensions I'm using will be more efficient as things like test_and_set_bit for obtaining a lock gets executed without waking up a task. 
> Simple rule: if you want to optimize concurrency and avoid waiting - use > several processes or threads instead. At which point you can get real work > done on multiple CPU's, instead of worrying about what happens when you > have to wait on the disk. There do exist plenty of cases where threads are not efficient enough. Just the stack overhead alone with 8000 threads makes things really suck. Event based io completion means that server processes don't need to have the overhead of select/poll. Add in NT style completion ports for waking up the right number of worker threads off of the completion queue. That said, I don't expect all devices to support async io. But given support for files, raw and sockets, all the important cases are covered. The remainder can be supported via userspace helpers. -ben - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 15:06 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Ben LaHaise @ 2001-02-08 13:44 ` Marcelo Tosatti 2001-02-08 13:45 ` Marcelo Tosatti 0 siblings, 1 reply; 186+ messages in thread From: Marcelo Tosatti @ 2001-02-08 13:44 UTC (permalink / raw) To: Ben LaHaise Cc: Linus Torvalds, Jens Axboe, Manfred Spraul, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Ben LaHaise wrote: <snip> > > (besides, latency would suck. I bet you're better off waiting for the > > requests if they are all used up. It takes too long to get deep into the > > kernel from user space, and you cannot use the exclusive waiters with its > > anti-herd behaviour etc). > > Ah, but no. In fact for some things, the wait queue extensions I'm using > will be more efficient as things like test_and_set_bit for obtaining a > lock gets executed without waking up a task. The latency argument is somewhat bogus because there is no problem with checking the request queue in the aio syscalls and simply failing if it's full. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 13:44 ` Marcelo Tosatti @ 2001-02-08 13:45 ` Marcelo Tosatti 0 siblings, 0 replies; 186+ messages in thread From: Marcelo Tosatti @ 2001-02-08 13:45 UTC (permalink / raw) To: Ben LaHaise Cc: Linus Torvalds, Jens Axboe, Manfred Spraul, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Marcelo Tosatti wrote: > > On Thu, 8 Feb 2001, Ben LaHaise wrote: > > <snip> > > > > (besides, latency would suck. I bet you're better off waiting for the > > > requests if they are all used up. It takes too long to get deep into the > > > kernel from user space, and you cannot use the exclusive waiters with its > > > anti-herd behaviour etc). > > > > Ah, but no. In fact for some things, the wait queue extensions I'm using > > will be more efficient as things like test_and_set_bit for obtaining a > > lock gets executed without waking up a task. > > The latency argument is somewhat bogus because there is no problem to > check the request queue, in the aio syscalls, and simply fail if its full. Ugh, I forgot to say check the request queue before doing any filesystem work. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 22:26 ` Linus Torvalds 2001-02-06 21:13 ` Marcelo Tosatti @ 2001-02-07 23:15 ` Pavel Machek 2001-02-08 13:22 ` Stephen C. Tweedie 2001-02-08 14:52 ` Mikulas Patocka 1 sibling, 2 replies; 186+ messages in thread From: Pavel Machek @ 2001-02-07 23:15 UTC (permalink / raw) To: Linus Torvalds, Jens Axboe Cc: Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Hi! > > > Reading write(2): > > > > > > EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was > > > no room in the pipe or socket connected to fd to write the data > > > immediately. > > > > > > I see no reason why "aio function have to block waiting for requests". > > > > That was my reasoning too with READA etc, but Linus seems to want that we > > can block while submitting the I/O (as throttling, Linus?) just not > > until completion. > > Note the "in the pipe or socket" part. > ^^^^ ^^^^^^ > > EAGAIN is _not_ a valid return value for block devices or for regular > files. And in fact it _cannot_ be, because select() is defined to always > return 1 on them - so if a write() were to return EAGAIN, user space would > have nothing to wait on. Busy waiting is evil. So you consider the inability to select() on regular files a _feature_? It can be a pretty serious problem with slow block devices (floppy). It also hurts when you are trying to do high-performance reads/writes. [I know it hurt in the userspace sherlock search engine -- kind of a small altavista.] How do you write a high-performance ftp server without threads if select on a regular file always returns "ready"? > Remember: in the end you HAVE to wait somewhere. You're always going to be > able to generate data faster than the disk can take it. SOMETHING Userspace wants to _know_ when to stop. It asks politely using "select()". Pavel -- I'm pavel@ucw.cz. 
"In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 23:15 ` Pavel Machek @ 2001-02-08 13:22 ` Stephen C. Tweedie 2001-02-08 12:03 ` Marcelo Tosatti 2001-02-08 14:52 ` Mikulas Patocka 1 sibling, 1 reply; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-08 13:22 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Hi, On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote: > > > EAGAIN is _not_ a valid return value for block devices or for regular > > files. And in fact it _cannot_ be, because select() is defined to always > > return 1 on them - so if a write() were to return EAGAIN, user space would > > have nothing to wait on. Busy waiting is evil. > > So you consider inability to select() on regular files _feature_? Select might make some sort of sense for sequential access to files, and for random access via lseek/read but it makes no sense at all for pread and pwrite where select() has no idea _which_ part of the file the user is going to want to access next. > How do you write high-performance ftp server without threads if select > on regular file always returns "ready"? Select can work if the access is sequential, but async IO is a more general solution. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 13:22 ` Stephen C. Tweedie @ 2001-02-08 12:03 ` Marcelo Tosatti 2001-02-08 15:46 ` Mikulas Patocka 2001-02-08 18:09 ` Linus Torvalds 0 siblings, 2 replies; 186+ messages in thread From: Marcelo Tosatti @ 2001-02-08 12:03 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Stephen C. Tweedie wrote: <snip> > > How do you write high-performance ftp server without threads if select > > on regular file always returns "ready"? > > Select can work if the access is sequential, but async IO is a more > general solution. Even async IO (ie aio_read/aio_write) should block on the request queue if it's full, in Linus's mind. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 12:03 ` Marcelo Tosatti @ 2001-02-08 15:46 ` Mikulas Patocka 2001-02-08 14:05 ` Marcelo Tosatti 2001-02-08 15:55 ` Jens Axboe 2001-02-08 18:09 ` Linus Torvalds 1 sibling, 2 replies; 186+ messages in thread From: Mikulas Patocka @ 2001-02-08 15:46 UTC (permalink / raw) To: Marcelo Tosatti Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar > > > How do you write high-performance ftp server without threads if select > > > on regular file always returns "ready"? > > > > Select can work if the access is sequential, but async IO is a more > > general solution. > > Even async IO (ie aio_read/aio_write) should block on the request queue if > its full in Linus mind. This is not problem (you can create queue big enough to handle the load). The problem is that aio_read and aio_write are pretty useless for ftp or http server. You need aio_open. Mikulas - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 15:46 ` Mikulas Patocka @ 2001-02-08 14:05 ` Marcelo Tosatti 2001-02-08 16:11 ` Mikulas Patocka 2001-02-08 15:55 ` Jens Axboe 1 sibling, 1 reply; 186+ messages in thread From: Marcelo Tosatti @ 2001-02-08 14:05 UTC (permalink / raw) To: Mikulas Patocka Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Mikulas Patocka wrote: > > > > How do you write high-performance ftp server without threads if select > > > > on regular file always returns "ready"? > > > > > > Select can work if the access is sequential, but async IO is a more > > > general solution. > > > > Even async IO (ie aio_read/aio_write) should block on the request queue if > > its full in Linus mind. > > This is not problem (you can create queue big enough to handle the load). The point is that you want to be able to not block if the queue full (and the queue size has nothing to do with that). > The problem is that aio_read and aio_write are pretty useless for ftp or > http server. You need aio_open. Could you explain this? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 14:05 ` Marcelo Tosatti @ 2001-02-08 16:11 ` Mikulas Patocka 2001-02-08 14:44 ` Marcelo Tosatti 2001-02-08 16:57 ` Rik van Riel 0 siblings, 2 replies; 186+ messages in thread From: Mikulas Patocka @ 2001-02-08 16:11 UTC (permalink / raw) To: Marcelo Tosatti Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar > > The problem is that aio_read and aio_write are pretty useless for ftp or > > http server. You need aio_open. > > Could you explain this? If the server is sending many small files, the disk spends a huge amount of time walking the directory tree and seeking to inodes. Maybe opening the file is even slower than reading it - a read is usually sequential, but an open needs to seek to a few areas of the disk. And if you have a one-threaded server using open, close, aio_read and aio_write, you actually block the whole server while it is opening a single file. This is not how async io is supposed to work. Mikulas - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 16:11 ` Mikulas Patocka @ 2001-02-08 14:44 ` Marcelo Tosatti 2001-02-08 16:57 ` Rik van Riel 1 sibling, 0 replies; 186+ messages in thread From: Marcelo Tosatti @ 2001-02-08 14:44 UTC (permalink / raw) To: Mikulas Patocka Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Mikulas Patocka wrote: > > > The problem is that aio_read and aio_write are pretty useless for ftp or > > > http server. You need aio_open. > > > > Could you explain this? > > If the server is sending many small files, disk spends huge amount time > walking directory tree and seeking to inodes. Maybe opening the file is > even slower than reading it - read is usually sequential but open needs to > seek at few areas of disk. > > And if you have one-threaded server using open, close, aio_read and > aio_write, you actually block the whole server while it is opening a > single file. This is not how async io is supposed to work. Ok but this is not the point of the discussion. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 16:11 ` Mikulas Patocka 2001-02-08 14:44 ` Marcelo Tosatti @ 2001-02-08 16:57 ` Rik van Riel 2001-02-08 17:13 ` James Sutherland 2001-02-08 18:38 ` Linus Torvalds 1 sibling, 2 replies; 186+ messages in thread From: Rik van Riel @ 2001-02-08 16:57 UTC (permalink / raw) To: Mikulas Patocka Cc: Marcelo Tosatti, Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Mikulas Patocka wrote: > > > You need aio_open. > > Could you explain this? > > If the server is sending many small files, disk spends huge > amount time walking directory tree and seeking to inodes. Maybe > opening the file is even slower than reading it Not if you have a big enough inode_cache and dentry_cache. OTOH ... if you have enough memory the whole async IO argument is moot anyway because all your files will be in memory too. regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 16:57 ` Rik van Riel @ 2001-02-08 17:13 ` James Sutherland 2001-02-08 18:38 ` Linus Torvalds 1 sibling, 0 replies; 186+ messages in thread From: James Sutherland @ 2001-02-08 17:13 UTC (permalink / raw) To: Rik van Riel Cc: Mikulas Patocka, Marcelo Tosatti, Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Rik van Riel wrote: > On Thu, 8 Feb 2001, Mikulas Patocka wrote: > > > > > You need aio_open. > > > Could you explain this? > > > > If the server is sending many small files, disk spends huge > > amount time walking directory tree and seeking to inodes. Maybe > > opening the file is even slower than reading it > > Not if you have a big enough inode_cache and dentry_cache. Eh? However big the caches are, you can still get misses which will require multiple (blocking) disk accesses to handle... > OTOH ... if you have enough memory the whole async IO argument > is moot anyway because all your files will be in memory too. Only for cache hits. If you're doing a Mindcraft benchmark or something with everything in RAM, you're fine - for real world servers, that's not really an option ;-) Really, you want/need cache MISSES to be handled without blocking. However big the caches, short of running EVERYTHING from a ramdisk, these will still happen! James. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 16:57 ` Rik van Riel 2001-02-08 17:13 ` James Sutherland @ 2001-02-08 18:38 ` Linus Torvalds 2001-02-09 12:17 ` Martin Dalecki 1 sibling, 1 reply; 186+ messages in thread From: Linus Torvalds @ 2001-02-08 18:38 UTC (permalink / raw) To: Rik van Riel Cc: Mikulas Patocka, Marcelo Tosatti, Stephen C. Tweedie, Pavel Machek, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Rik van Riel wrote: > On Thu, 8 Feb 2001, Mikulas Patocka wrote: > > > > > You need aio_open. > > > Could you explain this? > > > > If the server is sending many small files, disk spends huge > > amount time walking directory tree and seeking to inodes. Maybe > > opening the file is even slower than reading it > > Not if you have a big enough inode_cache and dentry_cache. > > OTOH ... if you have enough memory the whole async IO argument > is moot anyway because all your files will be in memory too. Note that this _is_ an important point. You should never _ever_ think about pure IO speed as the most important thing. Even if you get absolutely perfect IO streaming off the fastest disk you can find, I will beat you every single time with a cached setup that doesn't need to do IO at all. 90% of the VFS layer is all about caching, and trying to avoid IO. Of the rest, about 9% is about trying to avoid even calling down to the low-level filesystem, because it's faster if we can handle it at a high level without any need to even worry about issues like physical disk addresses. Even if those addresses are cached. The remaining 1% is about actually getting the IO done. At that point we end up throwing our hands in the air and saying "ok, this will be slow". So if you design your system for disk load, you are missing a big portion of the picture. There are cases where IO really matter. 
The most notable one being databases, certainly _not_ web or ftp servers. For web- or ftp-servers you buy more memory if you want high performance, and you tend to be limited by the network speed anyway (if you have multiple gigabit networks and network speed isn't an issue, then I can also tell you that buying a few gigabyte of RAM isn't an issue, because you are obviously working for something like the DoD and have very little regard for the cost of the thing ;) For databases (and for file servers that you want to be robust over a crash), IO throughput is an issue mainly because you need to put the damn requests in stable memory somewhere. Which tends to mean that _write_ speed is what really matters, because the reads you can still try to cache as efficiently as humanly possible (and the issue of database design then turns into trying to find every single piece of locality you can, so that the read caching works as well as possible). Short and sweet: "aio_open()" is basically never supposed to be an issue. If it is, you've misdesigned something, or you're trying too damn hard to single-thread everything (and "hiding" the threading that _does_ happen by just calling it "AIO" instead - lying to yourself, in short). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 18:38 ` Linus Torvalds @ 2001-02-09 12:17 ` Martin Dalecki 0 siblings, 0 replies; 186+ messages in thread From: Martin Dalecki @ 2001-02-09 12:17 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Mikulas Patocka, Marcelo Tosatti, Stephen C. Tweedie, Pavel Machek, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Linus Torvalds wrote: > > On Thu, 8 Feb 2001, Rik van Riel wrote: > > > On Thu, 8 Feb 2001, Mikulas Patocka wrote: > > > > > > > You need aio_open. > > > > Could you explain this? > > > > > > If the server is sending many small files, disk spends huge > > > amount time walking directory tree and seeking to inodes. Maybe > > > opening the file is even slower than reading it > > > > Not if you have a big enough inode_cache and dentry_cache. > > > > OTOH ... if you have enough memory the whole async IO argument > > is moot anyway because all your files will be in memory too. > > Note that this _is_ an important point. > > You should never _ever_ think about pure IO speed as the most important > thing. Even if you get absolutely perfect IO streaming off the fastest > disk you can find, I will beat you every single time with a cached setup > that doesn't need to do IO at all. > > 90% of the VFS layer is all about caching, and trying to avoid IO. Of the > rest, about 9% is about trying to avoid even calling down to the low-level > filesystem, because it's faster if we can handle it at a high level > without any need to even worry about issues like physical disk addresses. > Even if those addresses are cached. > > The remaining 1% is about actually getting the IO done. At that point we > end up throwing our hands in the air and saying "ok, this will be slow". > > So if you design your system for disk load, you are missing a big portion > of the picture. > > There are cases where IO really matter. 
The most notable one being > databases, certainly _not_ web or ftp servers. For web- or ftp-servers you > buy more memory if you want high performance, and you tend to be limited > by the network speed anyway (if you have multiple gigabit networks and > network speed isn't an issue, then I can also tell you that buying a few > gigabyte of RAM isn't an issue, because you are obviously working for > something like the DoD and have very little regard for the cost of the > thing ;) > > For databases (and for file servers that you want to be robust over a > crash), IO throughput is an issue mainly because you need to put the damn > requests in stable memory somewhere. Which tends to mean that _write_ > speed is what really matters, because the reads you can still try to cache > as efficiently as humanly possible (and the issue of database design then > turns into trying to find every single piece of locality you can, so that > the read caching works as well as possible). > > Short and sweet: "aio_open()" is basically never supposed to be an issue. > If it is, you've misdesigned something, or you're trying too damn hard to > single-thread everything (and "hiding" the threading that _does_ happen by > just calling it "AIO" instead - lying to yourself, in short). Right - I agree with you that an AIO design is basically hiding an inherently multi threaded program flow. This argument is indeed very catchy. And looking from some other point one will see that most of the AIO designs are from times where multi threading in applications wasn't that common as it is now. Most prominently coprocesses in a shell come to my mind as a very good example about how to handle AIO (sort of)... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 15:46 ` Mikulas Patocka 2001-02-08 14:05 ` Marcelo Tosatti @ 2001-02-08 15:55 ` Jens Axboe 1 sibling, 0 replies; 186+ messages in thread From: Jens Axboe @ 2001-02-08 15:55 UTC (permalink / raw) To: Mikulas Patocka Cc: Marcelo Tosatti, Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, Feb 08 2001, Mikulas Patocka wrote: > > Even async IO (ie aio_read/aio_write) should block on the request queue if > > its full in Linus mind. > > This is not problem (you can create queue big enough to handle the load). Well in theory, but in practice this isn't a very good idea. At some point throwing yet more requests in there doesn't make a whole lot of sense. You are basically _always_ going to be able to empty the request list by dirtying lots of data. -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 12:03 ` Marcelo Tosatti 2001-02-08 15:46 ` Mikulas Patocka @ 2001-02-08 18:09 ` Linus Torvalds 1 sibling, 0 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-08 18:09 UTC (permalink / raw) To: Marcelo Tosatti Cc: Stephen C. Tweedie, Pavel Machek, Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Thu, 8 Feb 2001, Marcelo Tosatti wrote: > > On Thu, 8 Feb 2001, Stephen C. Tweedie wrote: > > <snip> > > > > How do you write high-performance ftp server without threads if select > > > on regular file always returns "ready"? > > > > Select can work if the access is sequential, but async IO is a more > > general solution. > > Even async IO (ie aio_read/aio_write) should block on the request queue if > its full in Linus mind. Not necessarily. I said that "READA/WRITEA" are only worth exporting inside the kernel - because the latencies and complexities are low-level enough that it should not be exported to user space as such. But I could imagine a kernel aio package that does the equivalent of

	bh->b_end_io = completion_handler;
	generic_make_request(WRITE, bh);	/* this may block */
	bh = bh->b_next;

	/* Now, fill it up as much as we can.. */
	current->state = TASK_INTERRUPTIBLE;
	while (more data to be written) {
		if (generic_make_request(WRITEA, bh) < 0)
			break;
		bh = bh->b_next;
	}
	return;

and then you make the _completion handler_ thing continue to feed more requests. Yes, you may block at some points (because you need to always have at least _one_ request in-flight in order to have the state machine active), but you can basically try to avoid blocking more than necessary. But do you see why the above can't be done from user space? It requires that the completion handler (which runs in an interrupt context) be able to continue to feed requests and keep the queue filled.
If you don't do that, you'll never have good throughput, because it takes too long to send signals, re-schedule or whatever to user mode. And do you see how it has to block _sometimes_? If people do hundreds of AIO requests, we can't let memory just fill up with pending writes.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 23:15 ` Pavel Machek 2001-02-08 13:22 ` Stephen C. Tweedie @ 2001-02-08 14:52 ` Mikulas Patocka 2001-02-08 19:50 ` Stephen C. Tweedie 2001-02-11 21:30 ` Pavel Machek 1 sibling, 2 replies; 186+ messages in thread From: Mikulas Patocka @ 2001-02-08 14:52 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Hi! > So you consider inability to select() on regular files _feature_? select on files is unimplementable. You can't do background file IO the same way you do background receiving of packets on socket. Filesystem is synchronous. It can block. > It can be a pretty serious problem with slow block devices > (floppy). It also hurts when you are trying to do high-performance > reads/writes. [I know it hurt in userspace sherlock search engine -- > kind of small altavista.] > > How do you write high-performance ftp server without threads if select > on regular file always returns "ready"? No, it's not really possible on Linux. Use SYS$QIO call on VMS :-) You can emulate asynchronous IO with kernel threads like FreeBSD and some commercial Unices do, but you still need as many (possibly kernel) threads as many requests you are servicing. > > Remember: in the end you HAVE to wait somewhere. You're always going to be > > able to generate data faster than the disk can take it. SOMETHING > > Userspace wants to _know_ when to stop. It asks politely using > "select()". And how do you want to wait for other select()ed events if you are blocked in wait_for_buffer in get_block (former bmap)? Making real async IO would require to rewrite all filesystems and whole VFS _from_scratch_. It won't happen. 
Mikulas - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 14:52 ` Mikulas Patocka @ 2001-02-08 19:50 ` Stephen C. Tweedie 2001-02-11 21:30 ` Pavel Machek 1 sibling, 0 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-08 19:50 UTC (permalink / raw) To: Mikulas Patocka Cc: Pavel Machek, Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Hi, On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote: > > > How do you write high-performance ftp server without threads if select > > on regular file always returns "ready"? > > No, it's not really possible on Linux. Use SYS$QIO call on VMS :-) Ahh, but even VMS SYS$QIO is synchronous at doing opens, allocation of the IO request packets, and mapping file location to disk blocks. Only the data IO is ever async (and Ben's async IO stuff for Linux provides that too). --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-08 14:52 ` Mikulas Patocka 2001-02-08 19:50 ` Stephen C. Tweedie @ 2001-02-11 21:30 ` Pavel Machek 1 sibling, 0 replies; 186+ messages in thread From: Pavel Machek @ 2001-02-11 21:30 UTC (permalink / raw) To: Mikulas Patocka, Pavel Machek Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Hi! > > So you consider inability to select() on regular files _feature_? > > select on files is unimplementable. You can't do background file IO the > same way you do background receiving of packets on socket. Filesystem is > synchronous. It can block. You can use helper friends if VFS layer is not able to handle background IO. Then we can do it right in linux-4.4. Pavel -- I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 21:42 ` Linus Torvalds 2001-02-06 20:16 ` Marcelo Tosatti @ 2001-02-06 21:57 ` Manfred Spraul 2001-02-06 22:13 ` Linus Torvalds 1 sibling, 1 reply; 186+ messages in thread From: Manfred Spraul @ 2001-02-06 21:57 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Linus Torvalds wrote: > > On Tue, 6 Feb 2001, Manfred Spraul wrote: > > Jens Axboe wrote: > > > > > > > Several kernel functions need a "dontblock" parameter (or a callback, or > > > > a waitqueue address, or a tq_struct pointer). > > > > > > We don't even need that, non-blocking is implicitly applied with READA. > > > > > READA just returns - I doubt that the aio functions should poll until > > there are free entries in the request queue. > > The aio functions should NOT use READA/WRITEA. They should just use the > normal operations, waiting for requests. But then you end with lots of threads blocking in get_request() Quoting Ben's mail: <<<<<<<<< > > =) This is what I'm seeing: lots of processes waiting with wchan == > __get_request_wait. With async io and a database flushing lots of io > asynchronously spread out across the disk, the NR_REQUESTS limit is hit > very quickly. > >>>>>>>>> On an io bound server the request queue is always full - waiting for the next request might take longer than the actual io. -- Manfred - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 21:57 ` Manfred Spraul @ 2001-02-06 22:13 ` Linus Torvalds 2001-02-06 22:26 ` Andre Hedrick 0 siblings, 1 reply; 186+ messages in thread From: Linus Torvalds @ 2001-02-06 22:13 UTC (permalink / raw) To: Manfred Spraul Cc: Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Manfred Spraul wrote: > > > > The aio functions should NOT use READA/WRITEA. They should just use the > > normal operations, waiting for requests. > > But then you end with lots of threads blocking in get_request() So? What the HELL do you expect to happen if somebody writes faster than the disk can take? You don't like busy-waiting. Fair enough. So maybe blocking on a wait-queue is the right thing? Just MAYBE? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 22:13 ` Linus Torvalds @ 2001-02-06 22:26 ` Andre Hedrick 0 siblings, 0 replies; 186+ messages in thread From: Andre Hedrick @ 2001-02-06 22:26 UTC (permalink / raw) To: Linus Torvalds Cc: Manfred Spraul, Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Linus Torvalds wrote: > > > On Tue, 6 Feb 2001, Manfred Spraul wrote: > > > > > > The aio functions should NOT use READA/WRITEA. They should just use the > > > normal operations, waiting for requests. > > > > But then you end with lots of threads blocking in get_request() > > So? > > What the HELL do you expect to happen if somebody writes faster than the > disk can take? > > You don't lik ebusy-waiting. Fair enough. > > So maybe blocking on a wait-queue is the right thing? Just MAYBE? Did I miss a portion of the thread? Is the block layer ignoring the status of a device? --Andre - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:25 ` Ben LaHaise 2001-02-06 20:41 ` Manfred Spraul @ 2001-02-06 20:49 ` Jens Axboe 1 sibling, 0 replies; 186+ messages in thread From: Jens Axboe @ 2001-02-06 20:49 UTC (permalink / raw) To: Ben LaHaise Cc: Ingo Molnar, Linus Torvalds, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, Feb 06 2001, Ben LaHaise wrote: > =) This is what I'm seeing: lots of processes waiting with wchan == > __get_request_wait. With async io and a database flushing lots of io > asynchronously spread out across the disk, the NR_REQUESTS limit is hit > very quickly. You can't do async I/O this way! In going what Linus said, make submit_bh return an int telling you if it failed to queue the buffer and use READA/WRITEA to submit it. -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 19:57 ` Ingo Molnar 2001-02-06 20:07 ` Jens Axboe 2001-02-06 20:25 ` Ben LaHaise @ 2001-02-07 0:21 ` Stephen C. Tweedie 2001-02-07 0:25 ` Ingo Molnar ` (2 more replies) 2 siblings, 3 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-07 0:21 UTC (permalink / raw) To: Ingo Molnar Cc: Ben LaHaise, Linus Torvalds, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Hi, On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote: > > [overhead of 512-byte bhs in the raw IO code is an artificial problem of > the raw IO code.] No, it is a problem of the ll_rw_block interface: buffer_heads need to be aligned on disk at a multiple of their buffer size. Under the Unix raw IO interface it is perfectly legal to begin a 128kB IO at offset 512 bytes into a device. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:21 ` Stephen C. Tweedie @ 2001-02-07 0:25 ` Ingo Molnar 2001-02-07 0:36 ` Stephen C. Tweedie 2001-02-07 0:42 ` Linus Torvalds 2001-02-07 0:35 ` Jens Axboe 2001-02-07 0:41 ` Linus Torvalds 2 siblings, 2 replies; 186+ messages in thread From: Ingo Molnar @ 2001-02-07 0:25 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > No, it is a problem of the ll_rw_block interface: buffer_heads need to > be aligned on disk at a multiple of their buffer size. Under the Unix > raw IO interface it is perfectly legal to begin a 128kB IO at offset > 512 bytes into a device. then we should either fix this limitation, or the raw IO code should split the request up into several, variable-size bhs, so that the range is filled out optimally with aligned bhs. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:25 ` Ingo Molnar @ 2001-02-07 0:36 ` Stephen C. Tweedie 2001-02-07 0:50 ` Linus Torvalds 2001-02-07 1:42 ` Jeff V. Merkey 2001-02-07 0:42 ` Linus Torvalds 1 sibling, 2 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-07 0:36 UTC (permalink / raw) To: Ingo Molnar Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel Hi, On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote: > > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > > No, it is a problem of the ll_rw_block interface: buffer_heads need to > > be aligned on disk at a multiple of their buffer size. Under the Unix > > raw IO interface it is perfectly legal to begin a 128kB IO at offset > > 512 bytes into a device. > > then we should either fix this limitation, or the raw IO code should split > the request up into several, variable-size bhs, so that the range is > filled out optimally with aligned bhs. That gets us from 512-byte blocks to 4k, but no more (ll_rw_block enforces a single blocksize on all requests but that relaxing that requirement is no big deal). Buffer_heads can't deal with data which spans more than a page right now. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:36 ` Stephen C. Tweedie @ 2001-02-07 0:50 ` Linus Torvalds 2001-02-07 1:49 ` Stephen C. Tweedie 2001-02-07 1:51 ` Jeff V. Merkey 2001-02-07 1:42 ` Jeff V. Merkey 1 sibling, 2 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-07 0:50 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block > enforces a single blocksize on all requests but that relaxing that > requirement is no big deal). Buffer_heads can't deal with data which > spans more than a page right now. Stephen, you're so full of shit lately that it's unbelievable. You're batting a clear 0.000 so far. "struct buffer_head" can deal with pretty much any size: the only thing it cares about is bh->b_size. It so happens that if you have highmem support, then "create_bounce()" will work on a per-page thing, but that just means that you'd better have done your bouncing into low memory before you call generic_make_request(). Have you ever spent even just 5 minutes actually _looking_ at the block device layer, before you decided that you think it needs to be completely re-done some other way? It appears that you never bothered to. Sure, I would not be surprised if some device driver ends up being surprised if you start passing it different request sizes than it is used to. But that's a driver and testing issue, nothing more. (Which is not to say that "driver and testing" issues aren't important as hell: it's one of the more scary things in fact, and it can take a long time to get right if you start doing something that historically has never been done and thus has historically never gotten any testing. So I'm not saying that it should work out-of-the-box. 
But I _am_ saying that there's no point in trying to re-design upper layers that already do ALL of this with no problems at all). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:50 ` Linus Torvalds @ 2001-02-07 1:49 ` Stephen C. Tweedie 2001-02-07 2:37 ` Linus Torvalds 2001-02-07 1:51 ` Jeff V. Merkey 1 sibling, 1 reply; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-07 1:49 UTC (permalink / raw) To: Linus Torvalds Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel Hi, On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote: > > > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > > > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block > > enforces a single blocksize on all requests but that relaxing that > > requirement is no big deal). Buffer_heads can't deal with data which > > spans more than a page right now. > > "struct buffer_head" can deal with pretty much any size: the only thing it > cares about is bh->b_size. Right now, anything larger than a page is physically non-contiguous, and sorry if I didn't make that explicit, but I thought that was obvious enough that I didn't need to. We were talking about raw IO, and as long as we're doing IO out of user anonymous data allocated from individual pages, buffer_heads are limited to that page size in this context. > Have you ever spent even just 5 minutes actually _looking_ at the block > device layer, before you decided that you think it needs to be completely > re-done some other way? It appears that you never bothered to. Yes. We still have this fundamental property: if a user sends in a 128kB IO, we end up having to split it up into buffer_heads and doing a separate submit_bh() on each single one. Given our VM, PAGE_SIZE (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in this case. THAT is the overhead that I'm talking about: having to split a large IO into small chunks, each of which just ends up having to be merged back again into a single struct request by the *make_request code. 
A constructed IO request basically doesn't care about anything in the buffer_head except for the data pointer and size, and the completion status info and callback. All of the physical IO description is in the struct request by this point. The chain of buffer_heads is carrying around a huge amount of information which isn't used by the IO, and if the caller is something like the raw IO driver which isn't using the buffer cache, that extra buffer_head data is just overhead. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:49 ` Stephen C. Tweedie @ 2001-02-07 2:37 ` Linus Torvalds 2001-02-07 14:52 ` Stephen C. Tweedie 2001-02-07 19:12 ` Richard Gooch 0 siblings, 2 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-07 2:37 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > > "struct buffer_head" can deal with pretty much any size: the only thing it > > cares about is bh->b_size. > > Right now, anything larger than a page is physically non-contiguous, > and sorry if I didn't make that explicit, but I thought that was > obvious enough that I didn't need to. We were talking about raw IO, > and as long as we're doing IO out of user anonymous data allocated > from individual pages, buffer_heads are limited to that page size in > this context. Sure. That's obviously also one of the reasons why the IO layer has never seen bigger requests anyway - the data _does_ tend to be fundamentally broken up into page-size entities, if for no other reason than that is how user-space sees memory. However, I really _do_ want to have the page cache have a bigger granularity than the smallest memory mapping size, and there are always special cases that might be able to generate IO in bigger chunks (ie in-kernel services etc) > Yes. We still have this fundamental property: if a user sends in a > 128kB IO, we end up having to split it up into buffer_heads and doing > a separate submit_bh() on each single one. Given our VM, PAGE_SIZE > (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in > this case. Absolutely. And this is independent of what kind of interface we end up using, whether it be kiobuf or just plain "struct buffer_head". In that respect they are equivalent. 
> THAT is the overhead that I'm talking about: having to split a large > IO into small chunks, each of which just ends up having to be merged > back again into a single struct request by the *make_request code. You could easily just generate the bh then and there, if you wanted to. Your overhead comes from the fact that you want to gather the IO together. And I'm saying that you _shouldn't_ gather the IO. There's no point. The gathering is sufficiently done by the low-level code anyway, and I've tried to explain why the low-level code _has_ to do that work regardless of what upper layers do. You need to generate a separate sg entry for each page anyway. So why not just use the existing one? The "struct buffer_head". Which already _handles_ all the issues that you have complained are hard to handle. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 2:37 ` Linus Torvalds @ 2001-02-07 14:52 ` Stephen C. Tweedie 2001-02-07 19:12 ` Richard Gooch 1 sibling, 0 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-07 14:52 UTC (permalink / raw) To: Linus Torvalds Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel Hi, On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote: > > > However, I really _do_ want to have the page cache have a bigger > granularity than the smallest memory mapping size, and there are always > special cases that might be able to generate IO in bigger chunks (ie > in-kernel services etc) No argument there. > > Yes. We still have this fundamental property: if a user sends in a > > 128kB IO, we end up having to split it up into buffer_heads and doing > > a separate submit_bh() on each single one. Given our VM, PAGE_SIZE > > (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in > > this case. > > Absolutely. And this is independent of what kind of interface we end up > using, whether it be kiobuf of just plain "struct buffer_head". In that > respect they are equivalent. Sorry? I'm not sure where communication is breaking down here, but we really don't seem to be talking about the same things. SGI's kiobuf request patches already let us pass a large IO through the request layer in a single unit without having to split it up to squeeze it through the API. > > THAT is the overhead that I'm talking about: having to split a large > > IO into small chunks, each of which just ends up having to be merged > > back again into a single struct request by the *make_request code. > > You could easily just generate the bh then and there, if you wanted to. In the current 2.4 tree, we already do: brw_kiovec creates the temporary buffer_heads on demand to feed them to the IO layers. 
> Your overhead comes from the fact that you want to gather the IO together. > And I'm saying that you _shouldn't_ gather the IO. There's no point. I don't --- the underlying layer does. And that is where the overhead is: for every single large IO being created by the higher layers, make_request is doing a dozen or more merges because I can only feed the IO through make_request in tiny pieces. > The > gathering is sufficiently done by the low-level code anyway, and I've > tried to explain why the low-level code _has_ to do that work regardless > of what upper layers do. I know. The problem is the low-level code doing it a hundred times for a single injected IO. > You need to generate a separate sg entry for each page anyway. So why not > just use the existing one? The "struct buffer_head". Which already > _handles_ all the issues that you have complained are hard to handle. Two issues here. First is that the buffer_head is an enormously heavyweight object for a sg-list fragment. It contains a ton of fields of interest only to the buffer cache. We could mitigate this to some extent by ensuring that the relevant fields for IO (rsector, size, req_next, state, data, page etc) were in a single cache line. Secondly, the cost of adding each single buffer_head to the request list is O(n) in the number of requests already on the list. We end up walking potentially the entire request queue before finding the request to merge against, and we do that again and again, once for every single buffer_head in the list. We do this even if the caller went in via a multi-bh ll_rw_block() call in which case we know in advance that all of the buffer_heads are contiguous on disk. There is a side problem: right now, things like raid remapping occur during generic_make_request, before we have a request built. 
That means that all of the raid0 remapping or raid1/5 request expanding is being done on a per-buffer_head, not per-request, basis, so again we're doing a whole lot of unnecessary duplicate work when an IO larger than a buffer_head is submitted. If you really don't mind the size of the buffer_head as a sg fragment header, then at least I'd like us to be able to submit a pre-built chain of bh's all at once without having to go through the remap/merge cost for each single bh. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 2:37 ` Linus Torvalds 2001-02-07 14:52 ` Stephen C. Tweedie @ 2001-02-07 19:12 ` Richard Gooch 2001-02-07 20:03 ` Stephen C. Tweedie 1 sibling, 1 reply; 186+ messages in thread From: Richard Gooch @ 2001-02-07 19:12 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Linus Torvalds, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel Stephen C. Tweedie writes: > Hi, > > On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote: > > Absolutely. And this is independent of what kind of interface we end up > > using, whether it be kiobuf of just plain "struct buffer_head". In that > > respect they are equivalent. > > Sorry? I'm not sure where communication is breaking down here, but > we really don't seem to be talking about the same things. SGI's > kiobuf request patches already let us pass a large IO through the > request layer in a single unit without having to split it up to > squeeze it through the API. Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB buffer_head is effectively the same thing. > If you really don't mind the size of the buffer_head as a sg fragment > header, then at least I'd like us to be able to submit a pre-built > chain of bh's all at once without having to go through the remap/merge > cost for each single bh. Even if you are limited to feeding one buffer_head at a time, the merge costs should be somewhat mitigated, since you'll decrease your calls into the API by a factor of 8 or 16. But an API extension to allow passing a pre-built chain would be even better. Hopefully I haven't missed the point. I've got the flu so I'm not running on all 4 cylinders :-( Regards, Richard.... 
Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 19:12 ` Richard Gooch @ 2001-02-07 20:03 ` Stephen C. Tweedie 0 siblings, 0 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-07 20:03 UTC (permalink / raw) To: Richard Gooch Cc: Stephen C. Tweedie, Linus Torvalds, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel Hi, On Wed, Feb 07, 2001 at 12:12:44PM -0700, Richard Gooch wrote: > Stephen C. Tweedie writes: > > > > Sorry? I'm not sure where communication is breaking down here, but > > we really don't seem to be talking about the same things. SGI's > > kiobuf request patches already let us pass a large IO through the > > request layer in a single unit without having to split it up to > > squeeze it through the API. > > Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you > don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB > buffer_head is effectively the same thing. kiobufs let you encode _any_ contiguous region of user VA or of an inode's page cache contents in one kiobuf, no matter how many pages there are in it. A write of a megabyte to a raw device can be encoded as a single kiobuf if we want to pass the entire 1MB IO down to the block layers untouched. That's what the page vector in the kiobuf is for. Doing the same thing with buffer_heads would still require a couple of hundred of them, and you'd have to submit each such buffer_head to the IO subsystem independently. And then the IO layer will just have to reassemble them on the other side (and it may have to scan the device's entire request queue once for every single buffer_head to do so). > But an API extension to allow passing a pre-built chain would be even > better. Yep. 
--Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:50 ` Linus Torvalds 2001-02-07 1:49 ` Stephen C. Tweedie @ 2001-02-07 1:51 ` Jeff V. Merkey 2001-02-07 1:01 ` Ingo Molnar 2001-02-07 1:02 ` Jens Axboe 1 sibling, 2 replies; 186+ messages in thread From: Jeff V. Merkey @ 2001-02-07 1:51 UTC (permalink / raw) To: Linus Torvalds Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote: > > > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > > > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block > > enforces a single blocksize on all requests but that relaxing that > > requirement is no big deal). Buffer_heads can't deal with data which > > spans more than a page right now. > > Stephen, you're so full of shit lately that it's unbelievable. You're > batting a clear 0.000 so far. > > "struct buffer_head" can deal with pretty much any size: the only thing it > cares about is bh->b_size. > > It so happens that if you have highmem support, then "create_bounce()" > will work on a per-page thing, but that just means that you'd better have > done your bouncing into low memory before you call generic_make_request(). > > Have you ever spent even just 5 minutes actually _looking_ at the block > device layer, before you decided that you think it needs to be completely > re-done some other way? It appears that you never bothered to. > > Sure, I would not be surprised if some device driver ends up being > surpised if you start passing it different request sizes than it is used > to. But that's a driver and testing issue, nothing more. 
> > (Which is not to say that "driver and testing" issues aren't important as > hell: it's one of the more scary things in fact, and it can take a long > time to get right if you start doing somehting that historically has never > been done and thus has historically never gotten any testing. So I'm not > saying that it should work out-of-the-box. But I _am_ saying that there's > no point in trying to re-design upper layers that already do ALL of this > with no problems at all). > > Linus > I remember Linus asking to try this variable buffer head chaining thing 512-1024-512 kind of stuff several months back, and mixing them to see what would happen -- result: about half the drivers break with it. The interface allows you to do it, I've tried it (it works on Andre's drivers, but a lot of SCSI drivers break), but a lot of drivers seem to have assumptions about these things all being the same size in a buffer head chain. :-) Jeff > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:51 ` Jeff V. Merkey @ 2001-02-07 1:01 ` Ingo Molnar 2001-02-07 1:59 ` Jeff V. Merkey 2001-02-07 1:02 ` Jens Axboe 1 sibling, 1 reply; 186+ messages in thread From: Ingo Molnar @ 2001-02-07 1:01 UTC (permalink / raw) To: Jeff V. Merkey Cc: Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Tue, 6 Feb 2001, Jeff V. Merkey wrote: > I remember Linus asking to try this variable buffer head chaining > thing 512-1024-512 kind of stuff several months back, and mixing them > to see what would happen -- result. About half the drivers break with > it. [...] time to fix them then - instead of rewriting the rest of the kernel ;-) Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:01 ` Ingo Molnar @ 2001-02-07 1:59 ` Jeff V. Merkey 0 siblings, 0 replies; 186+ messages in thread From: Jeff V. Merkey @ 2001-02-07 1:59 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, Feb 07, 2001 at 02:01:54AM +0100, Ingo Molnar wrote: > > On Tue, 6 Feb 2001, Jeff V. Merkey wrote: > > > I remember Linus asking to try this variable buffer head chaining > > thing 512-1024-512 kind of stuff several months back, and mixing them > > to see what would happen -- result. About half the drivers break with > > it. [...] > > time to fix them then - instead of rewriting the rest of the kernel ;-) > > Ingo I agree. Jeff - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:51 ` Jeff V. Merkey 2001-02-07 1:01 ` Ingo Molnar @ 2001-02-07 1:02 ` Jens Axboe 2001-02-07 1:19 ` Linus Torvalds 2001-02-07 2:00 ` Jeff V. Merkey 1 sibling, 2 replies; 186+ messages in thread From: Jens Axboe @ 2001-02-07 1:02 UTC (permalink / raw) To: Jeff V. Merkey Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Tue, Feb 06 2001, Jeff V. Merkey wrote: > I remember Linus asking to try this variable buffer head chaining > thing 512-1024-512 kind of stuff several months back, and mixing them to > see what would happen -- result. About half the drivers break with it. > The interface allows you to do it, I've tried it, (works on Andre's > drivers, but a lot of SCSI drivers break) but a lot of drivers seem to > have assumptions about these things all being the same size in a > buffer head chain. I don't see anything that would break doing this; in fact you can do this as long as the buffers are all at least a multiple of the block size. All the drivers I've inspected handle this fine, and no one assumes that rq->bh->b_size is the same in all the buffers attached to the request. This includes SCSI (scsi_lib.c builds sg tables), IDE, and the Compaq array + Mylex driver. This mostly leaves the "old-style" drivers using CURRENT etc; the kernel helpers for these handle it as well. So I would appreciate pointers to these devices that break so we can inspect them. -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:02 ` Jens Axboe @ 2001-02-07 1:19 ` Linus Torvalds 2001-02-07 1:39 ` Jens Axboe 2001-02-07 2:00 ` Jeff V. Merkey 1 sibling, 1 reply; 186+ messages in thread From: Linus Torvalds @ 2001-02-07 1:19 UTC (permalink / raw) To: Jens Axboe Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, 7 Feb 2001, Jens Axboe wrote: > > I don't see anything that would break doing this, in fact you can > do this as long as the buffers are all at least a multiple of the > block size. All the drivers I've inspected handle this fine, noone > assumes that rq->bh->b_size is the same in all the buffers attached > to the request. It's really easy to get this wrong when going forward in the request list: you need to make sure that you update "request->current_nr_sectors" each time you move on to the next bh. I would not be surprised if some of them have been seriously buggered. On the other hand, I would _also_ not be surprised if we've actually fixed a lot of them: one of the things that the RAID code and loopback testing does is exactly to get these kinds of issues right (not this exact one, but similar ones). And let's remember things like the old ultrastor driver that was totally unable to handle anything but 1kB devices etc. I would not be _totally_ surprised if it turns out that there are still drivers out there that remember the time when Linux only ever had 1kB buffers. Even if it is 7 years ago or so ;) (Also, there might be drivers that are "optimized" - they set the IO length once per request, and just never set it again as they do partial end_io() calls.) None of those kinds of issues would ever be found under normal load, so I would be _really_ nervous about just turning it on silently. 
This is all very much a 2.5.x-kind of thing ;) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:19 ` Linus Torvalds @ 2001-02-07 1:39 ` Jens Axboe 2001-02-07 1:45 ` Linus Torvalds 0 siblings, 1 reply; 186+ messages in thread From: Jens Axboe @ 2001-02-07 1:39 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Tue, Feb 06 2001, Linus Torvalds wrote: > > I don't see anything that would break doing this, in fact you can > > do this as long as the buffers are all at least a multiple of the > > block size. All the drivers I've inspected handle this fine, noone > > assumes that rq->bh->b_size is the same in all the buffers attached > > to the request. > > It's really easy to get this wrong when going forward in the request list: > you need to make sure that you update "request->current_nr_sectors" each > time you move on to the next bh. > > I would not be surprised if some of them have been seriously buggered. Maybe have been, but it looks good at least with the general drivers that I mentioned. > [...] so I would be _really_ nervous about just turning it on > silently. This is all very much a 2.5.x-kind of thing ;) Then you might want to apply this :-) --- drivers/block/ll_rw_blk.c~ Wed Feb 7 02:38:31 2001 +++ drivers/block/ll_rw_blk.c Wed Feb 7 02:38:42 2001 @@ -1048,7 +1048,7 @@ /* Verify requested block sizes. */ for (i = 0; i < nr; i++) { struct buffer_head *bh = bhs[i]; - if (bh->b_size % correct_size) { + if (bh->b_size != correct_size) { printk(KERN_NOTICE "ll_rw_block: device %s: " "only %d-char blocks implemented (%u)\n", kdevname(bhs[0]->b_dev), -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:39 ` Jens Axboe @ 2001-02-07 1:45 ` Linus Torvalds 2001-02-07 1:55 ` Jens Axboe 2001-02-07 9:10 ` David Howells 0 siblings, 2 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-07 1:45 UTC (permalink / raw) To: Jens Axboe Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, 7 Feb 2001, Jens Axboe wrote: > > > [...] so I would be _really_ nervous about just turning it on > > silently. This is all very much a 2.5.x-kind of thing ;) > > Then you might want to apply this :-) > > --- drivers/block/ll_rw_blk.c~ Wed Feb 7 02:38:31 2001 > +++ drivers/block/ll_rw_blk.c Wed Feb 7 02:38:42 2001 > @@ -1048,7 +1048,7 @@ > /* Verify requested block sizes. */ > for (i = 0; i < nr; i++) { > struct buffer_head *bh = bhs[i]; > - if (bh->b_size % correct_size) { > + if (bh->b_size != correct_size) { > printk(KERN_NOTICE "ll_rw_block: device %s: " > "only %d-char blocks implemented (%u)\n", > kdevname(bhs[0]->b_dev), Actually, I'd rather leave it in, but speed it up with the saner and faster if (bh->b_size & (correct_size-1)) { ... That way people who _want_ to test the odd-size thing can do so. And normal code (that never generates requests on any other size than the "native" size) won't ever notice either way. (Oh, we'll eventually need to move to "correct_size == hardware blocksize", not the "virtual blocksize" that it is now. As it is, a tester needs to set the soft-blk size by hand now). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:45 ` Linus Torvalds @ 2001-02-07 1:55 ` Jens Axboe 2001-02-07 9:10 ` David Howells 1 sibling, 0 replies; 186+ messages in thread From: Jens Axboe @ 2001-02-07 1:55 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Tue, Feb 06 2001, Linus Torvalds wrote: > > > [...] so I would be _really_ nervous about just turning it on > > > silently. This is all very much a 2.5.x-kind of thing ;) > > > > Then you might want to apply this :-) > > > > --- drivers/block/ll_rw_blk.c~ Wed Feb 7 02:38:31 2001 > > +++ drivers/block/ll_rw_blk.c Wed Feb 7 02:38:42 2001 > > @@ -1048,7 +1048,7 @@ > > /* Verify requested block sizes. */ > > for (i = 0; i < nr; i++) { > > struct buffer_head *bh = bhs[i]; > > - if (bh->b_size % correct_size) { > > + if (bh->b_size != correct_size) { > > printk(KERN_NOTICE "ll_rw_block: device %s: " > > "only %d-char blocks implemented (%u)\n", > > kdevname(bhs[0]->b_dev), > > Actually, I'd rather leave it in, but speed it up with the saner and > faster > > if (bh->b_size & (correct_size-1)) { > ... > > That way people who _want_ to test the odd-size thing can do so. And > normal code (that never generates requests on any other size than the > "native" size) won't ever notice either way. Fine, as I said I didn't spot anything bad so that's why it was changed. > (Oh, we'll eventually need to move to "correct_size == hardware > blocksize", not the "virtual blocksize" that it is now. As it it a tester > needs to set the soft-blk size by hand now). Exactly, wrt earlier mail about submitting < hw block size requests to the lower levels. 
-- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:45 ` Linus Torvalds 2001-02-07 1:55 ` Jens Axboe @ 2001-02-07 9:10 ` David Howells 2001-02-07 12:16 ` Stephen C. Tweedie 1 sibling, 1 reply; 186+ messages in thread From: David Howells @ 2001-02-07 9:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jens Axboe, linux-kernel, kiobuf-io-devel Linus Torvalds <torvalds@transmeta.com> wrote: > Actually, I'd rather leave it in, but speed it up with the saner and > faster > > if (bh->b_size & (correct_size-1)) { I presume that correct_size will always be a power of 2... David - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 9:10 ` David Howells @ 2001-02-07 12:16 ` Stephen C. Tweedie 0 siblings, 0 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-07 12:16 UTC (permalink / raw) To: David Howells; +Cc: Linus Torvalds, Jens Axboe, linux-kernel, kiobuf-io-devel Hi, On Wed, Feb 07, 2001 at 09:10:32AM +0000, David Howells wrote: > > I presume that correct_size will always be a power of 2... Yes. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:02 ` Jens Axboe 2001-02-07 1:19 ` Linus Torvalds @ 2001-02-07 2:00 ` Jeff V. Merkey 2001-02-07 1:06 ` Ingo Molnar 2001-02-07 1:08 ` Jens Axboe 1 sibling, 2 replies; 186+ messages in thread From: Jeff V. Merkey @ 2001-02-07 2:00 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, Feb 07, 2001 at 02:02:21AM +0100, Jens Axboe wrote: > On Tue, Feb 06 2001, Jeff V. Merkey wrote: > > I remember Linus asking to try this variable buffer head chaining > > thing 512-1024-512 kind of stuff several months back, and mixing them to > > see what would happen -- result. About half the drivers break with it. > > The interface allows you to do it, I've tried it, (works on Andre's > > drivers, but a lot of SCSI drivers break) but a lot of drivers seem to > > have assumptions about these things all being the same size in a > > buffer head chain. > > I don't see anything that would break doing this, in fact you can > do this as long as the buffers are all at least a multiple of the > block size. All the drivers I've inspected handle this fine, noone > assumes that rq->bh->b_size is the same in all the buffers attached > to the request. This includes SCSI (scsi_lib.c builds sg tables), > IDE, and the Compaq array + Mylex driver. This mostly leaves the > "old-style" drivers using CURRENT etc, the kernel helpers for these > handle it as well. > > So I would appreciate pointers to these devices that break so we > can inspect them. > > -- > Jens Axboe Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it. Jeff - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 2:00 ` Jeff V. Merkey @ 2001-02-07 1:06 ` Ingo Molnar 2001-02-07 1:09 ` Jens Axboe ` (2 more replies) 2001-02-07 1:08 ` Jens Axboe 1 sibling, 3 replies; 186+ messages in thread From: Ingo Molnar @ 2001-02-07 1:06 UTC (permalink / raw) To: Jeff V. Merkey Cc: Jens Axboe, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Tue, 6 Feb 2001, Jeff V. Merkey wrote: > > I don't see anything that would break doing this, in fact you can > > do this as long as the buffers are all at least a multiple of the > > block size. All the drivers I've inspected handle this fine, noone > > assumes that rq->bh->b_size is the same in all the buffers attached > > to the request. This includes SCSI (scsi_lib.c builds sg tables), > > IDE, and the Compaq array + Mylex driver. This mostly leaves the > > "old-style" drivers using CURRENT etc, the kernel helpers for these > > handle it as well. > > > > So I would appreciate pointers to these devices that break so we > > can inspect them. > > > > -- > > Jens Axboe > > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it. most likely some coding error on your side. buffer-size mismatches should show up as filesystem corruption or random DMA scribble, not in-driver oopses. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:06 ` Ingo Molnar @ 2001-02-07 1:09 ` Jens Axboe 2001-02-07 1:11 ` Ingo Molnar 2001-02-07 1:26 ` Linus Torvalds 2001-02-07 2:07 ` Jeff V. Merkey 2 siblings, 1 reply; 186+ messages in thread From: Jens Axboe @ 2001-02-07 1:09 UTC (permalink / raw) To: Ingo Molnar Cc: Jeff V. Merkey, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, Feb 07 2001, Ingo Molnar wrote: > > > So I would appreciate pointers to these devices that break so we > > > can inspect them. > > > > > > -- > > > Jens Axboe > > > > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it. > > most likely some coding error on your side. buffer-size mismatches should > show up as filesystem corruption or random DMA scribble, not in-driver > oopses. I would suspect so, aic7xxx shouldn't care about anything except the sg entries and I would seriously doubt that it makes any such assumptions on them :-) -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:09 ` Jens Axboe @ 2001-02-07 1:11 ` Ingo Molnar 0 siblings, 0 replies; 186+ messages in thread From: Ingo Molnar @ 2001-02-07 1:11 UTC (permalink / raw) To: Jens Axboe Cc: Jeff V. Merkey, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, 7 Feb 2001, Jens Axboe wrote: > > > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it. > > > > most likely some coding error on your side. buffer-size mismatches should > > show up as filesystem corruption or random DMA scribble, not in-driver > > oopses. > > I would suspect so, aic7xxx shouldn't care about anything except the > sg entries and I would seriously doubt that it makes any such > assumptions on them :-) yep - and not a single reference to b_size in aic7xxx.c. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:06 ` Ingo Molnar 2001-02-07 1:09 ` Jens Axboe @ 2001-02-07 1:26 ` Linus Torvalds 2001-02-07 2:07 ` Jeff V. Merkey 2 siblings, 0 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-07 1:26 UTC (permalink / raw) To: Ingo Molnar Cc: Jeff V. Merkey, Jens Axboe, Stephen C. Tweedie, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, 7 Feb 2001, Ingo Molnar wrote: > > most likely some coding error on your side. buffer-size mismatches should > show up as filesystem corruption or random DMA scribble, not in-driver > oopses. I'm not sure. If I was a driver writer (and I'm happy those days are mostly behind me ;), I would not be totally dis-inclined to check for various limits and things. There can be hardware out there that simply has trouble with non-native alignment, ie be unhappy about getting a 1kB request that is aligned in memory at a 512-byte boundary. So there are real reasons why drivers might need updating. Don't dismiss the concerns out-of-hand. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:06 ` Ingo Molnar 2001-02-07 1:09 ` Jens Axboe 2001-02-07 1:26 ` Linus Torvalds @ 2001-02-07 2:07 ` Jeff V. Merkey 2 siblings, 0 replies; 186+ messages in thread From: Jeff V. Merkey @ 2001-02-07 2:07 UTC (permalink / raw) To: Ingo Molnar Cc: Jens Axboe, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, Feb 07, 2001 at 02:06:27AM +0100, Ingo Molnar wrote: > > On Tue, 6 Feb 2001, Jeff V. Merkey wrote: > > > > I don't see anything that would break doing this, in fact you can > > > do this as long as the buffers are all at least a multiple of the > > > block size. All the drivers I've inspected handle this fine, noone > > > assumes that rq->bh->b_size is the same in all the buffers attached > > > to the request. This includes SCSI (scsi_lib.c builds sg tables), > > > IDE, and the Compaq array + Mylex driver. This mostly leaves the > > > "old-style" drivers using CURRENT etc, the kernel helpers for these > > > handle it as well. > > > > > > So I would appreciate pointers to these devices that break so we > > > can inspect them. > > > > > > -- > > > Jens Axboe > > > > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it. > > most likely some coding error on your side. buffer-size mismatches should > show up as filesystem corruption or random DMA scribble, not in-driver > oopses. > > Ingo Oops was in my code, but was caused by these drivers. The Adaptec driver did have an oops at its own code address; AIC7XXX crashed in my code. Jeff - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 2:00 ` Jeff V. Merkey 2001-02-07 1:06 ` Ingo Molnar @ 2001-02-07 1:08 ` Jens Axboe 2001-02-07 2:08 ` Jeff V. Merkey 1 sibling, 1 reply; 186+ messages in thread From: Jens Axboe @ 2001-02-07 1:08 UTC (permalink / raw) To: Jeff V. Merkey Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Tue, Feb 06 2001, Jeff V. Merkey wrote: > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it. Do you still have this oops? -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:08 ` Jens Axboe @ 2001-02-07 2:08 ` Jeff V. Merkey 0 siblings, 0 replies; 186+ messages in thread From: Jeff V. Merkey @ 2001-02-07 2:08 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, Feb 07, 2001 at 02:08:53AM +0100, Jens Axboe wrote: > On Tue, Feb 06 2001, Jeff V. Merkey wrote: > > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it. > > Do you still have this oops? > I can recreate. Will work on it tomorrow. SCI testing today. Jeff > -- > Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:36 ` Stephen C. Tweedie 2001-02-07 0:50 ` Linus Torvalds @ 2001-02-07 1:42 ` Jeff V. Merkey 1 sibling, 0 replies; 186+ messages in thread From: Jeff V. Merkey @ 2001-02-07 1:42 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Wed, Feb 07, 2001 at 12:36:29AM +0000, Stephen C. Tweedie wrote: > Hi, > > On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote: > > > > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > > > > No, it is a problem of the ll_rw_block interface: buffer_heads need to > > > be aligned on disk at a multiple of their buffer size. Under the Unix > > > raw IO interface it is perfectly legal to begin a 128kB IO at offset > > > 512 bytes into a device. > > > > then we should either fix this limitation, or the raw IO code should split > > the request up into several, variable-size bhs, so that the range is > > filled out optimally with aligned bhs. > > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block > enforces a single blocksize on all requests, but relaxing that > requirement is no big deal). Buffer_heads can't deal with data which > spans more than a page right now. I can handle requests larger than a page (64K) but I am not using the buffer cache in Linux. We really need an NT/NetWare like model to support the non-Unix FS's properly. i.e. a disk request should be <disk> <lba> <length> <buffer> and get rid of this fixed block stuff with buffer heads. :-) I understand that the way the elevator is implemented in Linux makes this very hard at this point to support, since it's very troublesome to handle requests that overlap sector boundaries. 
Jeff > > --Stephen > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:25 ` Ingo Molnar 2001-02-07 0:36 ` Stephen C. Tweedie @ 2001-02-07 0:42 ` Linus Torvalds 1 sibling, 0 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-07 0:42 UTC (permalink / raw) To: Ingo Molnar Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel On Tue, 6 Feb 2001, Ingo Molnar wrote: > > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > > No, it is a problem of the ll_rw_block interface: buffer_heads need to > > be aligned on disk at a multiple of their buffer size. Under the Unix > > raw IO interface it is perfectly legal to begin a 128kB IO at offset > > 512 bytes into a device. > > then we should either fix this limitation, or the raw IO code should split > the request up into several, variable-size bhs, so that the range is > filled out optimally with aligned bhs. As mentioned, no such limitation exists if you just use the right interfaces. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:21 ` Stephen C. Tweedie 2001-02-07 0:25 ` Ingo Molnar @ 2001-02-07 0:35 ` Jens Axboe 2001-02-07 0:41 ` Linus Torvalds 2 siblings, 0 replies; 186+ messages in thread From: Jens Axboe @ 2001-02-07 0:35 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Wed, Feb 07 2001, Stephen C. Tweedie wrote: > > [overhead of 512-byte bhs in the raw IO code is an artificial problem of > > the raw IO code.] > > No, it is a problem of the ll_rw_block interface: buffer_heads need to > be aligned on disk at a multiple of their buffer size. Under the Unix > raw IO interface it is perfectly legal to begin a 128kB IO at offset > 512 bytes into a device. Submitting buffers to lower layers that are not hw sector aligned can't be supported below ll_rw_blk anyway (they can, but look at the problems this has always created), and I would much rather see stuff like this handled outside of there. -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:21 ` Stephen C. Tweedie 2001-02-07 0:25 ` Ingo Molnar 2001-02-07 0:35 ` Jens Axboe @ 2001-02-07 0:41 ` Linus Torvalds 2001-02-07 1:27 ` Stephen C. Tweedie 2 siblings, 1 reply; 186+ messages in thread From: Linus Torvalds @ 2001-02-07 0:41 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote: > > > > [overhead of 512-byte bhs in the raw IO code is an artificial problem of > > the raw IO code.] > > No, it is a problem of the ll_rw_block interface: buffer_heads need to > be aligned on disk at a multiple of their buffer size. Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the traditional block device setup, where "b_blocknr" is the "virtual blocknumber" and that indeed is tied in to the block size. That's the whole _point_ of ll_rw_block() and friends - they show the device at a different "virtual blocking" level than the low-level physical accesses necessarily are. Which very much means that if you have a 4kB "view" of the device, you get a stream of 4kB blocks. Not 4kB sized blocks at 512-byte offsets (or whatever the hardware blocking size is). This way the interfaces are independent of the hardware blocksize. Which is logical and what you'd expect. You need to go to a lower level to see those kinds of blocking issues. But it is _not_ true of "generic_make_request()" and the block IO layer in general. It obviously _cannot_ be true, because the block I/O layer has always had the notion of merging consecutive blocks together - regardless of whether the end result is even a power of two or anything like that in size. 
You can make an IO request for pretty much any size, as long as it's a multiple of the hardware blocksize (normally 512 bytes, but there are certainly devices out there with other blocksizes). The fact is, if you have problems like the above, then you don't understand the interfaces. And it sounds like you designed kiobuf support around the wrong set of interfaces. If you want to get at the _sector_ level, then you do

	lock_bh();
	bh->b_rdev = device;
	bh->b_rsector = sector-number (where linux defines "sector" to be 512 bytes)
	bh->b_size = size in bytes (must be a multiple of 512);
	bh->b_data = pointer;
	bh->b_end_io = callback;
	generic_make_request(rw, bh);

which doesn't look all that complicated to me. What's the problem? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 0:41 ` Linus Torvalds @ 2001-02-07 1:27 ` Stephen C. Tweedie 2001-02-07 1:40 ` Linus Torvalds 0 siblings, 1 reply; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-07 1:27 UTC (permalink / raw) To: Linus Torvalds Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Hi, On Tue, Feb 06, 2001 at 04:41:21PM -0800, Linus Torvalds wrote: > > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > No, it is a problem of the ll_rw_block interface: buffer_heads need to > > be aligned on disk at a multiple of their buffer size. > > Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the > traditional block device setup, where "b_blocknr" is the "virtual > blocknumber" and that indeed is tied in to the block size. > > The fact is, if you have problems like the above, then you don't > understand the interfaces. And it sounds like you designed kiobuf support > around the wrong set of interfaces. They used the only interfaces available at the time... > If you want to get at the _sector_ level, then you do ... > which doesn't look all that complicated to me. What's the problem? Doesn't this break nastily as soon as the IO hits an LVM or soft raid device? I don't think we are safe if we create a larger-sized buffer_head which spans a raid stripe: the raid mapping is only applied once per buffer_head. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:27 ` Stephen C. Tweedie @ 2001-02-07 1:40 ` Linus Torvalds 2001-02-12 10:07 ` Jamie Lokier 0 siblings, 1 reply; 186+ messages in thread From: Linus Torvalds @ 2001-02-07 1:40 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > > > The fact is, if you have problems like the above, then you don't > > understand the interfaces. And it sounds like you designed kiobuf support > > around the wrong set of interfaces. > > They used the only interfaces available at the time... Ehh.. "generic_make_request()" goes back a _loong_ time. It used to be called just "make_request()", but all my points still stand. It's even exported to modules. As far as I know, the raid code has always used this interface exactly because raid needed to feed back the remapped stuff and get around the blocksizing in ll_rw_block(). This really isn't anything new. I _know_ it's there in 2.2.x, and I would not be surprised if it was there even in 2.0.x. > > If you want to get at the _sector_ level, then you do > ... > > which doesn't look all that complicated to me. What's the problem? > > Doesn't this break nastily as soon as the IO hits an LVM or soft raid > device? I don't think we are safe if we create a larger-sized > buffer_head which spans a raid stripe: the raid mapping is only > applied once per buffer_head. Absolutely. This is exactly what I mean by saying that low-level drivers may not actually be able to handle new cases that they've never been asked to do before - they just never saw anything like a 64kB request before or something that crossed its own alignment. But the _higher_ levels are there. And there's absolutely nothing in the design that is a real problem. 
But there's no question that you might need to fix up more than one or two low-level drivers. (The only drivers I know better are the IDE ones, and as far as I can tell they'd have no trouble at all with any of this. Most other normal drivers are likely to be in this same situation. But because I've not had a reason to test, I certainly won't guarantee even that). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 1:40 ` Linus Torvalds @ 2001-02-12 10:07 ` Jamie Lokier 0 siblings, 0 replies; 186+ messages in thread From: Jamie Lokier @ 2001-02-12 10:07 UTC (permalink / raw) To: Linus Torvalds Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar Linus Torvalds wrote: > Absolutely. This is exactly what I mean by saying that low-level drivers > may not actually be able to handle new cases that they've never been asked > to do before - they just never saw anything like a 64kB request before or > something that crossed its own alignment. > > But the _higher_ levels are there. And there's absolutely nothing in the > design that is a real problem. But there's no question that you might need > to fix up more than one or two low-level drivers. > > (The only drivers I know better are the IDE ones, and as far as I can tell > they'd have no trouble at all with any of this. Most other normal drivers > are likely to be in this same situation. But because I've not had a reason > to test, I certainly won't guarantee even that). PCI has dma_mask, which distinguishes different device capabilities. This nice interface handles 64-bit capable devices, 32-bit ones, ISA limitations (the old 16MB limit) and some other strange devices. This mask appears in block devices one way or another so that bounce buffers are used for high addresses. How about a mask for block devices which indicates the kinds of alignment and lengths that the driver can handle? For old drivers that can't be thoroughly tested, we assume the worst. Some devices have hardware limitations. Newer, tested drivers can relax the limits. It's probably not difficult to say, "this 64k request can't be handled so split it into 1k requests". 
It integrates naturally with the decision to use bounce buffers -- alignment restrictions cause copying just as high addresses cause copying. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 19:49 ` Ben LaHaise 2001-02-06 19:57 ` Ingo Molnar @ 2001-02-06 20:26 ` Linus Torvalds 1 sibling, 0 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-06 20:26 UTC (permalink / raw) To: Ben LaHaise Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Ben LaHaise wrote: > > This small correction is the crux of the problem: if it blocks, it takes > away from the ability of the process to continue doing useful work. If it > returns -EAGAIN, then that's okay, the io will be resubmitted later when > other disk io has completed. But, it should be possible to continue > servicing network requests or user io while disk io is underway. Ehh.. The support for this is actually all there already. It's just not used, because nobody asked for it. Check the "rw_ahead" variable in __make_request(). Notice how it does everything you ask for. So remind me again why we should need a whole new interface for something that already exists but isn't exported because nobody needed it? It got created for READA, but that isn't used any more. You could absolutely _trivially_ re-introduce it (along with WRITEA), but you should probably change the semantics of what happens when it doesn't get a request. Something like making "submit_bh()" return an error value for the case, instead of doing "bh->b_end_io(0..)" which is what I think it does right now. That would make it easier for the submitter to say "oh, the queue is full". This is probably all of 5 lines of code. I really think that people don't give the block device layer enough credit. 
Some of it is quite ugly due to 10 years of history, and there is certainly a lack of some interesting capabilities (there is no "barrier" operation right now to enforce ordering, for example, and it really would be sensible to support a wider set of ops than just read/write and let the ioctl's use it to pass commands too). These issues are things that I've been discussing with Jens for the last few months, and are things that he already to some degree has been toying with, and we already decided to try to do this during 2.5.x. It's already been a _lot_ of clean-up with the per-queue request lists etc, and there's more to be done in the cleanup section too. But the fact is that too many people seem to have ignored the support that IS there, and that actually works very well indeed - and is very generic. > > What more do you think your kiobuf's should be able to do? > > That's what my code is doing today. There are a ton of bh's setup for a > single kiobuf request that is issued. For something like a single 256kb > io, this is the difference between the batched io requests being passed > into submit_bh fitting in L1 cache and overflowing it. Resizable bh's > would certainly improve this. bh's _are_ resizeable. You just change bh->b_size, and you're done. Of course, you'll need to do your own memory management for the backing store. The generic bread() etc layer makes memory management simpler by having just one size per page and making "struct page" their native MM entity, but that's not really a bh issue - it's a MM issue and stems from the fact that this is how all traditional block filesystems tend to want to work. NOTE! If you do start to resize the buffer heads, please give me a ping. The code has never actually been _tested_ with anything but 512, 1024, 2048, 4096 and 8192-byte blocks. I would not be surprised at all if some low-level drivers actually have asserts that the sizes are ones they "recognize". 
The generic layer should be happy with anything that is a multiple of 512, but as with all things, you'll probably find some gotchas when you actually try something new. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 19:32 ` Linus Torvalds 2001-02-06 19:44 ` Ingo Molnar 2001-02-06 19:49 ` Ben LaHaise @ 2001-02-06 20:25 ` Christoph Hellwig 2001-02-06 20:35 ` Ingo Molnar 2001-02-06 20:59 ` Linus Torvalds 2 siblings, 2 replies; 186+ messages in thread From: Christoph Hellwig @ 2001-02-06 20:25 UTC (permalink / raw) To: Linus Torvalds Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, Feb 06, 2001 at 11:32:43AM -0800, Linus Torvalds wrote: > Traditionally, a "bh" is only _used_ for small areas, but that's not a > "bh" issue, that's a memory management issue. The code should pretty much > handle the issue of a single 64kB bh pretty much as-is, but nothing > creates them: the VM layer only creates bh's in sizes ranging from 512 > bytes to a single page. > > The IO layer could do more, but there has yet to be anybody who needed > more (becase once you hit a page-size, you tend to get into > scatter-gather, so you want to have one bh per area - and let the > low-level IO level handle the actual merging etc). Yes. That's one disadvantage blown away. The second is that bh's are two things: - a caching object - an io buffer This is not really a clean approach, and I would really like to get away from it. Christoph -- Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:25 ` Christoph Hellwig @ 2001-02-06 20:35 ` Ingo Molnar 2001-02-06 19:05 ` Marcelo Tosatti 2001-02-07 18:27 ` Christoph Hellwig 1 sibling, 2 replies; 186+ messages in thread From: Ingo Molnar @ 2001-02-06 20:35 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Christoph Hellwig wrote: > The second is that bh's are two things: > > - a caching object > - an io buffer > > This is not really a clean approach, and I would really like to get > away from it. caching bmap() blocks was a recent addition around 2.3.20, and i suggested some time ago to cache pagecache blocks via explicit entries in struct page. That would be one solution - but it creates overhead. but there isnt anything wrong with having the bhs around to cache blocks - think of it as a 'cached and recycled IO buffer entry, with the block information cached'. frankly, my quick (and limited) hack to abuse bhs to cache blocks just cannot be a reason to replace bhs ... Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:35 ` Ingo Molnar @ 2001-02-06 19:05 ` Marcelo Tosatti 2001-02-06 20:59 ` Ingo Molnar 2001-02-07 18:27 ` Christoph Hellwig 1 sibling, 1 reply; 186+ messages in thread From: Marcelo Tosatti @ 2001-02-06 19:05 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Hellwig, Linus Torvalds, Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Ingo Molnar wrote: > > On Tue, 6 Feb 2001, Christoph Hellwig wrote: > > > The second is that bh's are two things: > > > > - a caching object > > - an io buffer > > > > This is not really a clean approach, and I would really like to get > > away from it. > > Caching bmap() blocks was a recent addition around 2.3.20, and I suggested > some time ago to cache pagecache blocks via explicit entries in struct > page. That would be one solution - but it creates overhead. Think about a given number of pages which are physically contiguous on disk -- you don't need to cache the block number for each page, you just need to cache the physical block number of the first page of the "cluster". SGI's pagebuf does that, and it would be great if we had something similar in 2.5. It allows us to have fast IO clustering. > But there isn't anything wrong with having the bhs around to cache blocks - > think of it as a 'cached and recycled IO buffer entry, with the block > information cached'. Usually we need to cache only block information (for clustering), and not all the other stuff which buffer_head holds. > Frankly, my quick (and limited) hack to abuse bhs to cache blocks just > cannot be a reason to replace bhs ... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 19:05 ` Marcelo Tosatti @ 2001-02-06 20:59 ` Ingo Molnar 2001-02-06 21:20 ` Steve Lord 0 siblings, 1 reply; 186+ messages in thread From: Ingo Molnar @ 2001-02-06 20:59 UTC (permalink / raw) To: Marcelo Tosatti Cc: Christoph Hellwig, Linus Torvalds, Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Marcelo Tosatti wrote: > Think about a given number of pages which are physically contiguous on > disk -- you don't need to cache the block number for each page, you > just need to cache the physical block number of the first page of the > "cluster". Ranges are a hell of a lot more trouble to get right than page or block-sized objects - and typical access patterns are rarely 'ranged'. As long as the basic unit is not 'too small' (i.e. not 512 bytes, but something more sane, like 4096 bytes), I don't think ranging done in higher levels buys us anything valuable. And we do ranging at the request layer already ... Guess why most CPUs ended up having pages, and not "memory ranges"? It's simpler, thus faster in the common case and easier to debug. > Usually we need to cache only block information (for clustering), and > not all the other stuff which buffer_head holds. Well, the other issue is that buffer_heads hold buffer-cache details as well. But I think it's too small right now to justify any splitup - and those issues are related enough to have significant allocation-merging effects. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:59 ` Ingo Molnar @ 2001-02-06 21:20 ` Steve Lord 0 siblings, 0 replies; 186+ messages in thread From: Steve Lord @ 2001-02-06 21:20 UTC (permalink / raw) To: mingo Cc: Marcelo Tosatti, Christoph Hellwig, Linus Torvalds, Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar > > On Tue, 6 Feb 2001, Marcelo Tosatti wrote: > > > Think about a given number of pages which are physically contiguous on > > disk -- you don't need to cache the block number for each page, you > > just need to cache the physical block number of the first page of the > > "cluster". > > Ranges are a hell of a lot more trouble to get right than page or > block-sized objects - and typical access patterns are rarely 'ranged'. As > long as the basic unit is not 'too small' (i.e. not 512 bytes, but something > more sane, like 4096 bytes), I don't think ranging done in higher levels > buys us anything valuable. And we do ranging at the request layer already > ... Guess why most CPUs ended up having pages, and not "memory ranges"? > It's simpler, thus faster in the common case and easier to debug. > > > Usually we need to cache only block information (for clustering), and > > not all the other stuff which buffer_head holds. > > Well, the other issue is that buffer_heads hold buffer-cache details as > well. But I think it's too small right now to justify any splitup - and > those issues are related enough to have significant allocation-merging > effects. > > Ingo Think about it from the point of view of being able to reduce the number of times you need to talk to the allocator in a filesystem. You can talk to the allocator about all of your readahead pages in one go, or you can do things like allocate on flush rather than allocating a page at a time (that is a bit more complex, but not too much). 
Having to talk to the allocator on a page-by-page basis is my pet peeve about the current mechanisms. Steve - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:35 ` Ingo Molnar 2001-02-06 19:05 ` Marcelo Tosatti @ 2001-02-07 18:27 ` Christoph Hellwig 1 sibling, 0 replies; 186+ messages in thread From: Christoph Hellwig @ 2001-02-07 18:27 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, Feb 06, 2001 at 09:35:58PM +0100, Ingo Molnar wrote: > Caching bmap() blocks was a recent addition around 2.3.20, and I suggested > some time ago to cache pagecache blocks via explicit entries in struct > page. That would be one solution - but it creates overhead. > > But there isn't anything wrong with having the bhs around to cache blocks - > think of it as a 'cached and recycled IO buffer entry, with the block > information cached'. I was not talking about caching physical blocks, but about the remaining buffer-cache support stuff. Christoph -- Of course it doesn't work. We've performed a software upgrade. Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:25 ` Christoph Hellwig 2001-02-06 20:35 ` Ingo Molnar @ 2001-02-06 20:59 ` Linus Torvalds 2001-02-07 18:26 ` Christoph Hellwig 1 sibling, 1 reply; 186+ messages in thread From: Linus Torvalds @ 2001-02-06 20:59 UTC (permalink / raw) To: Christoph Hellwig Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Christoph Hellwig wrote: > > The second is that bh's are two things: > > - a caching object > - an io buffer Actually, they really aren't. They kind of _used_ to be, but more and more they've moved away from that historical use. Check in particular the page cache, and as a really extreme case the swap cache version of the page cache. It certainly _used_ to be true that "bh"s were actually first-class memory management citizens, and actually had a data buffer and a cache associated with them. And because of that historical baggage, that's how many people still think of them. These days, it's really not true any more. A "bh" doesn't really have an IO buffer intrinsically associated with it any more - all memory management is done on a _page_ level, and it really works the other way around, i.e. a page can have one or more bh's associated with it as the IO entity. This _does_ show up in the bh itself: you find that bh's end up having the bh->b_page pointer in it, which is really a layering violation these days, but you'll notice that it's actually not used very much, and it could probably be largely removed. The most fundamental use of it (from an IO standpoint) is actually to handle high memory issues, because high-memory handling is very fundamentally based on "struct page", and in order to be able to have high-memory IO buffers you absolutely have to have the "struct page" the way things are done now. 
(all the other uses tend to not be IO-related at all: they are stuff like the callbacks that want to find the page that should be free'd up) The other part of "struct bh" is that it _does_ have support for fast lookups, and the bh hashing. Again, from a pure IO standpoint you can easily choose to just ignore this. It's often not used at all (in fact, _most_ bh's aren't hashed, because the only way to find them is through the page cache). > This is not really a clean approach, and I would really like to > get away from it. Trust me, you really _can_ get away from it. It's not designed into the bh's at all. You can already just allocate a single (or multiple) "struct buffer_head" and just use them as IO objects, and give them your _own_ pointers to the IO buffer etc. In fact, if you look at how the page cache is organized, this is what the page cache already does. The page cache has its own IO buffer (the page itself), and it just uses "struct buffer_head" to allocate temporary IO entities. It _also_ uses the "struct buffer_head" to cache the meta-data in the sense of having the buffer head also contain the physical address on disk so that the page cache doesn't have to ask the low-level filesystem all the time, so in that sense it actually has a double use for it. But you can (and _should_) think of that as a "we got the meta-data address caching for free, and it fit with our historical use, so why not use it?". So you can easily do the equivalent of - maintain your own buffers (possibly by looking up pages directly from user space, if you want to do zero-copy kind of things) - allocate a private buffer head ("get_unused_buffer_head()") - make that buffer head point into your buffer - submit the IO by just calling "submit_bh()", using the b_end_io() callback as your way to maintain _your_ IO buffer ownership. In particular, think of the things that you do NOT have to do: - you do NOT have to allocate a bh-private buffer. Just point the bh at your own buffer. 
- you do NOT have to "give" your buffer to the bh. You do, of course, want to know when the bh is done with _your_ buffer, but that's what the b_end_io callback is all about. - you do NOT have to hash the bh you allocated and thus expose it to anybody else. It is YOUR private bh, and it does not show up on ANY other lists. There are various helper functions to insert the bh on various global lists ("mark_bh_dirty()" to put it on the dirty list, "buffer_insert_inode_queue()" to put it on the inode lists etc), but there is nothing in the thing that _forces_ you to expose your bh. So don't think of "bh->b_data" as being something that the bh owns. It's just a pointer. Think of "bh->b_data" and "bh->b_size" as _nothing_ more than a data range in memory. In short, you can, and often should, think of "struct buffer_head" as nothing but an IO entity. It has some support for being more than that, but that's secondary. That can validly be seen as another layer, that is just so common that there is little point in splitting it up (and a lot of purely historical reasons for not splitting it). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:59 ` Linus Torvalds @ 2001-02-07 18:26 ` Christoph Hellwig 2001-02-07 18:36 ` Linus Torvalds 0 siblings, 1 reply; 186+ messages in thread From: Christoph Hellwig @ 2001-02-07 18:26 UTC (permalink / raw) To: Linus Torvalds Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote: > > > On Tue, 6 Feb 2001, Christoph Hellwig wrote: > > > > The second is that bh's are two things: > > > > - a caching object > > - an io buffer > > Actually, they really aren't. > > They kind of _used_ to be, but more and more they've moved away from that > historical use. Check in particular the page cache, and as a really > extreme case the swap cache version of the page cache. Yes. And that's exactly why I think it's ugly to have the left-over caching stuff in the same data structure as the IO buffer. > It certainly _used_ to be true that "bh"s were actually first-class memory > management citizens, and actually had a data buffer and a cache associated > with them. And because of that historical baggage, that's how many people > still think of them. I do know that the pagecache is our primary cache now :) Anyway, having that caching cruft still in is ugly. > > This is not really a clean approach, and I would really like to > > get away from it. > > Trust me, you really _can_ get away from it. It's not designed into the > bh's at all. You can already just allocate a single (or multiple) "struct > buffer_head" and just use them as IO objects, and give them your _own_ > pointers to the IO buffer etc. So true. Exactly because of that the data structures should be separated as well. Christoph -- Of course it doesn't work. We've performed a software upgrade. 
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 18:26 ` Christoph Hellwig @ 2001-02-07 18:36 ` Linus Torvalds 2001-02-07 18:44 ` Christoph Hellwig 2001-02-08 0:34 ` Neil Brown 0 siblings, 2 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-07 18:36 UTC (permalink / raw) To: Christoph Hellwig Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Wed, 7 Feb 2001, Christoph Hellwig wrote: > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote: > > > > Actually, they really aren't. > > > > They kind of _used_ to be, but more and more they've moved away from that > > historical use. Check in particular the page cache, and as a really > > extreme case the swap cache version of the page cache. > > Yes. And that's exactly why I think it's ugly to have the left-over > caching stuff in the same data structure as the IO buffer. I do agree. I would not be opposed to factoring out the "pure block IO" part from the bh struct. It should not even be very hard. You'd do something like struct block_io { .. here is the stuff needed for block IO .. }; struct buffer_head { struct block_io io; .. here is the stuff needed for hashing etc .. } and then you make "generic_make_request()" and everything lower down take just the "struct block_io". You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's, because they know about bh semantics (ie things like scaling the sector number to the bh size etc). Which means that pretty much all the code outside the block layer wouldn't even _notice_. Which is a sign of good layering. If you want to do this, please do go ahead. 
But do realize that this is not exactly a 2.4.x thing ;) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 18:36 ` Linus Torvalds @ 2001-02-07 18:44 ` Christoph Hellwig 2001-02-08 0:34 ` Neil Brown 1 sibling, 0 replies; 186+ messages in thread From: Christoph Hellwig @ 2001-02-07 18:44 UTC (permalink / raw) To: Linus Torvalds Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Wed, Feb 07, 2001 at 10:36:47AM -0800, Linus Torvalds wrote: > > > On Wed, 7 Feb 2001, Christoph Hellwig wrote: > > > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote: > > > > > > Actually, they really aren't. > > > > > > They kind of _used_ to be, but more and more they've moved away from that > > > historical use. Check in particular the page cache, and as a really > > > extreme case the swap cache version of the page cache. > > > > Yes. And that's exactly why I think it's ugly to have the left-over > > caching stuff in the same data structure as the IO buffer. > > I do agree. > > I would not be opposed to factoring out the "pure block IO" part from the > bh struct. It should not even be very hard. You'd do something like > > struct block_io { > .. here is the stuff needed for block IO .. > }; > > struct buffer_head { > struct block_io io; > .. here is the stuff needed for hashing etc .. > } > > and then you make "generic_make_request()" and everything lower down take > just the "struct block_io". Yep. (besides, the name block_io sucks :)) > You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's, > because they know about bh semantics (ie things like scaling the sector > number to the bh size etc). Which means that pretty much all the code > outside the block layer wouldn't even _notice_. Which is a sign of good > layering. Yep. > If you want to do this, please do go ahead. I'll take a look at it. > But do realize that this is not exactly a 2.4.x thing ;) Sure. Christoph -- Whip me. Beat me. 
Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-07 18:36 ` Linus Torvalds 2001-02-07 18:44 ` Christoph Hellwig @ 2001-02-08 0:34 ` Neil Brown 1 sibling, 0 replies; 186+ messages in thread From: Neil Brown @ 2001-02-08 0:34 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Wednesday February 7, torvalds@transmeta.com wrote: > > > On Wed, 7 Feb 2001, Christoph Hellwig wrote: > > > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote: > > > > > > Actually, they really aren't. > > > > > > They kind of _used_ to be, but more and more they've moved away from that > > > historical use. Check in particular the page cache, and as a really > > > extreme case the swap cache version of the page cache. > > > > Yes. And that's exactly why I think it's ugly to have the left-over > > caching stuff in the same data structure as the IO buffer. > > I do agree. > > I would not be opposed to factoring out the "pure block IO" part from the > bh struct. It should not even be very hard. You'd do something like > > struct block_io { > .. here is the stuff needed for block IO .. > }; > > struct buffer_head { > struct block_io io; > .. here is the stuff needed for hashing etc .. > } > > and then you make "generic_make_request()" and everything lower down take > just the "struct block_io". > I was just thinking the same, or a similar thing. I wanted to do struct io_head { stuff }; struct buffer_head { struct io_head; more stuff; } so that, as an unnamed substructure, the content of the struct io_head would automagically be promoted to appear to be content of buffer_head. However, I then remembered (when it didn't work) that unnamed substructures are a feature of the Plan-9 C compiler, not the GNU Compiler Collection. (Any gcc coders out there think this would be a good thing to add? 
http://plan9.bell-labs.com/sys/doc/compiler.html )

Anyway, I produced the same result in a rather ugly way with #defines and modified raid5 to use 32-byte block_io structures instead of the 80+ byte buffer_heads, and it ... doesn't quite work :-( it boots fine, but raid5 dies and the Oops message is a few kilometers away. Anyway, I think the concept is fine. Patch is below for your inspection.

It occurs to me that Stephen's desire to pass lots of requests through make_request all at once isn't a bad idea and could be done by simply linking the io_heads together with b_reqnext. This would require:
 1/ all callers of generic_make_request (there are 3) to initialise b_reqnext
 2/ all registered make_request_fn functions (there are again 3, I think) to cope with following b_reqnext
It shouldn't be too hard to make the elevator code take advantage of any ordering that it finds in the list. I don't have a patch which does this.

NeilBrown

--- ./include/linux/fs.h	2001/02/07 22:45:37	1.1
+++ ./include/linux/fs.h	2001/02/07 23:09:05
@@ -207,6 +207,7 @@
 #define BH_Protected	6	/* 1 if the buffer is protected */
 
 /*
+ * THIS COMMENT NO-LONGER CORRECT.
  * Try to keep the most commonly used fields in single cache lines (16
  * bytes) to improve performance. This ordering should be
  * particularly beneficial on 32-bit processors.
@@ -217,31 +218,43 @@
  * The second 16 bytes we use for lru buffer scans, as used by
  * sync_buffers() and refill_freelist(). -- sct
  */
+
+/*
+ * io_head is all that is needed by device drivers.
+ */
+#define io_head_fields \
+	unsigned long b_state;		/* buffer state bitmap (see above) */ \
+	struct buffer_head *b_reqnext;	/* request queue */ \
+	unsigned short b_size;		/* block size */ \
+	kdev_t b_rdev;			/* Real device */ \
+	unsigned long b_rsector;	/* Real buffer location on disk */ \
+	char * b_data;			/* pointer to data block (512 byte) */ \
+	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */ \
+	void *b_private;		/* reserved for b_end_io */ \
+	struct page *b_page;		/* the page this bh is mapped to */ \
+	/* this line intensionally left blank */
+struct io_head {
+	io_head_fields
+};
+
+/* buffer_head adds all the stuff needed by the buffer cache */
 struct buffer_head {
-	/* First cache line: */
+	io_head_fields
+
 	struct buffer_head *b_next;	/* Hash queue list */
 	unsigned long b_blocknr;	/* block number */
-	unsigned short b_size;		/* block size */
 	unsigned short b_list;		/* List that this buffer appears */
 	kdev_t b_dev;			/* device (B_FREE = free) */
 	atomic_t b_count;		/* users using this block */
-	kdev_t b_rdev;			/* Real device */
-	unsigned long b_state;		/* buffer state bitmap (see above) */
 	unsigned long b_flushtime;	/* Time when (dirty) buffer should be written */
 
 	struct buffer_head *b_next_free;/* lru/free list linkage */
 	struct buffer_head *b_prev_free;/* doubly linked list of buffers */
 	struct buffer_head *b_this_page;/* circular list of buffers in one page */
-	struct buffer_head *b_reqnext;	/* request queue */
 
 	struct buffer_head **b_pprev;	/* doubly linked list of hash-queue */
-	char * b_data;			/* pointer to data block (512 byte) */
-	struct page *b_page;		/* the page this bh is mapped to */
-	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
-	void *b_private;		/* reserved for b_end_io */
-	unsigned long b_rsector;	/* Real buffer location on disk */
 	wait_queue_head_t b_wait;
 
 	struct inode *	     b_inode;
--- ./drivers/md/raid5.c	2001/02/06 05:43:31	1.2
+++ ./drivers/md/raid5.c	2001/02/07 23:15:36
@@ -151,18 +151,16 @@
 	for (i=0; i<num; i++) {
 		struct page *page;
 
-		bh = kmalloc(sizeof(struct buffer_head), priority);
+		bh = kmalloc(sizeof(struct io_head), priority);
 		if (!bh)
 			return 1;
-		memset(bh, 0, sizeof (struct buffer_head));
-		init_waitqueue_head(&bh->b_wait);
+		memset(bh, 0, sizeof (struct io_head));
 		page = alloc_page(priority);
 		bh->b_data = page_address(page);
 		if (!bh->b_data) {
 			kfree(bh);
 			return 1;
 		}
-		atomic_set(&bh->b_count, 0);
 		bh->b_page = page;
 		sh->bh_cache[i] = bh;
@@ -412,7 +410,7 @@
 			spin_lock_irqsave(&conf->device_lock, flags);
 		}
 	} else {
-		md_error(mddev_to_kdev(conf->mddev), bh->b_dev);
+		md_error(mddev_to_kdev(conf->mddev), conf->disks[i].dev);
 		clear_bit(BH_Uptodate, &bh->b_state);
 	}
 	clear_bit(BH_Lock, &bh->b_state);
@@ -440,7 +438,7 @@
 	md_spin_lock_irqsave(&conf->device_lock, flags);
 	if (!uptodate)
-		md_error(mddev_to_kdev(conf->mddev), bh->b_dev);
+		md_error(mddev_to_kdev(conf->mddev), conf->disks[i].dev);
 	clear_bit(BH_Lock, &bh->b_state);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	__release_stripe(conf, sh);
@@ -456,12 +454,10 @@
 	unsigned long block = sh->sector / (sh->size >> 9);
 
 	init_buffer(bh, raid5_end_read_request, sh);
-	bh->b_dev = conf->disks[i].dev;
 	bh->b_blocknr = block;
 
 	bh->b_state = (1 << BH_Req) | (1 << BH_Mapped);
 	bh->b_size = sh->size;
-	bh->b_list = BUF_LOCKED;
 	return bh;
 }
@@ -1085,15 +1081,14 @@
 		else bh->b_end_io = raid5_end_write_request;
 		if (conf->disks[i].operational)
-			bh->b_dev = conf->disks[i].dev;
+			bh->b_rdev = conf->disks[i].dev;
 		else if (conf->spare && action[i] == WRITE+1)
-			bh->b_dev = conf->spare->dev;
+			bh->b_rdev = conf->spare->dev;
 		else skip=1;
 		if (!skip) {
 			PRINTK("for %ld schedule op %d on disc %d\n", sh->sector, action[i]-1, i);
 			atomic_inc(&sh->count);
-			bh->b_rdev = bh->b_dev;
-			bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
+			bh->b_rsector = sh->sector;
 			generic_make_request(action[i]-1, bh);
 		} else {
 			PRINTK("skip op %d on disc %d for sector %ld\n", action[i]-1, i, sh->sector);
@@ -1502,7 +1497,7 @@
 	}
 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
-		 conf->raid_disks * ((sizeof(struct buffer_head) + PAGE_SIZE))) / 1024;
+		 conf->raid_disks * ((sizeof(struct io_head) + PAGE_SIZE))) / 1024;
 	if (grow_stripes(conf, conf->max_nr_stripes, GFP_KERNEL)) {
 		printk(KERN_ERR "raid5: couldn't allocate %dkB for buffers\n", memory);
 		shrink_stripes(conf, conf->max_nr_stripes);

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 19:11 ` Ben LaHaise ` (2 preceding siblings ...) 2001-02-06 19:32 ` Linus Torvalds @ 2001-02-06 19:46 ` Ingo Molnar 2001-02-06 20:16 ` Ben LaHaise 3 siblings, 1 reply; 186+ messages in thread From: Ingo Molnar @ 2001-02-06 19:46 UTC (permalink / raw) To: Ben LaHaise Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Ben LaHaise wrote: > > > You mentioned non-spindle base io devices in your last message. Take > > > something like a big RAM disk. Now compare kiobuf base io to buffer > > > head based io. Tell me which one is going to perform better. > > > > roughly equal performance when using 4K bhs. And a hell of a lot more > > complex and volatile code in the kiobuf case. > > I'm willing to benchmark you on this. sure. Could you specify the actual workload, and desired test-setups? Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 19:46 ` Ingo Molnar @ 2001-02-06 20:16 ` Ben LaHaise 2001-02-06 20:22 ` Ingo Molnar 0 siblings, 1 reply; 186+ messages in thread From: Ben LaHaise @ 2001-02-06 20:16 UTC (permalink / raw) To: Ingo Molnar Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Ingo Molnar wrote: > > On Tue, 6 Feb 2001, Ben LaHaise wrote: > > > > > You mentioned non-spindle base io devices in your last message. Take > > > > something like a big RAM disk. Now compare kiobuf base io to buffer > > > > head based io. Tell me which one is going to perform better. > > > > > > roughly equal performance when using 4K bhs. And a hell of a lot more > > > complex and volatile code in the kiobuf case. > > > > I'm willing to benchmark you on this. > > sure. Could you specify the actual workload, and desired test-setups? Sure. General parameters will be as follows (since I think we both have access to these machines): - 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a base install plus data files). - data to/from the ram block device must be copied within the ram block driver. - the filesystem used must be ext2. optimisations to ext2 for tweaks to the interface are permitted & encouraged. The main item I'm interested in is read (page cache cold)/synchronous write performance for blocks from 256 bytes to 16MB in powers of two, much like what I've done in testing the aio patches, which shows where improvement in latency is needed. Including a few other items on disk like the timings of find/make -s dep/bonnie/dbench is probably worthwhile to show changes in throughput. Sound fair? -ben - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 20:16 ` Ben LaHaise @ 2001-02-06 20:22 ` Ingo Molnar 0 siblings, 0 replies; 186+ messages in thread From: Ingo Molnar @ 2001-02-06 20:22 UTC (permalink / raw) To: Ben LaHaise Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Ben LaHaise wrote: > Sure. General parameters will be as follows (since I think we both have > access to these machines): > > - 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a > base install plus data files. > - data to/from the ram block device must be copied within the ram > block driver. > - the filesystem used must be ext2. optimisations to ext2 for > tweaks to the interface are permitted & encouraged. > > The main item I'm interested in is read (page cache cold)/synchronous > write performance for blocks from 256 bytes to 16MB in powers of two, > much like what I've done in testing the aio patches that shows where > improvement in latency is needed. Including a few other items on disk > like the timings of find/make -s dep/bonnie/dbench is probably to show > changes in throughput. Sound fair? yep, sounds fair. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 18:54 ` Ben LaHaise 2001-02-06 18:58 ` Ingo Molnar @ 2001-02-06 19:20 ` Linus Torvalds 1 sibling, 0 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-06 19:20 UTC (permalink / raw) To: Ben LaHaise Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar On Tue, 6 Feb 2001, Ben LaHaise wrote: > On Tue, 6 Feb 2001, Ingo Molnar wrote: > > > If you are merging based on (device, offset) values, then that's lowlevel > > - and this is what we have been doing for years. > > > > If you are merging based on (inode, offset), then it has flaws like not > > being able to merge through a loopback or stacked filesystem. > > I disagree. Loopback filesystems typically have their data contiguously > on disk and won't split up incoming requests any further. Face it. You NEED to merge and sort late. You _cannot_ do a good job early. Early on, you don't have any concept of what the final IO pattern will be: you will only have that once you've seen which requests are still pending etc, something that the higher level layers CANNOT do. Do you really want the higher levels to know about per-controller request locking etc? I don't think so. Trust me. You HAVE to do the final decisions late in the game. You absolutely _cannot_ get the best performance except for trivial and uninteresting cases (ie one process that wants to read gigabytes of data in one single stream) otherwise. (It should be pointed out, btw, that SGI etc were often interested exactly in the trivial and uninteresting cases. When you have the DoD asking you to stream satellite pictures over the net as fast as you can, money being no object, you get a rather twisted picture of what is important and what is not) And I will turn your own argument against you: if you do merging at a low level anyway, there's little point in trying to do it at a higher level. 
Higher levels should do high-level sequencing. They can (and should) do some amount of sorting - the lower levels will still do their own sort as part of the merging anyway, and the lower level sorting may actually end up being _different_ from a high-level sort because the lower levels know about the topology of the device, but higher levels giving data with "patterns" to it only make it easier for the lower levels to do a good job. So high-level sorting is not _necessary_, but it's probably a good idea. High-level merging is almost certainly not even a good idea - higher levels should try to _batch_ the requests, but that's a different issue, and is again all about giving lower levels "patterns". It can also be about simple issues like cache locality - batching things tends to make for better icache (and possibly dcache) behaviour. So you should separate out the issue of batching and merging. And you absolutely should realize that you should NOT ignore Ingo's arguments about loopback etc just because they don't fit the model you WANT them to fit. The fact is that higher levels should NOT know about things like RAID striping etc, yet that has a HUGE impact on the issue of merging (you do _not_ want to merge requests to separate disks - you'll just have to split them up again). > Here are the points I'm trying to address: > > - reduce the overhead in submitting block ios, especially for > large ios. Look at the %CPU usage differences between 512 byte > blocks and 4KB blocks, this can be better. This is often a filesystem layer issue. Design your filesystem well, and you get a lot of batching for free. You can also batch the requests - this is basically what "readahead" is. That helps a lot. But that is NOT the same thing as merging. Not at all. The "batched" read-ahead requests may actually be split up among many different disks - and they will each then get separately merged with _other_ requests to those disks. See? 
And trust me, THAT is how you get good performance. Not by merging early. By merging late, and letting the disk layers do their own thing. > - make asynchronous io possible in the block layer. This is > impossible with the current ll_rw_block scheme and io request > plugging. I'm surprised you say that. It's not only possible, but we do it all the time. What do you think the swapout and writing is? How do you think that read-ahead is actually _implemented_? Right. Read-ahead is NOT done as a "merge" operation. It's done as several asynchronous IO operations that the low-level stuff can choose (or not) to merge. What do you think happens if you do a "submit_bh()"? It's a _purely_ asynchronous operation. It turns synchronous when you wait for the bh, not before. Your argument is nonsense. > - provide a generic mechanism for reordering io requests for > devices which will benefit from this. Make it a library for > drivers to call into. IDE for example will probably make use of > it, but some high end devices do this on the controller. This > is the important point: Make it OPTIONAL. Ehh. You've just described exactly what we have. This is what the whole elevator thing _is_. It's a library of routines. You don't have to use them, and in fact many things DO NOT use them. The loopback driver, for example, doesn't bother with sorting or merging at all, because it knows that it's only supposed to pass the request on to somebody else - who will do a hell of a lot better job of it. Some high-end drivers have their own merging stuff, exactly because they don't need the overhead - you're better off just feeding the request to the controller as soon as you can, as the controller itself will do all the merging and sorting anyway. > You mentioned non-spindle based io devices in your last message. Take > something like a big RAM disk. Now compare kiobuf based io to buffer head > based io. Tell me which one is going to perform better. Buffer heads? Go and read the code. 
Sure, it has some historical baggage still, but the fact is that it works a hell of a lot better than kiobufs and it _does_ know about merging multiple requests and handling errors in the middle of one request etc. You can get the full advantage of streaming megabytes of data in one request, AND still get proper error handling if it turns out that one sector in the middle was bad. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-05 19:28 ` Linus Torvalds 2001-02-05 20:54 ` Stephen C. Tweedie @ 2001-02-06 0:31 ` Roman Zippel 2001-02-06 1:01 ` Linus Torvalds 2001-02-06 1:08 ` David S. Miller 1 sibling, 2 replies; 186+ messages in thread From: Roman Zippel @ 2001-02-06 0:31 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel Hi, On Mon, 5 Feb 2001, Linus Torvalds wrote: > This all proves that the lowest level of layering should be pretty much > nothing but the vectors. No callbacks, no crap like that. That's already a > level of abstraction away, and should not get tacked on. Your lowest level > of abstraction should be just the "area". Something like > > struct buffer { > struct page *page; > u16 offset, length; > }; > > int nr_buffers; > struct buffer *array; > > should be the low-level abstraction. Does it have to be vectors? What about lists? I've been thinking about this for some time now and I think lists are more flexible. At a higher level we can easily generate a list of pages and at a lower level you can still split them up as needed. It would be basically the same structure, but you could use it everywhere with the same kind of operations. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 0:31 ` Roman Zippel @ 2001-02-06 1:01 ` Linus Torvalds 2001-02-06 9:22 ` Roman Zippel 2001-02-06 9:30 ` Ingo Molnar 2001-02-06 1:08 ` David S. Miller 1 sibling, 2 replies; 186+ messages in thread From: Linus Torvalds @ 2001-02-06 1:01 UTC (permalink / raw) To: Roman Zippel Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel On Tue, 6 Feb 2001, Roman Zippel wrote: > > > > int nr_buffers; > > struct buffer *array; > > > > should be the low-level abstraction. > > Does it have to be vectors? What about lists? I'd prefer to avoid lists unless there is some overriding concern, like a real implementation issue. But I don't care much one way or the other - what I care about is that the setup and usage time is as low as possible. I suspect arrays are better for that. I have this strong suspicion that networking is going to be the most latency-critical and complex part of this, and the fact that the networking code wanted arrays is what makes me think that arrays are the right way to go. But talk to Davem and ank about why they wanted vectors. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 1:01 ` Linus Torvalds @ 2001-02-06 9:22 ` Roman Zippel 2001-02-06 9:30 ` Ingo Molnar 1 sibling, 0 replies; 186+ messages in thread From: Roman Zippel @ 2001-02-06 9:22 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel Hi, On Mon, 5 Feb 2001, Linus Torvalds wrote: > > Does it have to be vectors? What about lists? > > I'd prefer to avoid lists unless there is some overriding concern, like a > real implementation issue. But I don't care much one way or the other - > what I care about is that the setup and usage time is as low as possible. > I suspect arrays are better for that. I was thinking more about the higher layers. Here it's simpler to set up a list of pages which can be sent to a lower layer. In the page cache we already have per address space lists, so it would be very easy to use that. A lower layer can of course generate anything it wants out of this, e.g. it can generate sublists or vectors. bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 1:01 ` Linus Torvalds 2001-02-06 9:22 ` Roman Zippel @ 2001-02-06 9:30 ` Ingo Molnar 1 sibling, 0 replies; 186+ messages in thread From: Ingo Molnar @ 2001-02-06 9:30 UTC (permalink / raw) To: Linus Torvalds Cc: Roman Zippel, Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel On Mon, 5 Feb 2001, Linus Torvalds wrote: > [...] But talk to Davem and ank about why they wanted vectors. one issue is allocation overhead. The fragment array is a natural and constant-size part of an skb, thus we get all the control structures in place while allocating a structure that we have to allocate anyway. another issue is that certain cards have (or can have) SG-limits, so we have to be prepared to have a 'limited' array of fragments anyway, and have to be prepared to split/refragment packets. Whether there is a global MAX_SKB_FRAGS limit or not makes no difference. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-06 0:31 ` Roman Zippel 2001-02-06 1:01 ` Linus Torvalds @ 2001-02-06 1:08 ` David S. Miller 1 sibling, 0 replies; 186+ messages in thread From: David S. Miller @ 2001-02-06 1:08 UTC (permalink / raw) To: Linus Torvalds Cc: Roman Zippel, Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel Linus Torvalds writes: > But talk to Davem and ank about why they wanted vectors. SKB setup and free needs to be as light as possible. Using vectors leads to code like: skb_data_free(...) { ... for (i = 0; i < MAX_SKB_FRAGS; i++) put_page(skb_shinfo(skb)->frags[i].page); } Currently, the ZC patches have a fixed frag vector size (MAX_SKB_FRAGS). But a part of me wants this to be made dynamic (to handle HIPPI etc. properly) whereas another part of me doesn't want to do it that way because it would increase the complexity of paged SKB handling and add yet another member to the SKB structure. Later, David S. Miller davem@redhat.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-05 15:03 ` Stephen C. Tweedie 2001-02-05 15:19 ` Alan Cox @ 2001-02-05 22:09 ` Ingo Molnar 1 sibling, 0 replies; 186+ messages in thread From: Ingo Molnar @ 2001-02-05 22:09 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Manfred Spraul, Linus Torvalds, Christoph Hellwig, Steve Lord, Linux Kernel List, kiobuf-io-devel, Alan Cox On Mon, 5 Feb 2001, Stephen C. Tweedie wrote: > > Obviously the disk access itself must be sector aligned and the total > > length must be a multiple of the sector length, but there shouldn't be > > any restrictions on the data buffers. > > But there are. Many controllers just break down and corrupt things > silently if you don't align the data buffers (Jeff Merkey found this > by accident when he started generating unaligned IOs within page > boundaries in his NWFS code). And a lot of controllers simply cannot > break a sector dma over a page boundary (at least not without some > form of IOMMU remapping). so we are putting workarounds for hardware bugs into the design? Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-05 12:00 ` Manfred Spraul 2001-02-05 15:03 ` Stephen C. Tweedie @ 2001-02-05 16:56 ` Linus Torvalds 2001-02-05 17:27 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox 1 sibling, 1 reply; 186+ messages in thread From: Linus Torvalds @ 2001-02-05 16:56 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox On Mon, 5 Feb 2001, Manfred Spraul wrote: > "Stephen C. Tweedie" wrote: > > > > You simply cannot do physical disk IO on > > non-sector-aligned memory or in chunks which aren't a multiple of > > sector size. > > Why not? > > Obviously the disk access itself must be sector aligned and the total > length must be a multiple of the sector length, but there shouldn't be > any restrictions on the data buffers. In fact, regular IDE DMA allows arbitrary scatter-gather at least in theory. Linux has never used it, so I don't know how well it works in practice - I would not be surprised if it ends up causing no end of nasty corner-cases that have bugs. It's not as if IDE controllers always follow the documentation ;) The _total_ length of the buffers has to be a multiple of the sector size, and there are some alignment issues (each scatter-gather area has to be at least 16-bit aligned both in physical memory and in length, and apparently many controllers need 32-bit alignment). And I'd almost be surprised if there wouldn't be hardware that wanted cache alignment because they always expect to burst. But despite a lot of likely practical reasons why it won't work for arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it wouldn't. And there are bound to be better controllers that could handle it. 
Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-05 16:56 ` Linus Torvalds @ 2001-02-05 17:27 ` Alan Cox 0 siblings, 0 replies; 186+ messages in thread From: Alan Cox @ 2001-02-05 17:27 UTC (permalink / raw) To: Linus Torvalds Cc: Manfred Spraul, Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox > In fact, regular IDE DMA allows arbitrary scatter-gather at least in > theory. Linux has never used it, so I don't know how well it works in Purely in theory, as Jeff found out. > But despite a lot of likely practical reasons why it won't work for > arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it > wouldn't. And there are bound to be better controllers that could handle > it. I2O controllers are required to handle it (most don't) and some of the high end scsi/fc controllers even get it right. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-05 11:03 ` Stephen C. Tweedie 2001-02-05 12:00 ` Manfred Spraul @ 2001-02-05 16:36 ` Linus Torvalds 2001-02-05 19:08 ` Stephen C. Tweedie 1 sibling, 1 reply; 186+ messages in thread From: Linus Torvalds @ 2001-02-05 16:36 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox On Mon, 5 Feb 2001, Stephen C. Tweedie wrote: > > On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote: > > > > Neither the read nor the write are page-aligned. I don't know where you > > got that idea. It's obviously not true even in the common case: it depends > > _entirely_ on what the file offsets are, and expecting the offset to be > > zero is just being stupid. It's often _not_ zero. With networking it is in > > fact seldom zero, because the network packets are seldom aligned either in > > size or in location. > > The underlying buffer is. The VFS (and the current kiobuf code) is > already happy about IO happening at odd offsets within a page. Stephen. Don't bother even talking about this. You're so damn hung up about the page cache that it's not funny. Have you ever thought about other things, like networking, special devices, stuff like that? They can (and do) have packet boundaries that have nothing to do with pages what-so-ever. They can have such notions as packets that contain multiple streams in one packet, where it ends up being split up into several pieces. Where neither the original packet _nor_ the final pieces have _anything_ to do with "pages". THERE IS NO PAGE ALIGNMENT. So stop blathering about it. Of _course_ the current kiobuf code has page-alignment assumptions. You _designed_ it that way. So bringing it up as an example is a circular argument. And a really stupid one at that, as that's the thing I've been quoting as the single biggest design bug in all of kiobufs. 
It's the thing that makes them entirely useless for things like describing "struct msghdr" etc. We should get _away_ from this page-alignment fallacy. It's not true. It's not necessarily even true for the page cache - which has no real fundamental reasons any more for not being able to be a "variable-size" cache some time in the future (ie it might be a per-address-space decision on whether the granularity is 1, 2, 4 or more pages). Anything that designs for "everything is a page" will automatically be limited for cases where you might sometimes have 64kB chunks of data. Instead, just face the realization that "everything is a bunch of ranges", and leave it at that. It's true _already_ - think about fragmented IP packets. We may not handle it that way completely yet, but the zero-copy networking is going in this direction. And as long as you keep on harping about page alignment, you're not going to play in this game. End of story. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-05 16:36 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Linus Torvalds @ 2001-02-05 19:08 ` Stephen C. Tweedie 0 siblings, 0 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-05 19:08 UTC (permalink / raw) To: Linus Torvalds Cc: Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox Hi, On Mon, Feb 05, 2001 at 08:36:31AM -0800, Linus Torvalds wrote: > Have you ever thought about other things, like networking, special > devices, stuff like that? They can (and do) have packet boundaries that > have nothing to do with pages what-so-ever. They can have such notions as > packets that contain multiple streams in one packet, where it ends up > being split up into several pieces. Where neither the original packet > _nor_ the final pieces have _anything_ to do with "pages". > > THERE IS NO PAGE ALIGNMENT. And kiobufs don't require IO to be page aligned, and they have never done. The only page alignment they assume is that if a *single* scatter-gather element spans multiple pages, then the joins between those pages occur on page boundaries. Remember, a kiobuf is only designed to represent one scatter-gather fragment, not a full sg list. That was the whole reason for having a kiovec as a separate concept: if you have more than one independent fragment in the sg-list, you need more than one kiobuf. 
And the reason why we created sg fragments which can span pages was so that we can encode IOs which interact with the VM: any arbitrary virtually-contiguous user data buffer can be mapped into a *single* kiobuf for a write() call, so it's a generic way of supporting things like O_DIRECT without the IO layers having to know anything about VM (and Ben's async IO patches also use kiobufs in this way to allow read()s to write to the user's data buffer once the IO completes, without having to have a context switch back into that user's context.) Similarly, any extent of a file in the page cache can be encoded in a single kiobuf. And no, the simpler networking-style sg-list does not cut it for block device IO, because for block devices, we want to have separate completion status made available for each individual sg fragment in the IO. *That* is why the kiobuf is more heavyweight than the networking variant: each fragment [kiobuf] in the scatter-gather list [kiovec] has its own completion information. If we have a bunch of separate data buffers queued for sequential disk IO as a single request, then we still want things like readahead and error handling to work. That means that we want the first kiobuf in the chain to get its completion wakeup as soon as that segment of the IO is complete, without having to wait for the remaining sectors of the IO to be transferred. It also means that if we've done something like split the IO over a raid stripe, then when an error occurs, we still want to know which of the callers' buffers succeeded and which failed. Yes, I agree that the original kiovec mechanism of using a *kiobuf[] array to assemble the scatter-gather fragments sucked. But I don't believe that just throwing away the concept of kiobuf as an sg-fragment will work either when it comes to disk IOs: the need for per-fragment completion is too compelling. 
I'd rather shift to allowing kiobufs to be assembled into linked lists for IO to avoid *kiobuf[] vectors, in just the same way that we currently chain buffer_heads for IO. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 17:34 ` Alan Cox 2001-02-01 17:49 ` Stephen C. Tweedie @ 2001-02-01 17:49 ` Christoph Hellwig 2001-02-01 17:58 ` Alan Cox 2001-02-01 19:33 ` Stephen C. Tweedie 2001-02-01 18:51 ` bcrl 2 siblings, 2 replies; 186+ messages in thread From: Christoph Hellwig @ 2001-02-01 17:49 UTC (permalink / raw) To: Alan Cox Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote: > > > I'm in the middle of some parts of it, and am actively soliciting > > > feedback on what cleanups are required. > > > > The real issue is that Linus dislikes the current kiobuf scheme. > > I do not like everything he proposes, but lots of things make sense. > > Linus basically designed the original kiobuf scheme of course so I guess > he's allowed to dislike it. Linus disliking something however doesn't mean > its wrong. Its not a technically valid basis for argument. Sure. But Linus saying that he doesn't want more of that (shit, crap, I don't remember what he said exactly) in the kernel is a very good reason for thinking a little more about it. Especially if most arguments look right to one after thinking more about it... > Linus' list of reasons like the amount of state are more interesting True. The argument that they are too heavyweight also. That they should allow scatter-gather without an array of structs also. > > > So, what are the benefits in the disk IO stack of adding length/offset > > > pairs to each page of the kiobuf? > > > > I don't see any real advantage for disk IO. The real advantage is that > > we can have a generic structure that is also useful in e.g. networking > > and can lead to a unified IO buffering scheme (a little like IO-Lite). > > Networking wants something lighter rather than heavier. Right. 
That's what the new design was about, besides adding an offset and length to every page instead of the page array, something also wanted by the networking in the first place. Look at the skb_frag struct in the zero-copy patch for what networking thinks it needs for physical page based buffers. > Adding tons of base/limit pairs to kiobufs makes it worse not better From looking at the networking code and listening to Dave and Ingo it looks like it makes the thing better for networking, although I cannot verify this due to my lack of familiarity with the networking code. For disk I/O it makes the handling a little easier for the cost of the additional offset/length fields. Christoph P.S. The tuple thing is also what Larry had in his initial slice paper. -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 17:49 ` Christoph Hellwig @ 2001-02-01 17:58 ` Alan Cox 2001-02-01 18:32 ` Rik van Riel 2001-02-01 19:33 ` Stephen C. Tweedie 1 sibling, 1 reply; 186+ messages in thread From: Alan Cox @ 2001-02-01 17:58 UTC (permalink / raw) To: Christoph Hellwig Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel > > Linus basically designed the original kiobuf scheme of course so I guess > > he's allowed to dislike it. Linus disliking something however doesn't mean > > its wrong. Its not a technically valid basis for argument. > > Sure. But Linus saying that he doesn't want more of that (shit, crap, > I don't remember what he said exactly) in the kernel is a very good reason > for thinking a little more about it. No. Linus is not a God, Linus is fallible, regularly makes mistakes and frequently opens his mouth and says stupid things when he is far too busy. > Especially if most arguments look right to one after thinking more about > it... I agree with the issues about networking wanting lightweight objects, I'm unconvinced however the existing setup for networking is sanely applicable for real world applications in other spaces. Take video capture. I want to stream 60Mbytes/second in multi-megabyte chunks between my capture cards and a high end raid array. The array wants 1Mbyte or larger blocks per I/O to reach 60Mbytes/second performance. This btw isn't benchmark crap like most of the zero copy networking, this is a real world application.. The current buffer head stuff is already heavier than the kio stuff. The networking stuff isn't oriented to that kind of I/O and would end up needing to do tons of extra processing. > For disk I/O it makes the handling a little easier for the cost of the > additional offset/length fields. I remain to be convinced by that. 
However, you do get 64 bytes/cacheline on a real processor nowadays, so if you touch any of that 64-byte block it is practically zero cost to fill the rest. Alan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 17:58 ` Alan Cox @ 2001-02-01 18:32 ` Rik van Riel 2001-02-01 18:59 ` yodaiken 0 siblings, 1 reply; 186+ messages in thread From: Rik van Riel @ 2001-02-01 18:32 UTC (permalink / raw) To: Alan Cox Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel On Thu, 1 Feb 2001, Alan Cox wrote: > > Sure. But Linus saying that he doesn't want more of that (shit, crap, > > I don't remember what he said exactly) in the kernel is a very good reason > > for thinking a little more about it. > > No. Linus is not a God, Linus is fallible, regularly makes mistakes and > frequently opens his mouth and says stupid things when he is far too busy. People may remember Linus saying a resolute no to SMP support in Linux ;) In my experience, when Linus says "NO" to a certain idea, he's usually objecting to bad design decisions in the proposed implementation of the idea and the lack of a nice alternative solution ... ... but as soon as a clean, efficient and maintainable alternative to the original bad idea surfaces, it seems to be quite easy to convince Linus to include it. cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 18:32 ` Rik van Riel @ 2001-02-01 18:59 ` yodaiken 0 siblings, 0 replies; 186+ messages in thread From: yodaiken @ 2001-02-01 18:59 UTC (permalink / raw) To: Rik van Riel Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel On Thu, Feb 01, 2001 at 04:32:48PM -0200, Rik van Riel wrote: > On Thu, 1 Feb 2001, Alan Cox wrote: > > > > Sure. But Linus saying that he doesn't want more of that (shit, crap, > > > I don't remember what he said exactly) in the kernel is a very good reason > > > for thinking a little more about it. > > > > No. Linus is not a God, Linus is fallible, regularly makes mistakes and > > frequently opens his mouth and says stupid things when he is far too busy. > > People may remember Linus saying a resolute no to SMP > support in Linux ;) And perhaps he was right! -- --------------------------------------------------------- Victor Yodaiken Finite State Machine Labs: The RTLinux Company. www.fsmlabs.com www.rtlinux.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 17:49 ` Christoph Hellwig 2001-02-01 17:58 ` Alan Cox @ 2001-02-01 19:33 ` Stephen C. Tweedie 1 sibling, 0 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-01 19:33 UTC (permalink / raw) To: Alan Cox, Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel Hi, On Thu, Feb 01, 2001 at 06:49:50PM +0100, Christoph Hellwig wrote: > > > Adding tons of base/limit pairs to kiobufs makes it worse not better > > For disk I/O it makes the handling a little easier for the cost of the > additional offset/length fields. Umm, actually, no, it makes it much worse for many of the cases. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 17:34 ` Alan Cox 2001-02-01 17:49 ` Stephen C. Tweedie 2001-02-01 17:49 ` Christoph Hellwig @ 2001-02-01 18:51 ` bcrl 2 siblings, 0 replies; 186+ messages in thread From: bcrl @ 2001-02-01 18:51 UTC (permalink / raw) To: Alan Cox Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel On Thu, 1 Feb 2001, Alan Cox wrote: > Linus list of reasons like the amount of state are more interesting The state is required, not optional, if we are to have a decent basis for building asynchronous io into the kernel. > Networking wants something lighter rather than heavier. Adding tons of > base/limit pairs to kiobufs makes it worse not better I'm still not seeing what I consider valid arguments from the networking people regarding the use of kiobufs as the interface they present to the VFS for asynchronous/bulk io. I agree with their need for a lightweight mechanism for getting small io requests from userland, and even the need for using lightweight scatter gather lists within the network layer itself. If the statement is that map_user_kiobuf is too heavy for use on every single io, sure. But that is a separate issue. -ben - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 15:09 ` Christoph Hellwig 2001-02-01 16:08 ` Steve Lord @ 2001-02-01 16:16 ` Stephen C. Tweedie 2001-02-01 17:05 ` Christoph Hellwig 1 sibling, 1 reply; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-01 16:16 UTC (permalink / raw) To: bsuparna, Stephen C. Tweedie, linux-kernel, kiobuf-io-devel Hi, On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote: > On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote: > > > > That would require the vfs interfaces themselves (address space > > readpage/writepage ops) to take kiobufs as arguments, instead of struct > > page * . That's not the case right now, is it ? > > No, and with the current kiobufs it would not make sense, because they > are to heavy-weight. Really? In what way? > With page,length,offsett iobufs this makes sense > and is IMHO the way to go. What, you mean adding *extra* stuff to the heavyweight kiobuf makes it lean enough to do the job?? Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 16:16 ` Stephen C. Tweedie @ 2001-02-01 17:05 ` Christoph Hellwig 2001-02-01 17:09 ` Christoph Hellwig 2001-02-01 17:41 ` Stephen C. Tweedie 0 siblings, 2 replies; 186+ messages in thread From: Christoph Hellwig @ 2001-02-01 17:05 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote: > Hi, > > On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote: > > On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote: > > > > > > That would require the vfs interfaces themselves (address space > > > readpage/writepage ops) to take kiobufs as arguments, instead of struct > > > page * . That's not the case right now, is it ? > > > > No, and with the current kiobufs it would not make sense, because they > > are to heavy-weight. > > Really? In what way? We can't allocate a huge kiobuf structure just for requesting one page of IO. It might get better with VM-level IO clustering though. > > > With page,length,offsett iobufs this makes sense > > and is IMHO the way to go. > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it > lean enough to do the job?? No. I was speaking abou the light-weight kiobuf Linux & Me discussed on lkml some time ago (though I'd much more like to call it kiovec analogous to BSD iovecs). And a page,offset,length tuple is pretty cheap compared to a current kiobuf. Christoph -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 17:05 ` Christoph Hellwig @ 2001-02-01 17:09 ` Christoph Hellwig 2001-02-01 17:41 ` Stephen C. Tweedie 1 sibling, 0 replies; 186+ messages in thread From: Christoph Hellwig @ 2001-02-01 17:09 UTC (permalink / raw) To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote: > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it > > lean enough to do the job?? > > No. I was speaking abou the light-weight kiobuf Linux & Me discussed on ^^^^^ Linus ... > lkml some time ago (though I'd much more like to call it kiovec analogous > to BSD iovecs). > > And a page,offset,length tuple is pretty cheap compared to a current kiobuf. Christoph (slapping himself for the stupid typo and selfreply ...) -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 17:05 ` Christoph Hellwig 2001-02-01 17:09 ` Christoph Hellwig @ 2001-02-01 17:41 ` Stephen C. Tweedie 2001-02-01 18:14 ` Christoph Hellwig 2001-02-01 20:04 ` Chaitanya Tumuluri 1 sibling, 2 replies; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-01 17:41 UTC (permalink / raw) To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel Hi, On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote: > On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote: > > > > > > No, and with the current kiobufs it would not make sense, because they > > > are to heavy-weight. > > > > Really? In what way? > > We can't allocate a huge kiobuf structure just for requesting one page of > IO. It might get better with VM-level IO clustering though. A kiobuf is *much* smaller than, say, a buffer_head, and we currently allocate a buffer_head per block for all IO. A kiobuf contains enough embedded page vector space for 16 pages by default, but I'm happy enough to remove that if people want. However, note that that memory is not initialised, so there is no memory access cost at all for that empty space. Remove that space and instead of one memory allocation per kiobuf, you get two, so the cost goes *UP* for small IOs. > > > With page,length,offsett iobufs this makes sense > > > and is IMHO the way to go. > > > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it > > lean enough to do the job?? > > No. I was speaking abou the light-weight kiobuf Linux & Me discussed on > lkml some time ago (though I'd much more like to call it kiovec analogous > to BSD iovecs). What is so heavyweight in the current kiobuf (other than the embedded vector, which I've already noted I'm willing to cut)? 
--Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 17:41 ` Stephen C. Tweedie @ 2001-02-01 18:14 ` Christoph Hellwig 2001-02-01 18:25 ` Alan Cox ` (2 more replies) 2001-02-01 20:04 ` Chaitanya Tumuluri 1 sibling, 3 replies; 186+ messages in thread From: Christoph Hellwig @ 2001-02-01 18:14 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote: > Hi, > > On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote: > > On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote: > > > > > > > > No, and with the current kiobufs it would not make sense, because they > > > > are to heavy-weight. > > > > > > Really? In what way? > > > > We can't allocate a huge kiobuf structure just for requesting one page of > > IO. It might get better with VM-level IO clustering though. > > A kiobuf is *much* smaller than, say, a buffer_head, and we currently > allocate a buffer_head per block for all IO. A kiobuf is 124 bytes, a buffer_head 96. And a buffer_head is additionally used for caching data, a kiobuf is not. > > A kiobuf contains enough embedded page vector space for 16 pages by > default, but I'm happy enough to remove that if people want. However, > note that that memory is not initialised, so there is no memory access > cost at all for that empty space. Remove that space and instead of > one memory allocation per kiobuf, you get two, so the cost goes *UP* > for small IOs. You could still embed it into a surrounding structure, even if there are cases where an additional memory allocation is needed, yes. > > > > > With page,length,offsett iobufs this makes sense > > > > and is IMHO the way to go. > > > > > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it > > > lean enough to do the job?? > > > > No. 
I was speaking abou the light-weight kiobuf Linux & Me discussed on > > lkml some time ago (though I'd much more like to call it kiovec analogous > to BSD iovecs). > > What is so heavyweight in the current kiobuf (other than the embedded > vector, which I've already noted I'm willing to cut)? array_len, io_count, the presence of wait_queue AND end_io, and the lack of scatter gather in one kiobuf struct (you always need an array), and AFAICS that is what the networking guys dislike. They often just want multiple buffers in one physical page, and an array of those. Now one could say: just let the networkers use their own kind of buffers (and that's exactly what is done in the zerocopy patches), but that again leads to inefficient buffer passing and ungeneric IO handling. S.th. like:

	struct kiovec {
		struct page *	kv_page;	/* physical page */
		u_short		kv_offset;	/* offset into page */
		u_short		kv_length;	/* data length */
	};

	enum kio_flags {
		KIO_LOANED,	/* the calling subsystem wants this buf back */
		KIO_GIFTED,	/* thanks for the buffer, man! */
		KIO_COW		/* copy on write (XXX: not yet) */
	};

	struct kio {
		struct kiovec *		kio_data;	/* our kiovecs */
		int			kio_ndata;	/* # of kiovecs */
		int			kio_flags;	/* loaned or gifted? */
		void *			kio_priv;	/* caller private data */
		wait_queue_head_t	kio_wait;	/* wait queue */
	};

makes it a lot simpler for the subsystems to integrate. Christoph -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 18:14 ` Christoph Hellwig @ 2001-02-01 18:25 ` Alan Cox 2001-02-01 18:39 ` Rik van Riel 2001-02-01 18:48 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Christoph Hellwig 2001-02-01 19:32 ` Stephen C. Tweedie 2001-02-02 4:18 ` bcrl 2 siblings, 2 replies; 186+ messages in thread From: Alan Cox @ 2001-02-01 18:25 UTC (permalink / raw) To: Christoph Hellwig Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel > array_len, io_count, the presence of wait_queue AND end_io, and the lack of > scatter gather in one kiobuf struct (you always need an array), and AFAICS > that is what the networking guys dislike. You need a completion pointer. It's arguable whether you want the wait_queue in the default structure or as part of whatever it's contained in and handled by the completion pointer. And I've actually bothered to talk to the networking people and they don't have a problem with the completion pointer. > Now one could say: just let the networkers use their own kind of buffers > (and that's exactly what is done in the zerocopy patches), but that again leds > to inefficient buffer passing and ungeneric IO handling. Careful. This is the line of reasoning which also says Aeroplanes are good for travelling long distances Cars are better for getting to my front door Therefore everyone should drive a 747 home It is quite possible that the right thing to do is to do conversions in the cases it happens. That might seem a good reason for having offset/length pairs on each block, because streaming from the network to disk you may well get a collection of partial pages of data you need to write to disk. Unfortunately the reality of DMA support on almost (but not quite) all disk controllers is that you don't get that degree of scatter gather. 
My I2O controllers and I think the fusion controllers could indeed benefit and cope with being given a pile of randomly located 1480-byte chunks of data and being asked to put them on disk. I do seriously doubt there are any real-world situations where this is useful. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 18:25 ` Alan Cox @ 2001-02-01 18:39 ` Rik van Riel 2001-02-01 18:46 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox 2001-02-01 18:48 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Christoph Hellwig 1 sibling, 1 reply; 186+ messages in thread From: Rik van Riel @ 2001-02-01 18:39 UTC (permalink / raw) To: Alan Cox Cc: Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel On Thu, 1 Feb 2001, Alan Cox wrote: > > Now one could say: just let the networkers use their own kind of buffers > > (and that's exactly what is done in the zerocopy patches), but that again leds > > to inefficient buffer passing and ungeneric IO handling. [snip] > It is quite possible that the right thing to do is to do > conversions in the cases it happens. OTOH, somehow a zero-copy system which converts the zero-copy metadata every time the buffer is handed to another subsystem just doesn't sound right ... (well, maybe it _is_, but it looks quite inefficient at first glance) regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait 2001-02-01 18:39 ` Rik van Riel @ 2001-02-01 18:46 ` Alan Cox 0 siblings, 0 replies; 186+ messages in thread From: Alan Cox @ 2001-02-01 18:46 UTC (permalink / raw) To: Rik van Riel Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel > OTOH, somehow a zero-copy system which converts the zero-copy > metadata every time the buffer is handed to another subsystem > just doesn't sound right ... > > (well, maybe it _is_, but it looks quite inefficient at first > glance) I would certainly be a lot happier if there is a single sensible zero copy format doing the lot, but only if it doesn't turn into a cross between a 747 and a bicycle - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 18:25 ` Alan Cox 2001-02-01 18:39 ` Rik van Riel @ 2001-02-01 18:48 ` Christoph Hellwig 2001-02-01 18:57 ` Alan Cox 1 sibling, 1 reply; 186+ messages in thread From: Christoph Hellwig @ 2001-02-01 18:48 UTC (permalink / raw) To: Alan Cox Cc: Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel On Thu, Feb 01, 2001 at 06:25:16PM +0000, Alan Cox wrote: > > array_len, io_count, the presence of wait_queue AND end_io, and the lack of > > scatter gather in one kiobuf struct (you always need an array), and AFAICS > > that is what the networking guys dislike. > > You need a completion pointer. Its arguable whether you want the wait_queue > in the default structure or as part of whatever its contained in and handled > by the completion pointer. I personally think that Ben's function pointer on wakeup work is the alternative in this area. > And I've actually bothered to talk to the networking people and they dont have > a problem with the completion pointer. I have never said that they don't like it - but having both the waitqueue and the completion handler in the kiobuf makes it bigger. > > Now one could say: just let the networkers use their own kind of buffers > > (and that's exactly what is done in the zerocopy patches), but that again leds > > to inefficient buffer passing and ungeneric IO handling. > > Careful. This is the line of reasoning which also says > > Aeroplanes are good for travelling long distances > Cars are better for getting to my front door > Therefore everyone should drive a 747 home Hehe ;) > It is quite possible that the right thing to do is to do conversions in the > cases it happens. Yes, this would be THE alternative to my suggestion. 
> That might seem a good reason for having offset/length > pairs on each block, because streaming from the network to disk you may well > get a collection of partial pages of data you need to write to disk. > Unfortunately the reality of DMA support on almost (but not quite) all > disk controllers is that you don't get that degree of scatter gather. > > My I2O controllers and I think the fusion controllers could indeed benefit > and cope with being given a pile of randomly located 1480 byte chunks of > data and being asked to put them on disk. It doesn't really matter that much, because we write to the pagecache first anyway. The real thing is that we want to have some common data structure for describing physical memory used for IO. We could either use special structures in every subsystem and then copy between them or pass struct page * and lose meta information. Or we could try to find a structure that holds enough information to make passing it from one subsystem to another useful. The cut-down kio design (heavily inspired by Larry McVoy's splice paper) should allow just that, nothing more and nothing less. For use in disk-io and networking or v4l there are probably other primary data structures needed, and that's ok. Christoph -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 18:48 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Christoph Hellwig @ 2001-02-01 18:57 ` Alan Cox 2001-02-01 19:00 ` Christoph Hellwig 0 siblings, 1 reply; 186+ messages in thread From: Alan Cox @ 2001-02-01 18:57 UTC (permalink / raw) To: Christoph Hellwig Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel > It doesn't really matter that much, because we write to the pagecache > first anyway. Not for raw I/O. Although for the drivers that can't cope then going via the page cache is certainly the next best alternative > The real thing is that we want to have some common data structure for > describing physical memory used for IO. We could either use special Yes. You also need a way to describe it in terms of page * in order to do mm locking for raw I/O (like the video capture stuff wants) > by Larry McVoy's splice paper) should allow just that, nothing more an > nothing less. For use in disk-io and networking or v4l there are probably > other primary data structures needed, and that's ok. Certainly having the lightweight one a subset of the heavyweight one is a good target. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 18:57 ` Alan Cox @ 2001-02-01 19:00 ` Christoph Hellwig 0 siblings, 0 replies; 186+ messages in thread From: Christoph Hellwig @ 2001-02-01 19:00 UTC (permalink / raw) To: Alan Cox; +Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel On Thu, Feb 01, 2001 at 06:57:41PM +0000, Alan Cox wrote: > Not for raw I/O. Although for the drivers that can't cope then going via > the page cache is certainly the next best alternative True - but raw-io has its own alignment issues anyway. > Yes. You also need a way to describe it in terms of page * in order to do > mm locking for raw I/O (like the video capture stuff wants) Right. (That's why we have the struct page * always as part of the structure) > Certainly having the lightweight one a subset of the heavyweight one is a good > target. Yes, I'm trying to address that... Christoph -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 18:14 ` Christoph Hellwig 2001-02-01 18:25 ` Alan Cox @ 2001-02-01 19:32 ` Stephen C. Tweedie 2001-02-01 20:46 ` Christoph Hellwig 2001-02-02 4:18 ` bcrl 2 siblings, 1 reply; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-01 19:32 UTC (permalink / raw) To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel Hi, On Thu, Feb 01, 2001 at 07:14:03PM +0100, Christoph Hellwig wrote: > On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote: > > > > > > We can't allocate a huge kiobuf structure just for requesting one page of > > > IO. It might get better with VM-level IO clustering though. > > > > A kiobuf is *much* smaller than, say, a buffer_head, and we currently > > allocate a buffer_head per block for all IO. > > A kiobuf is 124 bytes, ... the vast majority of which is room for the page vector to expand without having to be copied. You don't touch that in the normal case. > a buffer_head 96. And a buffer_head is additionally > used for caching data, a kiobuf not. Buffer_heads are _sometimes_ used for caching data. That's one of the big problems with them, they are too overloaded, being both IO descriptors _and_ cache descriptors. If you've got 128k of data to write out from user space, do you want to set up one kiobuf or 256 buffer_heads? Buffer_heads become really very heavy indeed once you start doing non-trivial IO. > > What is so heavyweight in the current kiobuf (other than the embedded > > vector, which I've already noted I'm willing to cut)? > > array_len kiobufs can be reused after IO. You can depopulate a kiobuf, repopulate it with new pages and submit new IO without having to deallocate the kiobuf. You can't do this without knowing how big the data vector is. Removing that functionality will prevent reuse, making them _more_ heavyweight. 
> io_count, Right now we can take a kiobuf and turn it into a bunch of buffer_heads for IO. The io_count lets us track all of those sub-IOs so that we know when all submitted IO has completed, so that we can pass the completion callback back up the chain without having to allocate yet more descriptor structs for the IO. Again, remove this and the IO becomes more heavyweight because we need to create a separate struct for the info. > the presence of wait_queue AND end_io, That's fine, I'm happy scrapping the wait queue: people can always use the kiobuf private data field to refer to a wait queue if they want to. > and the lack of > scatter gather in one kiobuf struct (you always need an array) Again, _all_ data being sent down through the block device layer is either in buffer heads or is page aligned. You want us to triple the size of the "heavyweight" kiobuf's data vector for what gain, exactly? Obviously, extra code will be needed to scan kiobufs if we do that, and unless we have both per-page _and_ per-kiobuf start/offset pairs (adding even further to the complexity), those scatter-gather lists would prevent us from carving up a kiobuf into smaller sub-ios without copying the whole (expanded) vector. That's a _lot_ of extra complexity in the disk IO layers. I'm all for a fast kiobuf_to_sglist converter. But I haven't seen any evidence that such scatter-gather lists will do anything in the block device case except complicate the code and decrease performance. > S.th. like: ... > makes it a lot simpler for the subsytems to integrate. Possibly, but I remain to be convinced, because you may end up with a mechanism which is generic but is not well-tuned for any specific case, so everything goes slower. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 19:32 ` Stephen C. Tweedie @ 2001-02-01 20:46 ` Christoph Hellwig 2001-02-01 21:25 ` Stephen C. Tweedie 0 siblings, 1 reply; 186+ messages in thread From: Christoph Hellwig @ 2001-02-01 20:46 UTC (permalink / raw) To: "Stephen C. Tweedie"; +Cc: bsuparna, linux-kernel, kiobuf-io-devel In article <20010201193221.D11607@redhat.com> you wrote: > Buffer_heads are _sometimes_ used for caching data. Actually they are mostly used, but that shouldn't have any bearing on the discussion... > That's one of the > big problems with them, they are too overloaded, being both IO > descriptors _and_ cache descriptors. Agreed. > If you've got 128k of data to > write out from user space, do you want to set up one kiobuf or 256 > buffer_heads? Buffer_heads become really very heavy indeed once you > start doing non-trivial IO. Sure - I was never arguing in favor of buffer_heads ... >> > What is so heavyweight in the current kiobuf (other than the embedded >> > vector, which I've already noted I'm willing to cut)? >> >> array_len > kiobufs can be reused after IO. You can depopulate a kiobuf, > repopulate it with new pages and submit new IO without having to > deallocate the kiobuf. You can't do this without knowing how big the > data vector is. Removing that functionality will prevent reuse, > making them _more_ heavyweight. >> io_count, > Right now we can take a kiobuf and turn it into a bunch of > buffer_heads for IO. The io_count lets us track all of those sub-IOs > so that we know when all submitted IO has completed, so that we can > pass the completion callback back up the chain without having to > allocate yet more descriptor structs for the IO. > Again, remove this and the IO becomes more heavyweight because we need > to create a separate struct for the info. No. Just allow passing multiples of the device's blocksize over ll_rw_block. 
XFS is doing that and it just needs an audit of the lesser-used block drivers. >> and the lack of >> scatter gather in one kiobuf struct (you always need an array) > Again, _all_ data being sent down through the block device layer is > either in buffer heads or is page aligned. That's the point. You are always talking about the block-layer only. And I think it should be generic instead. Looks like that is the major point. > You want us to triple the > size of the "heavyweight" kiobuf's data vector for what gain, exactly? double. > Obviously, extra code will be needed to scan kiobufs if we do that, > and unless we have both per-page _and_ per-kiobuf start/offset pairs > (adding even further to the complexity), those scatter-gather lists > would prevent us from carving up a kiobuf into smaller sub-ios without > copying the whole (expanded) vector. No. I think I explained that in my last mail. > That's a _lot_ of extra complexity in the disk IO layers. > Possibly, but I remain to be convinced, because you may end up with a > mechanism which is generic but is not well-tuned for any specific > case, so everything goes slower. As kiobufs are widely used for real IO, just as containers, this is better than nothing. And IMHO a nice generic concept that lets different subsystems work together is a _lot_ better than a bunch of over-optimized, rather isolated subsystems. The IO-Lite people have done nice research on the effect of a unified IO-caching system vs. the typical isolated systems. Christoph -- Of course it doesn't work. We've performed a software upgrade. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 186+ messages in thread
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains 2001-02-01 20:46 ` Christoph Hellwig @ 2001-02-01 21:25 ` Stephen C. Tweedie 2001-02-02 11:51 ` Christoph Hellwig 0 siblings, 1 reply; 186+ messages in thread From: Stephen C. Tweedie @ 2001-02-01 21:25 UTC (permalink / raw) To: Christoph Hellwig Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel Hi, On Thu, Feb 01, 2001 at 09:46:27PM +0100, Christoph Hellwig wrote: > > Right now we can take a kiobuf and turn it into a bunch of > > buffer_heads for IO. The io_count lets us track all of those sub-IOs > > so that we know when all submitted IO has completed, so that we can > > pass the completion callback back up the chain without having to > > allocate yet more descriptor structs for the IO. > > > Again, remove this and the IO becomes more heavyweight because we need > > to create a separate struct for the info. > > No. Just allow passing the multiple of the devices blocksize over > ll_rw_block. That was just one example: you need the sub-ios just as much when you split up an IO over stripe boundaries in LVM or raid0, for example. Secondly, ll_rw_block needs to die anyway: you can expand the blocksize up to PAGE_SIZE but not beyond, whereas something like ll_rw_kiobuf can submit a much larger IO atomically (and we have devices which don't start to deliver good throughput until you use IO sizes of 1MB or more). > >> and the lack of > >> scatter gather in one kiobuf struct (you always need an array) > > > Again, _all_ data being sent down through the block device layer is > > either in buffer heads or is page aligned. > > That's the point. You are always talking about the block-layer only. I'm talking about why the minimal, generic solution doesn't provide what the block layer needs. 
> > Obviously, extra code will be needed to scan kiobufs if we do that,
> > and unless we have both per-page _and_ per-kiobuf start/offset pairs
> > (adding even further to the complexity), those scatter-gather lists
> > would prevent us from carving up a kiobuf into smaller sub-ios without
> > copying the whole (expanded) vector.
>
> No. I think I explained that in my last mail.

How? If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want to split it in two, I have to make two new vectors (page X, offset 0, length n) and (page X, offset n, length PAGE_SIZE-n). That implies copying both vectors.

If I have a page vector with a single offset/length pair, I can build a new header with the same vector and modified offset/length to split the vector in two without copying it.

> > Possibly, but I remain to be convinced, because you may end up with a
> > mechanism which is generic but is not well-tuned for any specific
> > case, so everything goes slower.
>
> As kiobufs are widely used for real IO, just as containers, this is
> better than nothing.

Surely having all of the subsystems working fast is better still?

> And IMHO a nice generic concept that lets different subsystems work
> together is a _lot_ better than a bunch of over-optimized, rather isolated
> subsystems. The IO-Lite people have done nice research on the effect of
> a unified IO-caching system vs. the typical isolated systems.

I know, and IO-Lite has some major problems (the close integration of that code into the cache, for example, makes it harder to expose the zero-copy to user-land).

--Stephen
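Stephen's split-without-copy argument can be sketched in userspace C. This is illustrative only (the struct and function names are invented, not the kernel kiobuf API): with a single per-header offset/length pair, both halves of a split share the original page vector by reference, so nothing in the vector is rewritten or copied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, userspace sketch -- not the kernel kiobuf API.
 * A page vector shared by reference, plus a header that carries a
 * single (offset, length) pair into that vector. */
struct pagevec {
    void **pages;   /* the pages themselves */
    int    npages;
};

struct kio_head {
    struct pagevec *vec;  /* shared between all headers, never copied */
    size_t offset;        /* byte offset of this IO into the vector */
    size_t length;        /* bytes covered by this header */
};

/* Split 'src' at byte 'n': both halves reference the same page vector,
 * so no per-page (offset, length) pairs need to be rewritten. */
static void kio_split(const struct kio_head *src, size_t n,
                      struct kio_head *a, struct kio_head *b)
{
    a->vec = src->vec;
    a->offset = src->offset;
    a->length = n;

    b->vec = src->vec;
    b->offset = src->offset + n;
    b->length = src->length - n;
}
```

With per-page offset/length pairs instead, the same split would force allocating and filling two new vectors, which is exactly the copying cost Stephen objects to.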
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
From: Christoph Hellwig @ 2001-02-02 11:51 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 09:25:08PM +0000, Stephen C. Tweedie wrote:

> > No. Just allow passing a multiple of the device's blocksize over
> > ll_rw_block.
>
> That was just one example: you need the sub-ios just as much when
> you split up an IO over stripe boundaries in LVM or raid0, for
> example.

IIRC that's why you designed (and I thought of independently) clone-kiobufs.

> Secondly, ll_rw_block needs to die anyway: you can expand
> the blocksize up to PAGE_SIZE but not beyond, whereas something like
> ll_rw_kiobuf can submit a much larger IO atomically (and we have
> devices which don't start to deliver good throughput until you use
> IO sizes of 1MB or more).

Completely agreed.

> If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
> to split it in two, I have to make two new vectors (page X, offset 0,
> length n) and (page X, offset n, length PAGE_SIZE-n). That implies
> copying both vectors.
>
> If I have a page vector with a single offset/length pair, I can build
> a new header with the same vector and modified offset/length to split
> the vector in two without copying it.

You just say in the higher-level structure: ignore from x to y, even if they have an offset in their own vector.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
From: Stephen C. Tweedie @ 2001-02-02 14:04 UTC (permalink / raw)
To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

Hi,

On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:

> > If I have a page vector with a single offset/length pair, I can build
> > a new header with the same vector and modified offset/length to split
> > the vector in two without copying it.
>
> You just say in the higher-level structure ignore from x to y even if
> they have an offset in their own vector.

Exactly --- and so you end up with something _much_ uglier, because you end up with all sorts of combinations of length/offset fields all over the place. This is _precisely_ the mess I want to avoid.

Cheers,
Stephen
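The LVM/raid0 stripe-boundary splitting that came up earlier in this exchange is mechanical enough to show in a tiny helper. This is a sketch with invented names, not the actual raid0 or LVM code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch (names invented): when raid0/LVM must not let a
 * sub-io cross a 'stripe'-byte boundary, the first chunk of an IO of
 * 'len' bytes starting at device offset 'off' has this length. */
static size_t first_chunk_len(size_t off, size_t len, size_t stripe)
{
    size_t to_boundary = stripe - (off % stripe);

    return len < to_boundary ? len : to_boundary;
}

/* A full splitter just loops: carve off first_chunk_len() bytes,
 * advance off and reduce len, until len reaches zero. Each chunk
 * becomes a sub-io whose completion the parent request must track,
 * which is why sub-io bookkeeping is needed beyond ll_rw_block. */
```

Each chunk produced this way is one of the "sub-ios" whose completions have to be gathered back up before the parent IO can be reported done.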
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
From: bcrl @ 2001-02-02 4:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Christoph Hellwig wrote:

> A kiobuf is 124 bytes, a buffer_head 96. And a buffer_head is additionally
> used for caching data, a kiobuf not.

Go measure the cost of a distant cache miss, then complain about having everything in one structure. Also, 1 kiobuf maps 16-128 times as much data as a single buffer head.

> enum kio_flags {
>         KIO_LOANED,     /* the calling subsystem wants this buf back */
>         KIO_GIFTED,     /* thanks for the buffer, man! */
>         KIO_COW         /* copy on write (XXX: not yet) */
> };

This is a Really Bad Idea. Having semantics depend on a subtle flag determined by a caller is a sure way to

> struct kio {
>         struct kiovec *         kio_data;       /* our kiovecs */
>         int                     kio_ndata;      /* # of kiovecs */
>         int                     kio_flags;      /* loaned or gifted? */
>         void *                  kio_priv;       /* caller private data */
>         wait_queue_head_t       kio_wait;       /* wait queue */
> };
>
> makes it a lot simpler for the subsystems to integrate.

Keep in mind that using distant memory allocations for kio_data will incur additional cache misses. The atomic count is probably going to be widely used; I see it being applicable to the network stack, block io layers and others. Also, how is information about io completion status passed back to the caller? That information is required across layers so that io can be properly aborted or can proceed with the partial amount of io. Add those back in and we're right back to the original kiobuf structure.
-ben
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
From: Christoph Hellwig @ 2001-02-02 12:12 UTC (permalink / raw)
To: bcrl
Cc: Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 11:18:56PM -0500, bcrl@redhat.com wrote:

> On Thu, 1 Feb 2001, Christoph Hellwig wrote:
>
> > A kiobuf is 124 bytes, a buffer_head 96. And a buffer_head is additionally
> > used for caching data, a kiobuf not.
>
> Go measure the cost of a distant cache miss, then complain about having
> everything in one structure. Also, 1 kiobuf maps 16-128 times as much
> data as a single buffer head.

I'd never dispute that. It was just an answer to Stephen's "a kiobuf is already smaller".

> > enum kio_flags {
> >         KIO_LOANED,     /* the calling subsystem wants this buf back */
> >         KIO_GIFTED,     /* thanks for the buffer, man! */
> >         KIO_COW         /* copy on write (XXX: not yet) */
> > };
>
> This is a Really Bad Idea. Having semantics depend on a subtle flag
> determined by a caller is a sure way to

The semantics aren't different for the using subsystem. LOANED vs GIFTED is an issue for the free function; COW will probably be a page-level mm thing - though I haven't thought a lot about it yet and am not sure whether it actually makes sense.

> > struct kio {
> >         struct kiovec *         kio_data;       /* our kiovecs */
> >         int                     kio_ndata;      /* # of kiovecs */
> >         int                     kio_flags;      /* loaned or gifted? */
> >         void *                  kio_priv;       /* caller private data */
> >         wait_queue_head_t       kio_wait;       /* wait queue */
> > };
> >
> > makes it a lot simpler for the subsystems to integrate.
>
> Keep in mind that using distant memory allocations for kio_data will incur
> additional cache misses.

It could also be a [0] array at the end, allowing for a single allocation, but that looks more like an implementation detail than a design problem to me.
> The atomic count is probably going to be widely
> used; I see it being applicable to the network stack, block io layers and
> others.

Hmm. Currently it is used only for the multiple-buffer_heads-per-iobuf cruft, and I don't see why multiple outstanding IOs should be noted in a kiobuf.

> Also, how is information about io completion status passed back
> to the caller?

Yes, there needs to be a kio_errno field - though I wanted to get rid of it, I had to re-add it in later versions of my design.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.
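Christoph's "[0] array at the end, allowing for a single allocation" can be sketched with a C99 flexible array member. This is a userspace illustration, not his actual patch: the outer field names follow the struct he posted, but the kiovec members and the allocator are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace sketch of the "[0] array at the end" idea: header and
 * vector come from one allocation, so the vector sits adjacent to the
 * header in memory (addressing bcrl's distant-cache-miss concern).
 * kiovec's members are invented for illustration. */
struct kiovec {
    void  *page;
    size_t offset;
    size_t length;
};

struct kio {
    int           kio_ndata;   /* # of kiovecs */
    int           kio_flags;   /* loaned or gifted? */
    void         *kio_priv;    /* caller private data */
    struct kiovec kio_data[];  /* C99 flexible array member */
};

/* One malloc covers the header and 'ndata' vector entries. */
static struct kio *kio_alloc(int ndata)
{
    struct kio *kio = malloc(sizeof(*kio) +
                             (size_t)ndata * sizeof(struct kiovec));

    if (kio) {
        kio->kio_ndata = ndata;
        kio->kio_flags = 0;
        kio->kio_priv = NULL;
    }
    return kio;
}
```

The trade-off is that the vector length is fixed at allocation time, which matters for the request-merge case Chaitanya raises later in the thread.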
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
From: Chaitanya Tumuluri @ 2001-02-01 20:04 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:

> Hi,
>
> On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > >
> > > > No, and with the current kiobufs it would not make sense, because they
> > > > are too heavy-weight.
> > >
> > > Really? In what way?
> >
> > We can't allocate a huge kiobuf structure just for requesting one page of
> > IO. It might get better with VM-level IO clustering though.
>
> A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> allocate a buffer_head per block for all IO.
>
> A kiobuf contains enough embedded page vector space for 16 pages by
> default, but I'm happy enough to remove that if people want. However,
> note that that memory is not initialised, so there is no memory access
> cost at all for that empty space. Remove that space and instead of
> one memory allocation per kiobuf, you get two, so the cost goes *UP*
> for small IOs.
>
> > > > With page,length,offset iobufs this makes sense
> > > > and is IMHO the way to go.
> > >
> > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > > lean enough to do the job??
> >
> > No. I was speaking about the light-weight kiobuf Linus and I discussed on
> > lkml some time ago (though I'd much rather call it kiovec, analogous
> > to BSD iovecs).
>
> What is so heavyweight in the current kiobuf (other than the embedded
> vector, which I've already noted I'm willing to cut)?
Hi,

It'd seem that "array_len", "locked", "bounced", "io_count" and "errno" are the fields that need to go away (apart from the "maplist"). The field "nr_pages" would reincarnate in the kiovec struct (which is not a plain array anymore) as the field "nbufs". See below.

Based on what I've seen fly by on the lists, here's my understanding of the proposed new kiobuf/kiovec structures:

===========================================================================
/*
 * a simple page,offset,length tuple like Linus wants it
 */
struct kiobuf {
        struct page *   page;   /* The page itself */
        u_16            offset; /* Offset to start of valid data */
        u_16            length; /* Number of valid bytes of data */
};

struct kiovec {
        int             nbufs;  /* Kiobufs actually referenced */
        struct kiobuf * bufs;
};

/*
 * the name is just plain stupid, but that shouldn't matter
 */
struct vfs_kiovec {
        struct kiovec * iov;

        /* private data, mostly for the callback */
        void * private;

        /* completion callback */
        void (*end_io) (struct vfs_kiovec *);

        wait_queue_head_t wait_queue;
};
===========================================================================

Is this correct? If so, I have a few questions/clarifications:

- The [ll_rw_blk, scsi/ide request-functions, scsi/ide I/O completion handling] functions would be handed the "X_kiovec" struct, correct?

- So, the soft-RAID / LVM layers need to construct their own "lvm_kiovec" structs to handle request splits and the partial completions, correct?

- Then, what are the semantics of request-merges containing the "X_kiovec" structs in the block I/O queueing layers? Do we add "X_kiovec->next", "X_kiovec->prev" etc. fields? It will also require a re-allocation of a new and longer kiovec->bufs array, correct?

- How are I/O error codes to be propagated back to the higher (calling) layers? I think that needs to be added into the "X_kiovec" struct, no?

- How is bouncing to be handled with this setup?
(Some state is needed to (a) determine that bouncing occurred, (b) find out which pages have been bounced and (c) find out the bounce page for each of these bounced pages.)

Cheers,
-Chait.
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
From: Alan Cox @ 2001-02-04 16:46 UTC (permalink / raw)
To: bsuparna
Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox, Christoph Hellwig, Andi Kleen

> It appears that we are coming across 2 kinds of requirements for kiobuf
> vectors - and quite a bit of debate centering around that.
>
> 1. In the block device i/o world, where large i/os may be involved, we'd
> 2. In the networking world, we deal with smaller fragments (for protocol

It's probably worth commenting at this point that the I2O message-passing layers do indeed have both #1 and #2 type descriptor chains to optimise performance for different tasks. We aren't the only people to hit this.

I2O supports (offset, pagelist, length), where the middle pages in the list are entirely copied, and sets of (addr, len) tuples.
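The two I2O descriptor forms Alan mentions are mechanically related. Here is a hedged sketch (all names invented, not I2O code) that expands an (offset, pagelist, length) descriptor, where only the first page has a nonzero starting offset and middle pages are fully covered, into per-page (addr, len) tuples:

```c
#include <assert.h>
#include <stddef.h>

/* One (addr, len) tuple of the second descriptor form. */
struct sg_tuple {
    char  *addr;
    size_t len;
};

/* Expand a (offset, pagelist, length) descriptor: data starts at
 * 'offset' within pages[0] and runs for 'length' bytes across
 * consecutive page-sized pages. Returns the number of tuples written.
 * Illustrative only; no bounds checking on 'out'. */
static int expand_pagelist(char **pages, size_t page_size,
                           size_t offset, size_t length,
                           struct sg_tuple *out)
{
    int n = 0;

    while (length > 0) {
        size_t chunk = page_size - offset;

        if (chunk > length)
            chunk = length;
        out[n].addr = pages[n] + offset;
        out[n].len = chunk;
        n++;
        length -= chunk;
        offset = 0;   /* only the first page has a nonzero offset */
    }
    return n;
}
```

The compact form needs one offset and one length for the whole chain; the tuple form pays per-entry overhead but can describe arbitrary fragments, which is why the two suit block i/o and networking respectively.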
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
From: bsuparna @ 2001-02-12 14:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin Dalecki, Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

Going through all the discussions once again, and trying to look at this from the point of view of just the basic requirements for data structures and mechanisms that they imply:

1. We should have a data structure that represents a memory chain, which may not be contiguous in physical memory, and which can be passed down as a single unit all the way through to the lowest-level drivers - e.g. for direct i/o to/from a contiguous virtual address range in user space (without any intermediate copies).

(Networking and block i/o may require different optimizations in the design of such a data structure, due to differences in the kind of patterns expected, as is apparent from the zero-copy networking fragments vs. the raw i/o kiobuf/kiovec patches. There are situations when such a data structure may be passed between subsystems, as in the i2o example.)

This data structure could be part of an I/O container.

2. I/O containers may get split or merged as they pass through various layers - so any completion mechanism and i/o container design should be able to account for both cases. At any point, a request could be:
- a collection of several higher-level requests, or
- one among several sub-requests of a single higher-level request.

(Just as appropriate "clustering" could happen at each level, appropriate "splitting" may also take place depending on the situation.
It may make sense to delay splitting as far down the chain as possible in many situations, where the higher level is only interested in the i/o in its entirety and not in partial completion.)

When caching/buffers are involved, sometimes the sub-requests of a single higher-level request may have individual completion requirements (even when no merges were involved), because the sub-request buffers may be used to service other requests alongside. With raw i/o that might not be the case.

3. It is desirable that layers which process the requests along the way without splitting/merging be able to pass along the same I/O container without any duplication or cloning, and intercept async i/o completions for post-processing.

4. (Optional) It would be nice if different kinds of I/O containers or buffer structures could be used at different levels, without having explicit linkage fields (like bh --> page, for example), and in a way that intermediate drivers or layers can work transparently.

3 & 4 are more layering-related items, which gets a little specific, but do 1 and 2 cover the general things we are looking for?

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : bsuparna@in.ibm.com
Phone : 91-80-5267117, Extn : 2525
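Point 2 above, sub-requests completing independently while the parent waits for all of them, is essentially an io_count plus a callback chain. A minimal userspace sketch (names invented, not a proposed kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of stacked completion: a parent request fans out into
 * sub-requests; each completion drops an io_count-style counter, and
 * the last one to finish fires the callback and propagates completion
 * (plus the first error seen) to the parent. Names are hypothetical. */
struct ioreq {
    int pending;                     /* outstanding sub-requests */
    int error;                       /* first error seen, 0 if none */
    void (*end_io)(struct ioreq *);  /* completion callback, may be NULL */
    struct ioreq *parent;            /* NULL for a top-level request */
};

static void ioreq_complete(struct ioreq *req, int err)
{
    if (err && !req->error)
        req->error = err;            /* record the first failure */
    if (--req->pending == 0) {
        if (req->end_io)
            req->end_io(req);        /* intercept point for filter layers */
        if (req->parent)
            ioreq_complete(req->parent, req->error);
    }
}
```

A layer that neither splits nor merges (requirement 3) can intercept by swapping in its own end_io and chaining to the saved one, without cloning the container. A real implementation would need the decrement to be atomic; the plain `--` here is only for illustration.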