linux-kernel.vger.kernel.org archive mirror
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait  /notify + callback chains
@ 2001-02-01 14:44 bsuparna
  2001-02-01 15:09 ` Christoph Hellwig
  0 siblings, 1 reply; 186+ messages in thread
From: bsuparna @ 2001-02-01 14:44 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel, kiobuf-io-devel


>Hi,
>
>On Thu, Feb 01, 2001 at 10:25:22AM +0530, bsuparna@in.ibm.com wrote:
>>
>> >We _do_ need the ability to stack completion events, but as far as the
>> >kiobuf work goes, my current thoughts are to do that by stacking
>> >lightweight "clone" kiobufs.
>>
>> Would that work with stackable filesystems ?
>
>Only if the filesystems were using VFS interfaces which used kiobufs.
>Right now, the only filesystem using kiobufs is XFS, and it only
>passes them down to the block device layer, not to other filesystems.

That would require the vfs interfaces themselves (the address space
readpage/writepage ops) to take kiobufs as arguments, instead of struct
page *. That's not the case right now, is it?
To take this example, a filter filesystem would be layered over XFS.
So right now a filter filesystem sees only the struct page * and passes
it along; any completion event stacking has to be applied with reference
to that.
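
For reference, the 2.4 address_space operations are page-based, roughly
like this (quoting from memory, so treat the prototypes as approximate):

struct address_space_operations {
        int (*writepage)(struct page *);
        int (*readpage)(struct file *, struct page *);
        int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
        int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
        /* ... */
};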


>> Being able to track the children of a kiobuf would help with I/O
>> cancellation (e.g. to pull sub-ios off their request queues if I/O
>> cancellation for the parent kiobuf was issued). Not essential, I guess, in
>> general, but useful in some situations.
>
>What exactly is the justification for IO cancellation?  It really
>upsets the normal flow of control through the IO stack to have
>voluntary cancellation semantics.

One reason that I saw is that if the results of an i/o are no longer
required due to some condition (e.g. aio cancellation, or the process
that issued the i/o getting killed), then cancellation avoids the
unnecessary disk i/o, provided the request hasn't been scheduled yet.

Too remote a requirement? If the capability/support doesn't exist at the
driver level, I guess it's difficult.

--Stephen


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 14:44 [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains bsuparna
@ 2001-02-01 15:09 ` Christoph Hellwig
  2001-02-01 16:08   ` Steve Lord
  2001-02-01 16:16   ` Stephen C. Tweedie
  0 siblings, 2 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 15:09 UTC (permalink / raw)
  To: bsuparna; +Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> 
> >Hi,
> >
> >On Thu, Feb 01, 2001 at 10:25:22AM +0530, bsuparna@in.ibm.com wrote:
> >>
> >> >We _do_ need the ability to stack completion events, but as far as the
> >> >kiobuf work goes, my current thoughts are to do that by stacking
> >> >lightweight "clone" kiobufs.
> >>
> >> Would that work with stackable filesystems ?
> >
> >Only if the filesystems were using VFS interfaces which used kiobufs.
> >Right now, the only filesystem using kiobufs is XFS, and it only
> >passes them down to the block device layer, not to other filesystems.
> 
> That would require the vfs interfaces themselves (address space
> readpage/writepage ops) to take kiobufs as arguments, instead of struct
> page *  . That's not the case right now, is it ?

No, and with the current kiobufs it would not make sense, because they
are too heavy-weight.  With page,length,offset iobufs this makes sense
and is IMHO the way to go.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 15:09 ` Christoph Hellwig
@ 2001-02-01 16:08   ` Steve Lord
  2001-02-01 16:49     ` Stephen C. Tweedie
  2001-02-01 16:16   ` Stephen C. Tweedie
  1 sibling, 1 reply; 186+ messages in thread
From: Steve Lord @ 2001-02-01 16:08 UTC (permalink / raw)
  To: hch, linux-kernel, kiobuf-io-devel

Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> > 
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page *  . That's not the case right now, is it ?
> 
> No, and with the current kiobufs it would not make sense, because they
> are too heavy-weight.  With page,length,offset iobufs this makes sense
> and is IMHO the way to go.
> 
> 	Christoph
> 

Enquiring minds would like to know if you are working towards this
revamp of the kiobuf structure at the moment; you have been very quiet
recently.

Steve



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 15:09 ` Christoph Hellwig
  2001-02-01 16:08   ` Steve Lord
@ 2001-02-01 16:16   ` Stephen C. Tweedie
  2001-02-01 17:05     ` Christoph Hellwig
  1 sibling, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 16:16 UTC (permalink / raw)
  To: bsuparna, Stephen C. Tweedie, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> > 
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page *  . That's not the case right now, is it ?
> 
> No, and with the current kiobufs it would not make sense, because they
> are too heavy-weight.

Really?  In what way?  

> With page,length,offset iobufs this makes sense
> and is IMHO the way to go.

What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
lean enough to do the job??

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 16:08   ` Steve Lord
@ 2001-02-01 16:49     ` Stephen C. Tweedie
  2001-02-01 17:02       ` Christoph Hellwig
  0 siblings, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 16:49 UTC (permalink / raw)
  To: Steve Lord; +Cc: hch, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 10:08:45AM -0600, Steve Lord wrote:
> Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> > > 
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > > page *  . That's not the case right now, is it ?
> > 
> > No, and with the current kiobufs it would not make sense, because they
> > are too heavy-weight.  With page,length,offset iobufs this makes sense
> > and is IMHO the way to go.
> 
> Enquiring minds would like to know if you are working towards this 
> revamp of the kiobuf structure at the moment, you have been very quiet
> recently. 

I'm in the middle of some parts of it, and am actively soliciting
feedback on what cleanups are required.  

I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and
have started doing some of the cleanups needed --- removing the
embedded page vector, and adding support for lightweight stacking of
kiobufs for completion callback chains.

However, filesystem IO is almost *always* page aligned: O_DIRECT IO
comes from VM pages, and internal filesystem IO comes from page cache
pages.  Buffer cache IOs are the only exception, and kiobufs only fail
for such IOs once you have multiple buffer_heads being merged into
single requests.

So, what are the benefits in the disk IO stack of adding length/offset
pairs to each page of the kiobuf?  Basically, the only advantage right
now is that it would allow us to merge requests together without
having to chain separate kiobufs.  However, chaining kiobufs in this
case is actually much better than merging them if the original IOs
came in as kiobufs: merging kiobufs requires us to reallocate a new,
longer (page/offset/len) vector, whereas chaining kiobufs is just a
list operation.
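
To make the cost difference concrete, a sketch (illustrative only: the
"link" list field and both helpers are made up, not existing code):

/* Chaining: O(1), neither page vector is touched. */
static void kiobuf_chain(struct kiobuf *parent, struct kiobuf *sub)
{
        list_add_tail(&sub->link, &parent->link);  /* hypothetical field */
}

/* Merging: allocate a longer page vector and copy both old ones. */
static int kiobuf_merge(struct kiobuf *a, struct kiobuf *b)
{
        int n = a->nr_pages + b->nr_pages;
        struct page **vec = kmalloc(n * sizeof(*vec), GFP_KERNEL);

        if (!vec)
                return -ENOMEM;
        memcpy(vec, a->maplist, a->nr_pages * sizeof(*vec));
        memcpy(vec + a->nr_pages, b->maplist, b->nr_pages * sizeof(*vec));
        if (a->maplist != a->map_array)
                kfree(a->maplist);
        a->maplist = vec;
        a->nr_pages = n;
        return 0;
}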

Having true scatter-gather lists in the kiobuf would let us represent
arbitrary lists of buffer_heads as a single kiobuf, though, and that
_is_ a big win if we can avoid using buffer_heads below the
ll_rw_block layer at all.  (It's not clear that this is really
possible, though, since we still need to propagate completion
information back up into each individual buffer head's status and wait
queue.)

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 16:49     ` Stephen C. Tweedie
@ 2001-02-01 17:02       ` Christoph Hellwig
  2001-02-01 17:34         ` Alan Cox
  0 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 17:02 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Steve Lord, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 04:49:58PM +0000, Stephen C. Tweedie wrote:
> > Enquiring minds would like to know if you are working towards this 
> > revamp of the kiobuf structure at the moment, you have been very quiet
> > recently. 
> 
> I'm in the middle of some parts of it, and am actively soliciting
> feedback on what cleanups are required.  

The real issue is that Linus dislikes the current kiobuf scheme.
I do not like everything he proposes, but lots of things make sense.

> I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and
> have started doing some of the cleanups needed --- removing the
> embedded page vector, and adding support for lightweight stacking of
> kiobufs for completion callback chains.

Ok, great.

> However, filesystem IO is almost *always* page aligned: O_DIRECT IO
> comes from VM pages, and internal filesystem IO comes from page cache
> pages.  Buffer cache IOs are the only exception, and kiobufs only fail
> for such IOs once you have multiple buffer_heads being merged into
> single requests.
> 
> So, what are the benefits in the disk IO stack of adding length/offset
> pairs to each page of the kiobuf?

I don't see any real advantage for disk IO.  The real advantage is that
we can have a generic structure that is also useful in e.g. networking
and can lead to a unified IO buffering scheme (a little like IO-Lite).

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 16:16   ` Stephen C. Tweedie
@ 2001-02-01 17:05     ` Christoph Hellwig
  2001-02-01 17:09       ` Christoph Hellwig
  2001-02-01 17:41       ` Stephen C. Tweedie
  0 siblings, 2 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 17:05 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> > > 
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > > page *  . That's not the case right now, is it ?
> > 
> > No, and with the current kiobufs it would not make sense, because they
> > are too heavy-weight.
> 
> Really?  In what way?  

We can't allocate a huge kiobuf structure just for requesting one page of
IO.  It might get better with VM-level IO clustering though.

> 
> > With page,length,offset iobufs this makes sense
> > and is IMHO the way to go.
> 
> What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> lean enough to do the job??

No.  I was speaking abou the light-weight kiobuf Linux & Me discussed on
lkml some time ago (though I'd much more like to call it kiovec analogous
to BSD iovecs).

And a page,offset,length tuple is pretty cheap compared to a current kiobuf.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:49           ` Stephen C. Tweedie
@ 2001-02-01 17:09             ` Chaitanya Tumuluri
  2001-02-01 20:33             ` Christoph Hellwig
  1 sibling, 0 replies; 186+ messages in thread
From: Chaitanya Tumuluri @ 2001-02-01 17:09 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > > 
> > > I don't see any real advantage for disk IO.  The real advantage is that
> > > we can have a generic structure that is also useful in e.g. networking
> > > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> > 
> > Networking wants something lighter rather than heavier. Adding tons of
> > base/limit pairs to kiobufs makes it worse not better
> 
> Networking has fundamentally different requirements.  In a network
> stack, you want the ability to add fragments to unaligned chunks of
> data to represent headers at any point in the stack.
> 
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks).  The data which the user
> started with is the data sent out on the wire.  You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer.  

Or the case of BSD-style UIO implementing the readv() and writev() calls.
These may or may not align perfectly, so address-length lists per page could
be helpful.

I did try an implementation of this for rawio and found that I had to
restrict the a-len lists coming in via the user iovecs to be aligned.

> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list.  If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

This would solve the issue with UIO, yes. Also, I think Martin Peterson
(mkp) had taken a stab at doing "clone-kiobufs" for LVM at some point.

Martin?

Cheers,
-Chait.


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:05     ` Christoph Hellwig
@ 2001-02-01 17:09       ` Christoph Hellwig
  2001-02-01 17:41       ` Stephen C. Tweedie
  1 sibling, 0 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 17:09 UTC (permalink / raw)
  To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > lean enough to do the job??
> 
> No.  I was speaking abou the light-weight kiobuf Linux & Me discussed on
						   ^^^^^ Linus ...
> lkml some time ago (though I'd much more like to call it kiovec analogous
> to BSD iovecs).
> 
> And a page,offset,length tuple is pretty cheap compared to a current kiobuf.

	Christoph (slapping himself for the stupid typo and selfreply ...)

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:02       ` Christoph Hellwig
@ 2001-02-01 17:34         ` Alan Cox
  2001-02-01 17:49           ` Stephen C. Tweedie
                             ` (2 more replies)
  0 siblings, 3 replies; 186+ messages in thread
From: Alan Cox @ 2001-02-01 17:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel

> > I'm in the middle of some parts of it, and am actively soliciting
> > feedback on what cleanups are required.  
> 
> The real issue is that Linus dislikes the current kiobuf scheme.
> I do not like everything he proposes, but lots of things make sense.

Linus basically designed the original kiobuf scheme of course, so I guess
he's allowed to dislike it. Linus disliking something, however, doesn't mean
it's wrong. It's not a technically valid basis for argument.

Linus' list of reasons, like the amount of state, is more interesting

> > So, what are the benefits in the disk IO stack of adding length/offset
> > pairs to each page of the kiobuf?
> 
> I don't see any real advantage for disk IO.  The real advantage is that
> we can have a generic structure that is also useful in e.g. networking
> and can lead to a unified IO buffering scheme (a little like IO-Lite).

Networking wants something lighter rather than heavier. Adding tons of
base/limit pairs to kiobufs makes it worse, not better.


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:05     ` Christoph Hellwig
  2001-02-01 17:09       ` Christoph Hellwig
@ 2001-02-01 17:41       ` Stephen C. Tweedie
  2001-02-01 18:14         ` Christoph Hellwig
  2001-02-01 20:04         ` Chaitanya Tumuluri
  1 sibling, 2 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 17:41 UTC (permalink / raw)
  To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > 
> > > No, and with the current kiobufs it would not make sense, because they
> > > are too heavy-weight.
> > 
> > Really?  In what way?  
> 
> We can't allocate a huge kiobuf structure just for requesting one page of
> IO.  It might get better with VM-level IO clustering though.

A kiobuf is *much* smaller than, say, a buffer_head, and we currently
allocate a buffer_head per block for all IO.

A kiobuf contains enough embedded page vector space for 16 pages by
default, but I'm happy enough to remove that if people want.  However,
note that that memory is not initialised, so there is no memory access
cost at all for that empty space.  Remove that space and instead of
one memory allocation per kiobuf, you get two, so the cost goes *UP*
for small IOs.
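
For reference, the layout in question is roughly this (abbreviated from
the 2.4 <linux/iobuf.h>, so take the details as approximate):

struct kiobuf {
        int               nr_pages;   /* Pages actually referenced     */
        int               array_len;  /* Space in the allocated lists  */
        int               offset;     /* Offset to start of valid data */
        int               length;     /* Number of valid bytes of data */
        struct page **    maplist;    /* -> map_array unless expanded  */

        /* Embedded page slots; uninitialised until actually used */
        struct page *     map_array[KIO_STATIC_PAGES];

        atomic_t          io_count;   /* IOs still in progress         */
        int               errno;      /* Status of completed IO        */
        void              (*end_io)(struct kiobuf *);
        wait_queue_head_t wait_queue;
};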

> > > With page,length,offset iobufs this makes sense
> > > and is IMHO the way to go.
> > 
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > lean enough to do the job??
> 
> No.  I was speaking abou the light-weight kiobuf Linux & Me discussed on
> lkml some time ago (though I'd much more like to call it kiovec analogous
> to BSD iovecs).

What is so heavyweight in the current kiobuf (other than the embedded
vector, which I've already noted I'm willing to cut)?

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:34         ` Alan Cox
@ 2001-02-01 17:49           ` Stephen C. Tweedie
  2001-02-01 17:09             ` Chaitanya Tumuluri
  2001-02-01 20:33             ` Christoph Hellwig
  2001-02-01 17:49           ` Christoph Hellwig
  2001-02-01 18:51           ` bcrl
  2 siblings, 2 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 17:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > 
> > I don't see any real advantage for disk IO.  The real advantage is that
> > we can have a generic structure that is also useful in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> 
> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse not better

Networking has fundamentally different requirements.  In a network
stack, you want the ability to add fragments to unaligned chunks of
data to represent headers at any point in the stack.

In the disk IO case, you basically don't get that (the only thing
which comes close is raid5 parity blocks).  The data which the user
started with is the data sent out on the wire.  You do get some
interesting cases such as soft raid and LVM, or even in the scsi stack
if you run out of mailbox space, where you need to send only a
sub-chunk of the input buffer.  

In that case, having offset/len as the kiobuf limit markers is ideal:
you can clone a kiobuf header using the same page vector as the
parent, narrow down the start/end points, and continue down the stack
without having to copy any part of the page list.  If you had the
offset/len data encoded implicitly into each entry in the sglist, you
would not be able to do that.
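
As a sketch of what such a clone amounts to (the helper itself is
illustrative, not an existing interface):

/* Share the parent's page vector; only the IO window changes. */
static void kiobuf_clone_narrow(struct kiobuf *clone, struct kiobuf *parent,
                                int offset, int length)
{
        clone->maplist  = parent->maplist;      /* no page list copying */
        clone->nr_pages = parent->nr_pages;
        clone->offset   = parent->offset + offset;
        clone->length   = length;
        clone->errno    = 0;
        atomic_set(&clone->io_count, 0);
        clone->end_io   = parent->end_io;       /* or a chained callback */
}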

--Stephen


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:34         ` Alan Cox
  2001-02-01 17:49           ` Stephen C. Tweedie
@ 2001-02-01 17:49           ` Christoph Hellwig
  2001-02-01 17:58             ` Alan Cox
  2001-02-01 19:33             ` Stephen C. Tweedie
  2001-02-01 18:51           ` bcrl
  2 siblings, 2 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 17:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel

On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > > I'm in the middle of some parts of it, and am actively soliciting
> > > feedback on what cleanups are required.  
> > 
> > The real issue is that Linus dislikes the current kiobuf scheme.
> > I do not like everything he proposes, but lots of things make sense.
> 
> Linus basically designed the original kiobuf scheme of course, so I guess
> he's allowed to dislike it. Linus disliking something, however, doesn't mean
> it's wrong. It's not a technically valid basis for argument.

Sure.  But Linus saying that he doesn't want more of that (shit, crap,
I don't remember what he said exactly) in the kernel is a very good reason
for thinking a little more about it.

Especially if most arguments look right to one after thinking more about
it...

> Linus' list of reasons, like the amount of state, is more interesting

True.  The argument that they are too heavyweight holds as well,
as does the one that they should allow scatter gather without an array of structs.


> > > So, what are the benefits in the disk IO stack of adding length/offset
> > > pairs to each page of the kiobuf?
> > 
> > I don't see any real advantage for disk IO.  The real advantage is that
> > we can have a generic structure that is also useful in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> 
> Networking wants something lighter rather than heavier.

Right.  That's what the new design was about, besides adding an offset and
length to every page instead of the page array, something also wanted by
the networking in the first place.
Look at the skb_frag struct in the zero-copy patch for what networking
thinks it needs for physical page based buffers.
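
For reference, that structure is just a page/offset/length triple
(quoted from memory, so the details may be slightly off):

typedef struct skb_frag_struct {
        struct page     *page;
        __u16           page_offset;
        __u16           size;
} skb_frag_t;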

> Adding tons of base/limit pairs to kiobufs makes it worse not better

From looking at the networking code and listening to Dave and Ingo, it looks
like it makes the thing better for networking, although I cannot verify
this due to my lack of familiarity with the networking code.

For disk I/O it makes the handling a little easier at the cost of the
additional offset/length fields.

	Christoph

P.S. The tuple thing is also what Larry had in his initial slice paper.
-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:49           ` Christoph Hellwig
@ 2001-02-01 17:58             ` Alan Cox
  2001-02-01 18:32               ` Rik van Riel
  2001-02-01 19:33             ` Stephen C. Tweedie
  1 sibling, 1 reply; 186+ messages in thread
From: Alan Cox @ 2001-02-01 17:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, Steve Lord,
	linux-kernel, kiobuf-io-devel

> > Linus basically designed the original kiobuf scheme of course, so I guess
> > he's allowed to dislike it. Linus disliking something, however, doesn't mean
> > it's wrong. It's not a technically valid basis for argument.
> 
> Sure.  But Linus saying that he doesn't want more of that (shit, crap,
> I don't remember what he said exactly) in the kernel is a very good reason
> for thinking a little more about it.

No. Linus is not a God, Linus is fallible, regularly makes mistakes and
frequently opens his mouth and says stupid things when he is far too busy.

> Especially if most arguments look right to one after thinking more about
> it...

I agree with the issues about networking wanting lightweight objects; I'm
unconvinced, however, that the existing setup for networking is sanely
applicable to real-world applications in other spaces.

Take video capture. I want to stream 60Mbytes/second in multi-megabyte
chunks between my capture cards and a high-end raid array. The array wants
1Mbyte or larger blocks per I/O to reach 60Mbytes/second performance.

This, btw, isn't benchmark crap like most of the zero-copy networking; this is
a real-world application.

The current buffer head stuff is already heavier than the kio stuff. The
networking stuff isn't oriented to that kind of I/O and would end up
needing to do tons of extra processing.

> For disk I/O it makes the handling a little easier for the cost of the
> additional offset/length fields.

I remain to be convinced by that. However, you do get 64 bytes/cacheline on
a real processor nowadays, so if you touch any of that 64-byte block it is
practically zero cost to fill the rest.

Alan


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:41       ` Stephen C. Tweedie
@ 2001-02-01 18:14         ` Christoph Hellwig
  2001-02-01 18:25           ` Alan Cox
                             ` (2 more replies)
  2001-02-01 20:04         ` Chaitanya Tumuluri
  1 sibling, 3 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 18:14 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > > 
> > > > No, and with the current kiobufs it would not make sense, because they
> > > > are too heavy-weight.
> > > 
> > > Really?  In what way?  
> > 
> > We can't allocate a huge kiobuf structure just for requesting one page of
> > IO.  It might get better with VM-level IO clustering though.
> 
> A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> allocate a buffer_head per block for all IO.

A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
used for caching data; a kiobuf is not.

> 
> A kiobuf contains enough embedded page vector space for 16 pages by
> default, but I'm happy enough to remove that if people want.  However,
> note that that memory is not initialised, so there is no memory access
> cost at all for that empty space.  Remove that space and instead of
> one memory allocation per kiobuf, you get two, so the cost goes *UP*
> for small IOs.

You could still embed it into a surrounding structure, even if there are cases
where an additional memory allocation is needed, yes.

> 
> > > > With page,length,offset iobufs this makes sense
> > > > and is IMHO the way to go.
> > > 
> > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > > lean enough to do the job??
> > 
> > No.  I was speaking abou the light-weight kiobuf Linux & Me discussed on
> > lkml some time ago (though I'd much more like to call it kiovec analogous
> > to BSD iovecs).
> 
> What is so heavyweight in the current kiobuf (other than the embedded
> vector, which I've already noted I'm willing to cut)?

array_len, io_count, the presence of wait_queue AND end_io, and the lack of
scatter gather in one kiobuf struct (you always need an array), and AFAICS
that is what the networking guys dislike.

They often just want multiple buffers in one physical page, and an array of
those.

Now one could say: just let the networkers use their own kind of buffers
(and that's exactly what is done in the zerocopy patches), but that again leads
to inefficient buffer passing and ungeneric IO handling.

Something like:

struct kiovec {
	struct page *           kv_page;        /* physical page        */
	u_short                 kv_offset;      /* offset into page     */
	u_short                 kv_length;      /* data length          */
};
			 
enum kio_flags {
	KIO_LOANED,     /* the calling subsystem wants this buf back    */
	KIO_GIFTED,     /* thanks for the buffer, man!                  */
	KIO_COW         /* copy on write (XXX: not yet)                 */
};


struct kio {
	struct kiovec *         kio_data;       /* our kiovecs          */
	int                     kio_ndata;      /* # of kiovecs         */
	int                     kio_flags;      /* loaned or gifted?    */
	void *                  kio_priv;       /* caller private data  */
	wait_queue_head_t       kio_wait;	/* wait queue           */
};

makes it a lot simpler for the subsystems to integrate.
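
Using it would then look something like this (kio_submit and the two
pages are made up, purely to show the shape of the interface):

        struct page *page1, *page2;     /* assumed already pinned */
        struct kiovec vec[2];
        struct kio kio;

        vec[0].kv_page = page1; vec[0].kv_offset = 0; vec[0].kv_length = PAGE_SIZE;
        vec[1].kv_page = page2; vec[1].kv_offset = 0; vec[1].kv_length = 512;

        kio.kio_data  = vec;
        kio.kio_ndata = 2;
        kio.kio_flags = KIO_LOANED;     /* we want the buffers back */
        kio.kio_priv  = NULL;
        init_waitqueue_head(&kio.kio_wait);

        kio_submit(&kio);               /* made-up entry point */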

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:14         ` Christoph Hellwig
@ 2001-02-01 18:25           ` Alan Cox
  2001-02-01 18:39             ` Rik van Riel
  2001-02-01 18:48             ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Christoph Hellwig
  2001-02-01 19:32           ` Stephen C. Tweedie
  2001-02-02  4:18           ` bcrl
  2 siblings, 2 replies; 186+ messages in thread
From: Alan Cox @ 2001-02-01 18:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

> array_len, io_count, the presence of wait_queue AND end_io, and the lack of
> scatter gather in one kiobuf struct (you always need an array), and AFAICS
> that is what the networking guys dislike.

You need a completion pointer. It's arguable whether you want the wait_queue
in the default structure or as part of whatever it's contained in and handled
by the completion pointer.

And I've actually bothered to talk to the networking people and they don't have
a problem with the completion pointer.

> Now one could say: just let the networkers use their own kind of buffers
> (and that's exactly what is done in the zerocopy patches), but that again leads
> to inefficient buffer passing and ungeneric IO handling.

Careful.  This is the line of reasoning which also says

Aeroplanes are good for travelling long distances
Cars are better for getting to my front door
Therefore everyone should drive a 747 home

It is quite possible that the right thing to do is to do conversions in the
cases where it happens. That might seem a good reason for having offset/length
pairs on each block, because when streaming from the network to disk you may well
get a collection of partial pages of data you need to write to disk. 
Unfortunately the reality of DMA support on almost (but not quite) all
disk controllers is that you don't get that degree of scatter gather.

My I2O controllers and I think the fusion controllers could indeed benefit
and cope with being given a pile of randomly located 1480-byte chunks of
data and being asked to put them on disk.

I do seriously doubt there are any real-world situations where this is useful.





* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:58             ` Alan Cox
@ 2001-02-01 18:32               ` Rik van Riel
  2001-02-01 18:59                 ` yodaiken
  0 siblings, 1 reply; 186+ messages in thread
From: Rik van Riel @ 2001-02-01 18:32 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel

On Thu, 1 Feb 2001, Alan Cox wrote:

> > Sure.  But Linus saying that he doesn't want more of that (shit, crap,
> > I don't remember what he said exactly) in the kernel is a very good reason
> > for thinking a little more about it.
> 
> No. Linus is not a God, Linus is fallible, regularly makes mistakes and
> frequently opens his mouth and says stupid things when he is far too busy.

People may remember Linus saying a resolute no to SMP
support in Linux ;)

In my experience, when Linus says "NO" to a certain
idea, he's usually objecting to bad design decisions
in the proposed implementation of the idea and the
lack of a nice alternative solution ...

... but as soon as a clean, efficient and maintainable
alternative to the original bad idea surfaces, it seems
to be quite easy to convince Linus to include it.

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:25           ` Alan Cox
@ 2001-02-01 18:39             ` Rik van Riel
  2001-02-01 18:46               ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox
  2001-02-01 18:48             ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Christoph Hellwig
  1 sibling, 1 reply; 186+ messages in thread
From: Rik van Riel @ 2001-02-01 18:39 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel,
	kiobuf-io-devel

On Thu, 1 Feb 2001, Alan Cox wrote:

> > Now one could say: just let the networkers use their own kind of buffers
> > (and that's exactly what is done in the zerocopy patches), but that again leads
> > to inefficient buffer passing and ungeneric IO handling.

	[snip]
> It is quite possible that the right thing to do is to do
> conversions in the cases it happens.

OTOH, somehow a zero-copy system which converts the zero-copy
metadata every time the buffer is handed to another subsystem
just doesn't sound right ...

(well, maybe it _is_, but it looks quite inefficient at first
glance)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-01 18:39             ` Rik van Riel
@ 2001-02-01 18:46               ` Alan Cox
  0 siblings, 0 replies; 186+ messages in thread
From: Alan Cox @ 2001-02-01 18:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, bsuparna,
	linux-kernel, kiobuf-io-devel

> OTOH, somehow a zero-copy system which converts the zero-copy
> metadata every time the buffer is handed to another subsystem
> just doesn't sound right ...
> 
> (well, maybe it _is_, but it looks quite inefficient at first
> glance)

I would certainly be a lot happier if there were a single sensible zero-copy
format doing the lot, but only if it doesn't turn into a cross between a 747
and a bicycle.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:25           ` Alan Cox
  2001-02-01 18:39             ` Rik van Riel
@ 2001-02-01 18:48             ` Christoph Hellwig
  2001-02-01 18:57               ` Alan Cox
  1 sibling, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 18:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel,
	kiobuf-io-devel

On Thu, Feb 01, 2001 at 06:25:16PM +0000, Alan Cox wrote:
> > array_len, io_count, the presence of wait_queue AND end_io, and the lack of
> > scatter gather in one kiobuf struct (you always need an array), and AFAICS
> > that is what the networking guys dislike.
> 
> You need a completion pointer. Its arguable whether you want the wait_queue
> in the default structure or as part of whatever its contained in and handled
> by the completion pointer.

I personally think that Ben's function-pointer-on-wakeup work is the
alternative in this area.

> And I've actually bothered to talk to the networking people and they dont have
> a problem with the completion pointer.

I have never said that they don't like it - but having both the waitqueue and the
completion handler in the kiobuf makes it bigger.

> > Now one could say: just let the networkers use their own kind of buffers
> > (and that's exactly what is done in the zerocopy patches), but that again leads
> > to inefficient buffer passing and ungeneric IO handling.
> 
> Careful.  This is the line of reasoning which also says
> 
> Aeroplanes are good for travelling long distances
> Cars are better for getting to my front door
> Therefore everyone should drive a 747 home

Hehe ;)

> It is quite possible that the right thing to do is to do conversions in the
> cases it happens.

Yes, this would be THE alternative to my suggestion.

> That might seem a good reason for having offset/length
> pairs on each block, because streaming from the network to disk you may well
> get a collection of partial pages of data you need to write to disk. 
> Unfortunately the reality of DMA support on almost (but not quite) all
> disk controllers is that you don't get that degree of scatter gather.
> 
> My I2O controllers and I think the fusion controllers could indeed benefit
> and cope with being given a pile of randomly located 1480 byte chunks of 
> data and being asked to put them on disk.

It doesn't really matter that much, because we write to the pagecache
first anyway.

The real thing is that we want to have some common data structure for
describing physical memory used for IO.  We could either use special
structures in every subsystem and then copy between them or pass
struct page * and lose meta information.  Or we could try to find a
structure that holds enough information to make passing it from one
subsystem to another useful.  The cut-down kio design (heavily inspired
by Larry McVoy's splice paper) should allow just that, nothing more and
nothing less.  For use in disk-io and networking or v4l there are probably
other primary data structures needed, and that's ok.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:34         ` Alan Cox
  2001-02-01 17:49           ` Stephen C. Tweedie
  2001-02-01 17:49           ` Christoph Hellwig
@ 2001-02-01 18:51           ` bcrl
  2 siblings, 0 replies; 186+ messages in thread
From: bcrl @ 2001-02-01 18:51 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel

On Thu, 1 Feb 2001, Alan Cox wrote:

> Linus' list of reasons, like the amount of state, is more interesting

The state is required, not optional, if we are to have a decent basis for
building asynchronous io into the kernel.

> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse not better

I'm still not seeing what I consider valid arguments from the networking
people regarding the use of kiobufs as the interface they present to the
VFS for asynchronous/bulk io.  I agree with their need for a lightweight
mechanism for getting small io requests from userland, and even the need
for using lightweight scatter gather lists within the network layer
itself.  If the statement is that map_user_kiobuf is too heavy for use on
every single io, sure.  But that is a separate issue.
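
For reference, the per-io cost in question is the existing 2.4 raw-io
pattern, roughly (error handling trimmed; signatures from memory, so
approximate):

        struct kiobuf *iobuf;           /* rw, user_addr, len, dev,
                                           blocks, blocksize assumed */
        int err;

        err = alloc_kiovec(1, &iobuf);
        if (err)
                return err;

        /* pinning the user pages is the expensive per-io step */
        err = map_user_kiobuf(rw, iobuf, user_addr, len);
        if (!err) {
                err = brw_kiovec(rw, 1, &iobuf, dev, blocks, blocksize);
                unmap_kiobuf(iobuf);
        }
        free_kiovec(1, &iobuf);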

		-ben



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:48             ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Christoph Hellwig
@ 2001-02-01 18:57               ` Alan Cox
  2001-02-01 19:00                 ` Christoph Hellwig
  0 siblings, 1 reply; 186+ messages in thread
From: Alan Cox @ 2001-02-01 18:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, bsuparna,
	linux-kernel, kiobuf-io-devel

> It doesn't really matter that much, because we write to the pagecache
> first anyway.

Not for raw I/O, although for the drivers that can't cope, going via
the page cache is certainly the next best alternative.

> The real thing is that we want to have some common data structure for
> describing physical memory used for IO.  We could either use special

Yes. You also need a way to describe it in terms of page * in order to do
mm locking for raw I/O (like the video capture stuff wants)

> by Larry McVoy's splice paper) should allow just that, nothing more an
> nothing less.  For use in disk-io and networking or v4l there are probably
> other primary data structures needed, and that's ok.

Certainly having the lightweight one be a subset of the heavyweight one is a
good target.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:32               ` Rik van Riel
@ 2001-02-01 18:59                 ` yodaiken
  0 siblings, 0 replies; 186+ messages in thread
From: yodaiken @ 2001-02-01 18:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, Steve Lord,
	linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 04:32:48PM -0200, Rik van Riel wrote:
> On Thu, 1 Feb 2001, Alan Cox wrote:
> 
> > > Sure.  But Linus saying that he doesn't want more of that (shit, crap,
> > > I don't remember what he said exactly) in the kernel is a very good reason
> > > for thinking a little more about it.
> > 
> > No. Linus is not a God, Linus is fallible, regularly makes mistakes and
> > frequently opens his mouth and says stupid things when he is far too busy.
> 
> People may remember Linus saying a resolute no to SMP
> support in Linux ;)

And perhaps he was right!

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:57               ` Alan Cox
@ 2001-02-01 19:00                 ` Christoph Hellwig
  0 siblings, 0 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 19:00 UTC (permalink / raw)
  To: Alan Cox; +Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 06:57:41PM +0000, Alan Cox wrote:
> Not for raw I/O. Although for the drivers that can't cope then going via
> the page cache is certainly the next best alternative

True - but raw-io has its own alignment issues anyway.

> Yes. You also need a way to describe it in terms of page * in order to do
> mm locking for raw I/O (like the video capture stuff wants)

Right. (That's why we have the struct page * always as part of the structure)

> Certainly having the lightweight one be a subset of the heavyweight one is a
> good target.

Yes, I'm trying to address that...

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:14         ` Christoph Hellwig
  2001-02-01 18:25           ` Alan Cox
@ 2001-02-01 19:32           ` Stephen C. Tweedie
  2001-02-01 20:46             ` Christoph Hellwig
  2001-02-02  4:18           ` bcrl
  2 siblings, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 19:32 UTC (permalink / raw)
  To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 07:14:03PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote:
> > > 
> > > We can't allocate a huge kiobuf structure just for requesting one page of
> > > IO.  It might get better with VM-level IO clustering though.
> > 
> > A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> > allocate a buffer_head per block for all IO.
> 
> A kiobuf is 124 bytes,

... the vast majority of which is room for the page vector to expand
without having to be copied.  You don't touch that in the normal case.

> a buffer_head 96.  And a buffer_head is additionally
> used for caching data; a kiobuf is not.

Buffer_heads are _sometimes_ used for caching data.  That's one of the
big problems with them: they are too overloaded, being both IO
descriptors _and_ cache descriptors.  If you've got 128k of data to
write out from user space, do you want to set up one kiobuf or 256
buffer_heads?  Buffer_heads become really very heavy indeed once you
start doing non-trivial IO.

> > What is so heavyweight in the current kiobuf (other than the embedded
> > vector, which I've already noted I'm willing to cut)?
> 
> array_len

kiobufs can be reused after IO.  You can depopulate a kiobuf,
repopulate it with new pages and submit new IO without having to
deallocate the kiobuf.  You can't do this without knowing how big the
data vector is.  Removing that functionality will prevent reuse,
making them _more_ heavyweight.
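
In other words, the reuse pattern is simply (using the existing 2.4
calls; setup and error handling omitted):

        for (i = 0; i < nr_ios; i++) {
                map_user_kiobuf(rw, iobuf, addr[i], len[i]);
                brw_kiovec(rw, 1, &iobuf, dev, blocks[i], blocksize);
                unmap_kiobuf(iobuf);    /* ready to be repopulated */
        }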

> io_count,

Right now we can take a kiobuf and turn it into a bunch of
buffer_heads for IO.  The io_count lets us track all of those sub-IOs
so that we know when all submitted IO has completed, so that we can
pass the completion callback back up the chain without having to
allocate yet more descriptor structs for the IO.

Again, remove this and the IO becomes more heavyweight because we need
to create a separate struct for the info.
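
The accounting itself is just an atomic counter, roughly like this
(simplified from the real end-io path):

static void kiobuf_end_sub_io(struct kiobuf *kiobuf, int uptodate)
{
        if (!uptodate)
                kiobuf->errno = -EIO;
        if (atomic_dec_and_test(&kiobuf->io_count)) {
                if (kiobuf->end_io)
                        kiobuf->end_io(kiobuf); /* hand completion back up */
                wake_up(&kiobuf->wait_queue);
        }
}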

> the presence of wait_queue AND end_io,

That's fine, I'm happy scrapping the wait queue: people can always use
the kiobuf private data field to refer to a wait queue if they want
to.

> and the lack of
> scatter gather in one kiobuf struct (you always need an array)

Again, _all_ data being sent down through the block device layer is
either in buffer heads or is page aligned.  You want us to triple the
size of the "heavyweight" kiobuf's data vector for what gain, exactly?
Obviously, extra code will be needed to scan kiobufs if we do that,
and unless we have both per-page _and_ per-kiobuf start/offset pairs
(adding even further to the complexity), those scatter-gather lists
would prevent us from carving up a kiobuf into smaller sub-ios without
copying the whole (expanded) vector.

That's a _lot_ of extra complexity in the disk IO layers.

I'm all for a fast kiobuf_to_sglist converter.  But I haven't seen any
evidence that such scatter-gather lists will do anything in the block
device case except complicate the code and decrease performance.
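
Such a converter is nearly trivial, for what it's worth (hypothetical
helper; 2.4 scatterlists are still virtual-address based):

static int kiobuf_to_sglist(struct kiobuf *kiobuf, struct scatterlist *sg)
{
        int i, off = kiobuf->offset, left = kiobuf->length;

        for (i = 0; i < kiobuf->nr_pages && left > 0; i++) {
                int bytes = PAGE_SIZE - off;

                if (bytes > left)
                        bytes = left;
                sg[i].address     = (char *)page_address(kiobuf->maplist[i]) + off;
                sg[i].alt_address = NULL;
                sg[i].length      = bytes;
                left -= bytes;
                off = 0;                /* only the first page is offset */
        }
        return i;                       /* number of sg entries filled */
}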

> Something like:
...
> makes it a lot simpler for the subsystems to integrate.

Possibly, but I remain to be convinced, because you may end up with a
mechanism which is generic but is not well-tuned for any specific
case, so everything goes slower.

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:49           ` Christoph Hellwig
  2001-02-01 17:58             ` Alan Cox
@ 2001-02-01 19:33             ` Stephen C. Tweedie
  1 sibling, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 19:33 UTC (permalink / raw)
  To: Alan Cox, Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 06:49:50PM +0100, Christoph Hellwig wrote:
> 
> > Adding tons of base/limit pairs to kiobufs makes it worse not better
> 
> For disk I/O it makes the handling a little easier at the cost of the
> additional offset/length fields.

Umm, actually, no, it makes it much worse for many of the cases.  

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:41       ` Stephen C. Tweedie
  2001-02-01 18:14         ` Christoph Hellwig
@ 2001-02-01 20:04         ` Chaitanya Tumuluri
  1 sibling, 0 replies; 186+ messages in thread
From: Chaitanya Tumuluri @ 2001-02-01 20:04 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > > 
> > > > No, and with the current kiobufs it would not make sense, because they
> > > > are too heavy-weight.
> > > 
> > > Really?  In what way?  
> > 
> > We can't allocate a huge kiobuf structure just for requesting one page of
> > IO.  It might get better with VM-level IO clustering though.
> 
> A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> allocate a buffer_head per block for all IO.
> 
> A kiobuf contains enough embedded page vector space for 16 pages by
> default, but I'm happy enough to remove that if people want.  However,
> note that that memory is not initialised, so there is no memory access
> cost at all for that empty space.  Remove that space and instead of
> one memory allocation per kiobuf, you get two, so the cost goes *UP*
> for small IOs.
> 
> > > > With page,length,offset iobufs this makes sense
> > > > and is IMHO the way to go.
> > > 
> > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > > lean enough to do the job??
> > 
> > No.  I was speaking abou the light-weight kiobuf Linux & Me discussed on
> > lkml some time ago (though I'd much more like to call it kiovec analogous
> > to BSD iovecs).
> 
> What is so heavyweight in the current kiobuf (other than the embedded
> vector, which I've already noted I'm willing to cut)?


Hi,

It'd seem that "array_len", "locked", "bounced", "io_count" and "errno" 
are the fields that need to go away (apart from the "maplist").

The field "nr_pages" would reincarnate in the kiovec struct (which is
not a plain array anymore) as the field "nbufs". See below.

Based on what I've seen fly by on the lists here's my understanding of 
the proposed new kiobuf/kiovec structures:

===========================================================================
/*
 * a simple page,offset,length tuple like Linus wants it
 */
struct kiobuf {
	struct page *   page;   /* The page itself               */
	u16             offset; /* Offset to start of valid data */
	u16             length; /* Number of valid bytes of data */
};

struct kiovec {
	int             nbufs;          /* Kiobufs actually referenced */
	struct kiobuf * bufs;
};

/*
 * the name is just plain stupid, but that shouldn't matter
 */
struct vfs_kiovec {
        struct kiovec * iov;

        /* private data, mostly for the callback */
        void * private;

        /* completion callback */
        void (*end_io)  (struct vfs_kiovec *);
        wait_queue_head_t wait_queue;
};
===========================================================================
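
For concreteness, a completion handler walking these structures might
look roughly like this (an illustrative sketch; process_segment() is a
made-up consumer, and the layout is the one proposed above):

/* Sketch: walk the proposed vfs_kiovec once the IO completes. */
static void example_end_io(struct vfs_kiovec *vio)
{
	int i;

	for (i = 0; i < vio->iov->nbufs; i++) {
		struct kiobuf *kb = &vio->iov->bufs[i];

		/* each tuple is one contiguous run of valid bytes */
		process_segment(kb->page, kb->offset, kb->length);
	}
	wake_up(&vio->wait_queue);	/* wake any synchronous waiter */
}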

Is this correct? 

If so, I have a few questions/clarifications:

	- The [ll_rw_blk, scsi/ide request-functions, scsi/ide 
	  I/O completion handling] functions would be handed the 
	  "X_kiovec" struct, correct?

	- So, the soft-RAID / LVM layers need to construct their 
	  own "lvm_kiovec" structs to handle request splits and
	  the partial completions, correct? 

	- Then, what are the semantics of request-merges containing 
	  the "X_kiovec" structs in the block I/O queueing layers?
	  Do we add "X_kiovec->next", "X_kiovec->prev" etc. fields?

	  It will also require a re-allocation of a new and longer
	  kiovec->bufs array, correct?
	  
	- How are I/O error codes to be propagated back to the 
	  higher (calling) layers? I think that needs to be added
	  into the "X_kiovec" struct, no?

	- How is bouncing to be handled with this setup? (some state 
	  is needed to (a) determine that bouncing occurred, (b) find 
	  out which pages have been bounced and, (c) find out the 
	  bounce-page for each of these bounced pages).

Cheers,
-Chait.







* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:49           ` Stephen C. Tweedie
  2001-02-01 17:09             ` Chaitanya Tumuluri
@ 2001-02-01 20:33             ` Christoph Hellwig
  2001-02-01 20:56               ` Steve Lord
                                 ` (2 more replies)
  1 sibling, 3 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 20:33 UTC (permalink / raw)
  To: "Stephen C. Tweedie"
  Cc: Steve Lord, linux-kernel, kiobuf-io-devel@lists.sourceforge.net, Alan Cox

In article <20010201174946.B11607@redhat.com> you wrote:
> Hi,

> On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks).  The data which the user
> started with is the data sent out on the wire.  You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer. 

Though your description is right, I don't think the case is very common:
sometimes in LVM on a pv boundary, and maybe sometimes in the scsi code.

In raid1 you need some kind of clone iobuf, which should work with both
cases.  In raid0 you need a completely new pagelist anyway, same for raid5.


> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list.  If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

Sure you could: you embed that information in a higher-level structure.
I think you want the whole kio concept only for disk-like IO.  Then many
of the things you do are completely right and I don't see many problems
(besides thinking that some things may go away - but that's no major point).

With a generic object that is used across subsystem boundaries, things are
different.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 19:32           ` Stephen C. Tweedie
@ 2001-02-01 20:46             ` Christoph Hellwig
  2001-02-01 21:25               ` Stephen C. Tweedie
  0 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 20:46 UTC (permalink / raw)
  To: "Stephen C. Tweedie"; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

In article <20010201193221.D11607@redhat.com> you wrote:
> Buffer_heads are _sometimes_ used for caching data.

Actually they are mostly used for caching, but that shouldn't have any
bearing on the discussion...

> That's one of the
> big problems with them, they are too overloaded, being both IO
> descriptors _and_ cache descriptors.

Agreed.

> If you've got 128k of data to
> write out from user space, do you want to set up one kiobuf or 256
> buffer_heads?  Buffer_heads become really very heavy indeed once you
> start doing non-trivial IO.

Sure - I was never arguing in favor of buffer_head's ...

>> > What is so heavyweight in the current kiobuf (other than the embedded
>> > vector, which I've already noted I'm willing to cut)?
>> 
>> array_len

> kiobufs can be reused after IO.  You can depopulate a kiobuf,
> repopulate it with new pages and submit new IO without having to
> deallocate the kiobuf.  You can't do this without knowing how big the
> data vector is.  Removing that functionality will prevent reuse,
> making them _more_ heavyweight.
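
(In terms of the existing 2.4 entry points, the reuse cycle above looks
roughly like this; a sketch, with the addresses and lengths invented
and all error handling omitted:)

	struct kiobuf *iobuf;

	alloc_kiovec(1, &iobuf);		/* one allocation...        */

	map_user_kiobuf(READ, iobuf, addr1, len1);	/* populate         */
	/* ...submit IO, wait for completion... */
	unmap_kiobuf(iobuf);				/* depopulate       */

	map_user_kiobuf(READ, iobuf, addr2, len2);	/* repopulate: no   */
	/* ...submit second IO... */			/* reallocation     */
	unmap_kiobuf(iobuf);

	free_kiovec(1, &iobuf);			/* ...freed only at the end */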

>> io_count,

> Right now we can take a kiobuf and turn it into a bunch of
> buffer_heads for IO.  The io_count lets us track all of those sub-IOs
> so that we know when all submitted IO has completed, so that we can
> pass the completion callback back up the chain without having to
> allocate yet more descriptor structs for the IO.

> Again, remove this and the IO becomes more heavyweight because we need
> to create a separate struct for the info.

No.  Just allow passing a multiple of the device's blocksize over
ll_rw_block.  XFS is doing that, and it just needs an audit of the lesser
used block drivers.

>> and the lack of
>> scatter gather in one kiobuf struct (you always need an array)

> Again, _all_ data being sent down through the block device layer is
> either in buffer heads or is page aligned.

That's the point.  You are always talking about the block layer only.
And I think it should be generic instead.
Looks like that is the major point.

> You want us to triple the
> size of the "heavyweight" kiobuf's data vector for what gain, exactly?

double.

> Obviously, extra code will be needed to scan kiobufs if we do that,
> and unless we have both per-page _and_ per-kiobuf start/offset pairs
> (adding even further to the complexity), those scatter-gather lists
> would prevent us from carving up a kiobuf into smaller sub-ios without
> copying the whole (expanded) vector.

No.  I think I explained that in my last mail.

> That's a _lot_ of extra complexity in the disk IO layers.

> Possibly, but I remain to be convinced, because you may end up with a
> mechanism which is generic but is not well-tuned for any specific
> case, so everything goes slower.

As kiobufs are widely used for real IO, just as containers, this is
better than nothing.
And IMHO a nice generic concept that lets different subsystems work
together is a _lot_ better than a bunch of over-optimized, rather isolated
subsystems.  The IO-Lite people have done nice research on the effect of
a unified IO-caching system vs. the typical isolated systems.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:33             ` Christoph Hellwig
@ 2001-02-01 20:56               ` Steve Lord
  2001-02-01 20:59                 ` Christoph Hellwig
  2001-02-01 21:44               ` Stephen C. Tweedie
  2001-02-01 22:07               ` Stephen C. Tweedie
  2 siblings, 1 reply; 186+ messages in thread
From: Steve Lord @ 2001-02-01 20:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: "Stephen C. Tweedie",
	Steve Lord, linux-kernel,
	kiobuf-io-devel@lists.sourceforge.net, Alan Cox

> In article <20010201174946.B11607@redhat.com> you wrote:
> > Hi,
> 
> > On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks).  The data which the user
> > started with is the data sent out on the wire.  You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer. 
> 
> Though your description is right, I don't think the case is very common:
> sometimes in LVM on a pv boundary, and maybe sometimes in the scsi code.


And if you are writing to a striped volume via a filesystem which can do
its own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
is striped on 64K boundaries.

Steve



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:56               ` Steve Lord
@ 2001-02-01 20:59                 ` Christoph Hellwig
  2001-02-01 21:17                   ` Steve Lord
  0 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-01 20:59 UTC (permalink / raw)
  To: Steve Lord; +Cc: Stephen C . Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox

On Thu, Feb 01, 2001 at 02:56:47PM -0600, Steve Lord wrote:
> And if you are writing to a striped volume via a filesystem which can do
> its own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
> is striped on 64K boundaries.

But usually I want to have pages 0-63, 128-191, etc. together, because
they are contiguous on disk, no?

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:59                 ` Christoph Hellwig
@ 2001-02-01 21:17                   ` Steve Lord
  0 siblings, 0 replies; 186+ messages in thread
From: Steve Lord @ 2001-02-01 21:17 UTC (permalink / raw)
  To: Steve Lord, Stephen C . Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox

> On Thu, Feb 01, 2001 at 02:56:47PM -0600, Steve Lord wrote:
> > And if you are writing to a striped volume via a filesystem which can do
> > its own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
> > is striped on 64K boundaries.
> 
> But usually I want to have pages 0-63, 128-191, etc. together, because
> they are contiguous on disk, no?

I was just giving an example of how kiobufs might need splitting up more often
than you think; crossing a stripe boundary is one obvious case.  Yes, you do
want to keep the pages which are contiguous on disk together, but you will
often get requests which cover multiple stripes; otherwise you don't really
get much out of striping and may as well just concatenate drives.

Ideally the file is striped across the various disks in the volume, and one
large write (direct or from the cache) gets scattered across the disks. All
the I/O's run in parallel (and on different controllers if you have the 
budget).
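
The split itself is simple arithmetic; a sketch with invented names
(STRIPE_SIZE, NR_DISKS and submit_chunk() are all illustrative),
assuming 64K stripes:

/* Sketch only: carve a logical (offset, len) request at 64K stripe
 * boundaries and hand each piece to the disk that owns its chunk. */
#define STRIPE_SIZE	(64 * 1024)
#define NR_DISKS	4

static void split_by_stripe(unsigned long offset, unsigned long len)
{
	while (len) {
		/* bytes left in the stripe chunk containing 'offset' */
		unsigned long room = STRIPE_SIZE - (offset % STRIPE_SIZE);
		unsigned long n = len < room ? len : room;
		int disk = (offset / STRIPE_SIZE) % NR_DISKS;

		submit_chunk(disk, offset, n);
		offset += n;
		len -= n;
	}
}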

Steve

> 
> 	Christoph
> 
> -- 
> Of course it doesn't work. We've performed a software upgrade.



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:46             ` Christoph Hellwig
@ 2001-02-01 21:25               ` Stephen C. Tweedie
  2001-02-02 11:51                 ` Christoph Hellwig
  0 siblings, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 21:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 09:46:27PM +0100, Christoph Hellwig wrote:

> > Right now we can take a kiobuf and turn it into a bunch of
> > buffer_heads for IO.  The io_count lets us track all of those sub-IOs
> > so that we know when all submitted IO has completed, so that we can
> > pass the completion callback back up the chain without having to
> > allocate yet more descriptor structs for the IO.
> 
> > Again, remove this and the IO becomes more heavyweight because we need
> > to create a separate struct for the info.
> 
> No.  Just allow passing a multiple of the device's blocksize over
> ll_rw_block.

That was just one example: you need the sub-ios just as much when
you split up an IO over stripe boundaries in LVM or raid0, for
example.  Secondly, ll_rw_block needs to die anyway: you can expand
the blocksize up to PAGE_SIZE but not beyond, whereas something like
ll_rw_kiobuf can submit a much larger IO atomically (and we have
devices which don't start to deliver good throughput until you use
IO sizes of 1MB or more).

> >> and the lack of
> >> scatter gather in one kiobuf struct (you always need an array)
> 
> > Again, _all_ data being sent down through the block device layer is
> > either in buffer heads or is page aligned.
> 
> That's the point.  You are always talking about the block-layer only.

I'm talking about why the minimal, generic solution doesn't provide
what the block layer needs.


> > Obviously, extra code will be needed to scan kiobufs if we do that,
> > and unless we have both per-page _and_ per-kiobuf start/offset pairs
> > (adding even further to the complexity), those scatter-gather lists
> > would prevent us from carving up a kiobuf into smaller sub-ios without
> > copying the whole (expanded) vector.
> 
> No.  I think I explained that in my last mail.

How?

If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
to split it in two, I have to make two new vectors (page X, offset 0,
length n) and (page X, offset n, length PAGE_SIZE-n).  That implies
copying both vectors.

If I have a page vector with a single offset/length pair, I can build
a new header with the same vector and modified offset/length to split
the vector in two without copying it.
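
In code, that clone-and-narrow step amounts to the following sketch,
where kiobuf_clone() is a hypothetical helper that copies only the
header and shares the parent's page list:

	struct kiobuf *a = kiobuf_clone(parent);
	struct kiobuf *b = kiobuf_clone(parent);

	a->offset = parent->offset;		/* first n bytes */
	a->length = n;

	b->offset = parent->offset + n;		/* the remainder */
	b->length = parent->length - n;

	/* a, b and parent all reference the same page list; the
	 * vector itself is never copied. */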

> > Possibly, but I remain to be convinced, because you may end up with a
> > mechanism which is generic but is not well-tuned for any specific
> > case, so everything goes slower.
> 
> As kiobufs are widely used for real IO, just as containers, this is
> better than nothing.

Surely having all of the subsystems working fast is better still?

> And IMHO a nice generic concept that lets different subsystems work
> together is a _lot_ better than a bunch of over-optimized, rather isolated
> subsystems.  The IO-Lite people have done nice research on the effect of
> a unified IO-caching system vs. the typical isolated systems.

I know, and IO-Lite has some major problems (the close integration of
that code into the cache, for example, makes it harder to expose the
zero-copy to user-land).

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:33             ` Christoph Hellwig
  2001-02-01 20:56               ` Steve Lord
@ 2001-02-01 21:44               ` Stephen C. Tweedie
  2001-02-01 22:07               ` Stephen C. Tweedie
  2 siblings, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 21:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
> 
> > On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks).  The data which the user
> > started with is the data sent out on the wire.  You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer. 
> 
> Though your description is right, I don't think the case is very common:
> sometimes in LVM on a pv boundary, and maybe sometimes in the scsi code.

On raid0 stripes, it's common to have stripes of between 16k and 64k,
so it's rather more common there than you'd like.  In any case, you
need the code to handle it, and I don't want to make the code paths
any more complex than necessary.

> In raid1 you need some kind of clone iobuf, which should work with both
> cases.  In raid0 you need a completely new pagelist anyway

No you don't.  You take the existing one, specify which region of it
is going to the current stripe, and send it off.  Nothing more.

> > In that case, having offset/len as the kiobuf limit markers is ideal:
> > you can clone a kiobuf header using the same page vector as the
> > parent, narrow down the start/end points, and continue down the stack
> > without having to copy any part of the page list.  If you had the
> > offset/len data encoded implicitly into each entry in the sglist, you
> > would not be able to do that.
> 
> Sure you could: you embedd that information in a higher-level structure.

What's the point in a common data container structure if you need
higher-level information to make any sense out of it?

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:33             ` Christoph Hellwig
  2001-02-01 20:56               ` Steve Lord
  2001-02-01 21:44               ` Stephen C. Tweedie
@ 2001-02-01 22:07               ` Stephen C. Tweedie
  2001-02-02 12:02                 ` Christoph Hellwig
  2001-02-03 20:28                 ` Linus Torvalds
  2 siblings, 2 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 22:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox, Linus Torvalds

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:

> I think you want the whole kio concept only for disk-like IO.  

No.  I want something good for zero-copy IO in general, but a lot of
that concerns the problem of interacting with the user, and the basic
center of that interaction in 99% of the interesting cases is either a
user VM buffer or the page cache --- all of which are page-aligned.  

If you look at the sorts of models being proposed (even by Linus) for
splice, you get

	len = prepare_read();
	prepare_write();
	pull_fd();
	commit_write();

in which the read is being pulled into a known location in the page
cache -- it's page-aligned, again.  I'm perfectly willing to accept
that there may be a need for scatter-gather boundaries including
non-page-aligned fragments in this model, but I can't see one if
you're using the page cache as a mediator, nor if you're doing it
through a user mmapped buffer.

The only reason you need finer scatter-gather boundaries --- and it
may be a compelling reason --- is if you are merging multiple IOs
together into a single device-level IO.  That makes perfect sense for
the zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
doesn't help the existing SGI kiobuf block device code, because that
performs its merging in the filesystem layers and the block device
code just squirts the IOs to the wire as-is, but if we want to start
merging those kiobuf-based IOs within make_request() then the block
device layer may want it too.

And Linus is right, the old way of using a *kiobuf[] for that was
painful, but the solution of adding start/length to every entry in
the page vector just doesn't sit right with many components of the
block device environment either.

I may still be persuaded that we need the full scatter-gather list
fields throughout, but for now I tend to think that, at least in the
disk layers, we may get cleaner results by allow linked lists of
page-aligned kiobufs instead.  That allows for merging of kiobufs
without having to copy all of the vector information each time.
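
A rough picture of that alternative ('next' is an assumed new field,
not something the current kiobuf has):

/* Sketch: merge two page-aligned kiobufs by chaining their headers;
 * no vector information is copied. */
static void kiobuf_merge(struct kiobuf *head, struct kiobuf *tail)
{
	while (head->next)
		head = head->next;
	head->next = tail;	/* O(chain length), zero vector copies */
}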

The killer, however, is what happens if you want to split such a
merged kiobuf.  Right now, that's something that I can only imagine
happening in the block layers if we start encoding buffer_head chains
as kiobufs, but if we do that in the future, or if we start merging
genuine kiobuf requests, then doing that split later on (for
raid0 etc) may require duplicating whole chains of kiobufs.  At that
point, just doing scatter-gather lists is cleaner.

But for now, the way to picture what I'm trying to achieve is that
kiobufs are a bit like buffer_heads --- they represent the physical
pages of some VM object that a higher layer has constructed, such as
the page cache or a user VM buffer.  You can chain these objects
together for IO, but that doesn't stop the individual objects from
being separate entities with independent IO completion callbacks to be
honoured.  

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:14         ` Christoph Hellwig
  2001-02-01 18:25           ` Alan Cox
  2001-02-01 19:32           ` Stephen C. Tweedie
@ 2001-02-02  4:18           ` bcrl
  2001-02-02 12:12             ` Christoph Hellwig
  2 siblings, 1 reply; 186+ messages in thread
From: bcrl @ 2001-02-02  4:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Christoph Hellwig wrote:

> A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
> used for caching data, a kiobuf is not.

Go measure the cost of a distant cache miss, then complain about having
everything in one structure.  Also, 1 kiobuf maps 16-128 times as much
data as a single buffer head.

> enum kio_flags {
> 	KIO_LOANED,     /* the calling subsystem wants this buf back    */
> 	KIO_GIFTED,     /* thanks for the buffer, man!                  */
> 	KIO_COW         /* copy on write (XXX: not yet)                 */
> };

This is a Really Bad Idea.  Having semantics depend on a subtle flag
determined by a caller is a sure way to end up with subtle bugs.

>
>
> struct kio {
> 	struct kiovec *         kio_data;       /* our kiovecs          */
> 	int                     kio_ndata;      /* # of kiovecs         */
> 	int                     kio_flags;      /* loaned or gifted?    */
> 	void *                  kio_priv;       /* caller private data  */
> 	wait_queue_head_t       kio_wait;	/* wait queue           */
> };
>
> makes it a lot simpler for the subsytems to integrate.

Keep in mind that using distant memory allocations for kio_data will incur
additional cache misses.  The atomic count is probably going to be widely
used; I see it being applicable to the network stack, block io layers and
others.  Also, how is information about io completion status passed back
to the caller?  That information is required across layers so that io can
be properly aborted or proceed with the partial amount of io.  Add those
back in and we're right back to the original kiobuf structure.

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 21:25               ` Stephen C. Tweedie
@ 2001-02-02 11:51                 ` Christoph Hellwig
  2001-02-02 14:04                   ` Stephen C. Tweedie
  0 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-02 11:51 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 09:25:08PM +0000, Stephen C. Tweedie wrote:
> > No.  Just allow passing a multiple of the device's blocksize over
> > ll_rw_block.
> 
> That was just one example: you need the sub-ios just as much when
> you split up an IO over stripe boundaries in LVM or raid0, for
> example.

IIRC that's why you designed (and I thought of independently) clone-kiobufs.

> Secondly, ll_rw_block needs to die anyway: you can expand
> the blocksize up to PAGE_SIZE but not beyond, whereas something like
> ll_rw_kiobuf can submit a much larger IO atomically (and we have
> devices which don't start to deliver good throughput until you use
> IO sizes of 1MB or more).

Completly agreed.

> If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
> to split it in two, I have to make two new vectors (page X, offset 0,
> length n) and (page X, offset n, length PAGE_SIZE-n).  That implies
> copying both vectors.
> 
> If I have a page vector with a single offset/length pair, I can build
> a new header with the same vector and modified offset/length to split
> the vector in two without copying it.

You just say in the higher-level structure: ignore everything from x to y,
even if the entries have an offset in their own vector.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 22:07               ` Stephen C. Tweedie
@ 2001-02-02 12:02                 ` Christoph Hellwig
  2001-02-05 12:19                   ` Stephen C. Tweedie
  2001-02-03 20:28                 ` Linus Torvalds
  1 sibling, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-02 12:02 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox, Linus Torvalds

On Thu, Feb 01, 2001 at 10:07:44PM +0000, Stephen C. Tweedie wrote:
> No.  I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache --- all of which are page-aligned.

Yes.

> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
> 
> 	len = prepare_read();
> 	prepare_write();
> 	pull_fd();
> 	commit_write();

Yepp.

> in which the read is being pulled into a known location in the page
> cache -- it's page-aligned, again.  I'm perfectly willing to accept
> that there may be a need for scatter-gather boundaries including
> non-page-aligned fragments in this model, but I can't see one if
> you're using the page cache as a mediator, nor if you're doing it
> through a user mmapped buffer.

True.

> The only reason you need finer scatter-gather boundaries --- and it
> may be a compelling reason --- is if you are merging multiple IOs
> together into a single device-level IO.  That makes perfect sense for
> the zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
> doesn't help the existing SGI kiobuf block device code, because that
> performs its merging in the filesystem layers and the block device
> code just squirts the IOs to the wire as-is,

Yes - but that is no solution for a generic model.  AFAICS even XFS
falls back to buffer_heads for small requests.

> but if we want to start
> merging those kiobuf-based IOs within make_request() then the block
> device layer may want it too.

Yes.

> And Linus is right, the old way of using a *kiobuf[] for that was
> painful, but the solution of adding start/length to every entry in
> the page vector just doesn't sit right with many components of the
> block device environment either.

What do you think is the alternative?

> I may still be persuaded that we need the full scatter-gather list
> fields throughout, but for now I tend to think that, at least in the
> disk layers, we may get cleaner results by allow linked lists of
> page-aligned kiobufs instead.  That allows for merging of kiobufs
> without having to copy all of the vector information each time.

But it will have the same problems as the array solution: there will
be one complete kio structure for each kiobuf, with its own end_io
callback, etc.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-02  4:18           ` bcrl
@ 2001-02-02 12:12             ` Christoph Hellwig
  0 siblings, 0 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-02 12:12 UTC (permalink / raw)
  To: bcrl
  Cc: Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel,
	kiobuf-io-devel

On Thu, Feb 01, 2001 at 11:18:56PM -0500, bcrl@redhat.com wrote:
> On Thu, 1 Feb 2001, Christoph Hellwig wrote:
> 
> > A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
> > used for caching data, a kiobuf is not.
> 
> Go measure the cost of a distant cache miss, then complain about having
> everything in one structure.  Also, 1 kiobuf maps 16-128 times as much
> data as a single buffer head.

I'd never dispute that.  It was just an answer to Stephen's "a kiobuf is
already smaller".

> > enum kio_flags {
> > 	KIO_LOANED,     /* the calling subsystem wants this buf back    */
> > 	KIO_GIFTED,     /* thanks for the buffer, man!                  */
> > 	KIO_COW         /* copy on write (XXX: not yet)                 */
> > };
> 
> This is a Really Bad Idea.  Having semantics depend on a subtle flag
> determined by a caller is a sure way to

The semantics aren't different for the using subsystem.  LOANED vs GIFTED
is an issue for the free function; COW will probably be a page-level mm
thing - though I haven't thought a lot about it yet and am not sure whether
it actually makes sense.

> 
> >
> >
> > struct kio {
> > 	struct kiovec *         kio_data;       /* our kiovecs          */
> > 	int                     kio_ndata;      /* # of kiovecs         */
> > 	int                     kio_flags;      /* loaned or giftet?    */
> > 	int                     kio_flags;      /* loaned or gifted?    */
> > 	wait_queue_head_t       kio_wait;	/* wait queue           */
> > };
> >
> > makes it a lot simpler for the subsytems to integrate.
> 
> Keep in mind that using distant memory allocations for kio_data will incur
> additional cache misses.

It could also be a [0] array at the end, allowing for a single allocation,
but that looks more like an implementation detail than a design problem to me.
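
That is, the standard zero-length-array trick; a sketch, with names
loosely following the struct kio above:

/* Sketch: one kmalloc covers header plus vector, so reaching
 * kio_data costs no extra cache miss. */
struct kio {
	int		kio_ndata;	/* # of kiovecs           */
	void		*kio_priv;	/* caller private data    */
	struct kiovec	kio_data[0];	/* allocated inline below */
};

static struct kio *kio_alloc(int ndata)
{
	struct kio *kio;

	kio = kmalloc(sizeof(*kio) + ndata * sizeof(struct kiovec),
		      GFP_KERNEL);
	if (kio)
		kio->kio_ndata = ndata;
	return kio;
}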

> The atomic count is probably going to be widely
> used; I see it being applicable to the network stack, block io layers and
> others.

Hmm.  Currently it is used only for the multiple buffer_heads per iobuf
cruft, and I don't see why multiple outstanding IOs should be noted in a
kiobuf.

> Also, how is information about io completion status passed back
> to the caller?

Yes, there needs to be a kio_errno field - though I wanted to get rid of
it, I had to re-add it in later versions of my design.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-02 11:51                 ` Christoph Hellwig
@ 2001-02-02 14:04                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-02 14:04 UTC (permalink / raw)
  To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

Hi,

On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
> > 
> > If I have a page vector with a single offset/length pair, I can build
> > a new header with the same vector and modified offset/length to split
> > the vector in two without copying it.
> 
> You just say in the higher-level structure ignore from x to y even if
> they have an offset in their own vector.

Exactly --- and so you end up with something _much_ uglier, because
you end up with all sorts of combinations of length/offset fields all
over the place.

This is _precisely_ the mess I want to avoid.

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 22:07               ` Stephen C. Tweedie
  2001-02-02 12:02                 ` Christoph Hellwig
@ 2001-02-03 20:28                 ` Linus Torvalds
  2001-02-05 11:03                   ` Stephen C. Tweedie
  1 sibling, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-03 20:28 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox



On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> 
> On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
> 
> > I think you want the whole kio concept only for disk-like IO.  
> 
> No.  I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache --- all of which are page-aligned.  
> 
> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
> 
> 	len = prepare_read();
> 	prepare_write();
> 	pull_fd();
> 	commit_write();
> 
> in which the read is being pulled into a known location in the page
> cache -- it's page-aligned, again.

Wrong.

Neither the read nor the write are page-aligned. I don't know where you
got that idea. It's obviously not true even in the common case: it depends
_entirely_ on what the file offsets are, and expecting the offset to be
zero is just being stupid. It's often _not_ zero. With networking it is in
fact seldom zero, because the network packets are seldom aligned either in
size or in location.

Also, there are many reasons why "page" may have different meanings.  We
will eventually have a page cache where the page-cache granularity is not
the same as the user-level visible one. User-level may do mmap at 4kB
boundaries, even if the page cache itself uses 8kB or 16kB pages.

THERE IS NO PAGE-ALIGNMENT. And anything that even _mentions_ the word
page-aligned is going into my trash-can faster than you can say "bug".

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-03 20:28                 ` Linus Torvalds
@ 2001-02-05 11:03                   ` Stephen C. Tweedie
  2001-02-05 12:00                     ` Manfred Spraul
  2001-02-05 16:36                     ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Linus Torvalds
  0 siblings, 2 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 11:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox

Hi,

On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
> 
> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> > 
> Neither the read nor the write are page-aligned. I don't know where you
> got that idea. It's obviously not true even in the common case: it depends
> _entirely_ on what the file offsets are, and expecting the offset to be
> zero is just being stupid. It's often _not_ zero. With networking it is in
> fact seldom zero, because the network packets are seldom aligned either in
> size or in location.

The underlying buffer is.  The VFS (and the current kiobuf code) is
already happy about IO happening at odd offsets within a page.
However, the more general case --- doing zero-copy IO on arbitrary
unaligned buffers --- simply won't work if you expect to be able to
push those buffers to disk without a copy.  

The splice case you talked about is fine because it's doing the normal
prepare/commit logic where the underlying buffer is page aligned, even
if the splice IO is not to a page aligned location.  That's _exactly_
what kiobufs were intended to support.  The prepare_read/prepare_write/
pull/push cycle lets the caller tell the pull() function where to
store its data, becausse there are alignment constraints which just
can't be ignored: you simply cannot do physical disk IO on
non-sector-aligned memory or in chunks which aren't a multiple of
sector size.  (The buffer address alignment can sometimes be relaxed
--- obviously if you're doing PIO then it doesn't matter --- but the
length granularity is rigidly enforced.)
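
Stated in code, the granularity rule is just this; a sketch reusing the
kiovec layout proposed earlier in the thread, assuming 512-byte sectors:

/* Sketch: reject any scatter-gather element whose length is not a
 * whole number of sectors. */
#define SECTOR_SIZE	512

static int sg_lengths_ok(const struct kiovec *iov)
{
	int i;

	for (i = 0; i < iov->nbufs; i++)
		if (iov->bufs[i].length % SECTOR_SIZE)
			return 0;	/* would need a bounce copy */
	return 1;
}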
 
Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify  + callback chains
  2001-02-05 11:03                   ` Stephen C. Tweedie
@ 2001-02-05 12:00                     ` Manfred Spraul
  2001-02-05 15:03                       ` Stephen C. Tweedie
  2001-02-05 16:56                       ` Linus Torvalds
  2001-02-05 16:36                     ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Linus Torvalds
  1 sibling, 2 replies; 186+ messages in thread
From: Manfred Spraul @ 2001-02-05 12:00 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Christoph Hellwig, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox

"Stephen C. Tweedie" wrote:
> 
> You simply cannot do physical disk IO on
> non-sector-aligned memory or in chunks which aren't a multiple of
> sector size.

Why not?

Obviously the disk access itself must be sector aligned and the total
length must be a multiple of the sector length, but there shouldn't be
any restrictions on the data buffers.

I remember that even Windoze 95 has scatter-gather support for physical
disk IO with arbitrary buffer chunks. (If the hardware supports it,
otherwise the io subsystem will copy the data into a contiguous
temporary buffer)

--
	Manfred

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-02 12:02                 ` Christoph Hellwig
@ 2001-02-05 12:19                   ` Stephen C. Tweedie
  2001-02-05 21:28                     ` Ingo Molnar
  0 siblings, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 12:19 UTC (permalink / raw)
  To: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox, Linus Torvalds

Hi,

On Fri, Feb 02, 2001 at 01:02:28PM +0100, Christoph Hellwig wrote:
> 
> > I may still be persuaded that we need the full scatter-gather list
> > fields throughout, but for now I tend to think that, at least in the
> > disk layers, we may get cleaner results by allow linked lists of
> > page-aligned kiobufs instead.  That allows for merging of kiobufs
> > without having to copy all of the vector information each time.
> 
> But it will have the same problems as the array soloution: there will
> be one complete kio structure for each kiobuf, with it's own end_io
> callback, etc.

And what's the problem with that?

You *need* this.  You have to have that multiple-completion concept in
the disk layers.  Think about chains of buffer_heads being sent to
disk as a single IO --- you need to know which buffers make it to disk
successfully and which had IO errors.

And no, the IO success is *not* necessarily sequential from the start
of the IO: if you are doing IO to raid0, for example, and the IO gets
striped across two disks, you might find that the first disk gets an
error so the start of the IO fails but the rest completes.  It's the
completion code which notifies the caller of what worked and what did
not.

And for readahead, you want to notify the caller as early as possible
about completion for the first part of the IO, even if the device
driver is still processing the rest.

Multiple completions are a necessary feature of the current block
device interface.  Removing that would be a step backwards.
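
The bookkeeping behind this is the usual split-IO pattern; a sketch
using the existing kiobuf field names, with sub_io_end() itself
invented:

/* Sketch: each of N sub-IOs finishes here; the last one to complete
 * fires the parent's callback with the accumulated status. */
static void sub_io_end(struct kiobuf *parent, int err)
{
	if (err && !parent->errno)
		parent->errno = err;		/* keep the first error  */
	if (atomic_dec_and_test(&parent->io_count))
		parent->end_io(parent);		/* all sub-IOs completed */
}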

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 12:00                     ` Manfred Spraul
@ 2001-02-05 15:03                       ` Stephen C. Tweedie
  2001-02-05 15:19                         ` Alan Cox
  2001-02-05 22:09                         ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Ingo Molnar
  2001-02-05 16:56                       ` Linus Torvalds
  1 sibling, 2 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 15:03 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen C. Tweedie, Linus Torvalds, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

Hi,

On Mon, Feb 05, 2001 at 01:00:51PM +0100, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> > 
> > You simply cannot do physical disk IO on
> > non-sector-aligned memory or in chunks which aren't a multiple of
> > sector size.
> 
> Why not?
> 
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

But there are.  Many controllers just break down and corrupt things
silently if you don't align the data buffers (Jeff Merkey found this
by accident when he started generating unaligned IOs within page
boundaries in his NWFS code).  And a lot of controllers simply cannot
break a sector dma over a page boundary (at least not without some
form of IOMMU remapping).

Yes, it's the sort of thing that you would hope should work, but in
practice it's not reliable.

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 15:03                       ` Stephen C. Tweedie
@ 2001-02-05 15:19                         ` Alan Cox
  2001-02-05 17:20                           ` Stephen C. Tweedie
  2001-02-05 22:09                         ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Ingo Molnar
  1 sibling, 1 reply; 186+ messages in thread
From: Alan Cox @ 2001-02-05 15:19 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Manfred Spraul, Stephen C. Tweedie, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox

> Yes, it's the sort of thing that you would hope should work, but in
> practice it's not reliable.

So the less smart devices need to call something like

	kiovec_align(kiovec, 512);

and have it do the bounce buffers ?
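
A kiovec_align() like that would be new code; a sketch of what it
might do, with attach_bounce() invented (a real version would also
have to remember the original pages so reads could be copied back):

/* Sketch: substitute a bounce buffer for any element that violates
 * the requested alignment/granularity. */
static int kiovec_align(struct kiovec *iov, int align)
{
	int i;

	for (i = 0; i < iov->nbufs; i++) {
		struct kiobuf *kb = &iov->bufs[i];

		if ((kb->offset | kb->length) & (align - 1))
			if (attach_bounce(kb))
				return -ENOMEM;
	}
	return 0;
}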



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 11:03                   ` Stephen C. Tweedie
  2001-02-05 12:00                     ` Manfred Spraul
@ 2001-02-05 16:36                     ` Linus Torvalds
  2001-02-05 19:08                       ` Stephen C. Tweedie
  1 sibling, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-05 16:36 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> 
> On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
> > 
> > Neither the read nor the write are page-aligned. I don't know where you
> > got that idea. It's obviously not true even in the common case: it depends
> > _entirely_ on what the file offsets are, and expecting the offset to be
> > zero is just being stupid. It's often _not_ zero. With networking it is in
> > fact seldom zero, because the network packets are seldom aligned either in
> > size or in location.
> 
> The underlying buffer is.  The VFS (and the current kiobuf code) is
> already happy about IO happening at odd offsets within a page.

Stephen. 

Don't bother even talking about this. You're so damn hung up about the
page cache that it's not funny.

Have you ever thought about other things, like networking, special
devices, stuff like that? They can (and do) have packet boundaries that
have nothing to do with pages what-so-ever. They can have such notions as
packets that contain multiple streams in one packet, where it ends up
being split up into several pieces. Where neither the original packet
_nor_ the final pieces have _anything_ to do with "pages".

THERE IS NO PAGE ALIGNMENT.

So stop blathering about it.

Of _course_ the current kiobuf code has page-alignment assumptions. You
_designed_ it that way. So bringing it up as an example is a circular
argument. And a really stupid one at that, as that's the thing I've been
quoting as the single biggest design bug in all of kiobufs. It's the thing
that makes them entirely useless for things like describing "struct
msghdr" etc. 

We should get _away_ from this page-alignment fallacy. It's not true. It's
not necessarily even true for the page cache - which has no real
fundamental reasons any more for not being able to be a "variable-size"
cache some time in the future (ie it might be a per-address-space decision
on whether the granularity is 1, 2, 4 or more pages).

Anything that designs for "everything is a page" will automatically be
limited for cases where you might sometimes have 64kB chunks of data.

Instead, just face the realization that "everything is a bunch of ranges",
and leave it at that.  It's true _already_ - think about fragmented IP
packets. We may not handle it that way completely yet, but the zero-copy
networking is going in this direction.
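
The range model is trivial to write down; illustrative only:

/* A range makes no alignment assumption at all: a fragmented IP
 * packet is just an array of these, one per fragment, wherever
 * each fragment happens to live. */
struct range {
	struct page	*page;		/* backing page                   */
	unsigned int	offset;		/* arbitrary start within it      */
	unsigned int	len;		/* arbitrary length, any boundary */
};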

And as long as you keep on harping about page alignment, you're not going
to play in this game. End of story. 

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify  + callback chains
  2001-02-05 12:00                     ` Manfred Spraul
  2001-02-05 15:03                       ` Stephen C. Tweedie
@ 2001-02-05 16:56                       ` Linus Torvalds
  2001-02-05 17:27                         ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox
  1 sibling, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-05 16:56 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox



On Mon, 5 Feb 2001, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> > 
> > You simply cannot do physical disk IO on
> > non-sector-aligned memory or in chunks which aren't a multiple of
> > sector size.
> 
> Why not?
> 
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

In fact, regular IDE DMA allows arbitrary scatter-gather at least in
theory. Linux has never used it, so I don't know how well it works in
practice - I would not be surprised if it ends up causing no end of nasty 
corner-cases that have bugs. It's not as if IDE controllers always follow 
the documentation ;)

The _total_ length of the buffers has to be a multiple of the sector
size, and there are some alignment issues (each scatter-gather area has to
be at least 16-bit aligned both in physical memory and in length, and
apparently many controllers need 32-bit alignment). And I'd almost be
surprised if there weren't hardware that wants cache alignment
because it always expects to burst.

But despite a lot of likely practical reasons why it won't work for
arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it
wouldn't. And there are bound to be better controllers that could handle
it.

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 15:19                         ` Alan Cox
@ 2001-02-05 17:20                           ` Stephen C. Tweedie
  2001-02-05 17:29                             ` Alan Cox
  0 siblings, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 17:20 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, Manfred Spraul, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, Feb 05, 2001 at 03:19:09PM +0000, Alan Cox wrote:
> > Yes, it's the sort of thing that you would hope should work, but in
> > practice it's not reliable.
> 
> So the less smart devices need to call something like
> 
> 	kiovec_align(kiovec, 512);
> 
> and have it do the bounce buffers ?

_All_ drivers would have to do that in the degenerate case, because
none of our drivers can deal with a dma boundary in the middle of a
sector, and even in those places where the hardware supports it in
theory, you are still often limited to word-alignment.

--Stephen


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 16:56                       ` Linus Torvalds
@ 2001-02-05 17:27                         ` Alan Cox
  0 siblings, 0 replies; 186+ messages in thread
From: Alan Cox @ 2001-02-05 17:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Manfred Spraul, Stephen C. Tweedie, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

> In fact, regular IDE DMA allows arbitrary scatter-gather at least in
> theory. Linux has never used it, so I don't know how well it works in

Purely in theory, as Jeff found out. 

> But despite a lot of likely practical reasons why it won't work for
> arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it
> wouldn't. And there are bound to be better controllers that could handle
> it.

I2O controllers are required to handle it (most don't), and some of the
high-end scsi/fc controllers even get it right.



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 17:20                           ` Stephen C. Tweedie
@ 2001-02-05 17:29                             ` Alan Cox
  2001-02-05 18:49                               ` Stephen C. Tweedie
  0 siblings, 1 reply; 186+ messages in thread
From: Alan Cox @ 2001-02-05 17:29 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

> > 	kiovec_align(kiovec, 512);
> > and have it do the bounce buffers ?
> 
> _All_ drivers would have to do that in the degenerate case, because
> none of our drivers can deal with a dma boundary in the middle of a
> sector, and even in those places where the hardware supports it in
> theory, you are still often limited to word-alignment.

That's true for _block_ disk devices, but if we want a generic kiovec,
then if I am going from video capture to network I don't need to force
anything more than 4-byte alignment.


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 17:29                             ` Alan Cox
@ 2001-02-05 18:49                               ` Stephen C. Tweedie
  2001-02-05 19:04                                 ` Alan Cox
  2001-02-05 19:09                                 ` Linus Torvalds
  0 siblings, 2 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 18:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, Manfred Spraul, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, Feb 05, 2001 at 05:29:47PM +0000, Alan Cox wrote:
> > 
> > _All_ drivers would have to do that in the degenerate case, because
> > none of our drivers can deal with a dma boundary in the middle of a
> > sector, and even in those places where the hardware supports it in
> > theory, you are still often limited to word-alignment.
> 
> Thats true for _block_ disk devices but if we want a generic kiovec then
> if I am going from video capture to network I dont need to force anything more
> than 4 byte align

Kiobufs have never, ever required the IO to be aligned on any
particular boundary.  They simply make the assumption that the
underlying buffered object can be described in terms of pages with
some arbitrary (non-aligned) start/offset.  Every video framebuffer
I've ever seen satisfies that, so you can easily map an arbitrary
contiguous region of the framebuffer with a kiobuf already.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 18:49                               ` Stephen C. Tweedie
@ 2001-02-05 19:04                                 ` Alan Cox
  2001-02-05 19:09                                 ` Linus Torvalds
  1 sibling, 0 replies; 186+ messages in thread
From: Alan Cox @ 2001-02-05 19:04 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary.  They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset.  Every video framebuffer

Start/length per page?

> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Video is non-contiguous ranges. In fact, if you are blitting to a card with
tiled memory, it gets very interesting in its video lists.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 16:36                     ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Linus Torvalds
@ 2001-02-05 19:08                       ` Stephen C. Tweedie
  0 siblings, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox

Hi,

On Mon, Feb 05, 2001 at 08:36:31AM -0800, Linus Torvalds wrote:

> Have you ever thought about other things, like networking, special
> devices, stuff like that? They can (and do) have packet boundaries that
> have nothing to do with pages what-so-ever. They can have such notions as
> packets that contain multiple streams in one packet, where it ends up
> being split up into several pieces. Where neither the original packet
> _nor_ the final pieces have _anything_ to do with "pages".
> 
> THERE IS NO PAGE ALIGNMENT.

And kiobufs don't require IO to be page aligned, and they have never
done.  The only page alignment they assume is that if a *single*
scatter-gather element spans multiple pages, then the joins between
those pages occur on page boundaries.

Remember, a kiobuf is only designed to represent one scatter-gather
fragment, not a full sg list.  That was the whole reason for having a
kiovec as a separate concept: if you have more than one independent
fragment in the sg-list, you need more than one kiobuf.

And the reason why we created sg fragments which can span pages was so
that we can encode IOs which interact with the VM: any arbitrary
virtually-contiguous user data buffer can be mapped into a *single*
kiobuf for a write() call, so it's a generic way of supporting things
like O_DIRECT without the IO layers having to know anything about VM
(and Ben's async IO patches also use kiobufs in this way to allow
read()s to write to the user's data buffer once the IO completes,
without having to have a context switch back into that user's
context.)  Similarly, any extent of a file in the page cache can be
encoded in a single kiobuf.
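
(As a rough sketch of what that mapping looks like from the caller's
side, using the 2.4-era helpers from fs/iobuf.c -- signatures from
memory, error handling omitted:

	struct kiobuf *iobuf;

	alloc_kiovec(1, &iobuf);		/* allocate one kiobuf */
	map_user_kiobuf(READ, iobuf,		/* pin the user pages */
			(unsigned long) ubuf, len);
	/* ... build and submit the IO, wait for end_io ... */
	unmap_kiobuf(iobuf);
	free_kiovec(1, &iobuf);

so one virtually-contiguous user buffer becomes exactly one kiobuf.)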

And no, the simpler networking-style sg-list does not cut it for block
device IO, because for block devices, we want to have separate
completion status made available for each individual sg fragment in
the IO.  *That* is why the kiobuf is more heavyweight than the
networking variant: each fragment [kiobuf] in the scatter-gather list
[kiovec] has its own completion information.  

If we have a bunch of separate data buffers queued for sequential disk
IO as a single request, then we still want things like readahead and
error handling to work.  That means that we want the first kiobuf in
the chain to get its completion wakeup as soon as that segment of the
IO is complete, without having to wait for the remaining sectors of
the IO to be transferred.  It also means that if we've done something
like split the IO over a raid stripe, then when an error occurs, we
still want to know which of the callers' buffers succeeded and which
failed.

Yes, I agree that the original kiovec mechanism of using a *kiobuf[]
array to assemble the scatter-gather fragments sucked.  But I don't
believe that just throwing away the concept of kiobuf as an sg-fragment
will work either when it comes to disk IOs: the need for per-fragment
completion is too compelling.  I'd rather shift to allowing kiobufs to
be assembled into linked lists for IO to avoid *kiobuf[] vectors, in
just the same way that we currently chain buffer_heads for IO.  
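
(Illustratively -- no such chain field exists today:

	struct kiobuf {
		...
		/* existing per-fragment completion callback: */
		void (*end_io)(struct kiobuf *);
		/* hypothetical chain link, a la bh->b_reqnext: */
		struct kiobuf *io_next;
	};

with the request layer walking the io_next chain instead of being
handed a *kiobuf[] array.)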

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 18:49                               ` Stephen C. Tweedie
  2001-02-05 19:04                                 ` Alan Cox
@ 2001-02-05 19:09                                 ` Linus Torvalds
  2001-02-05 19:16                                   ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox
  1 sibling, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-05 19:09 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, Manfred Spraul, Christoph Hellwig, Steve Lord,
	linux-kernel, kiobuf-io-devel



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> > Thats true for _block_ disk devices but if we want a generic kiovec then
> > if I am going from video capture to network I dont need to force anything more
> > than 4 byte align
> 
> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary.  They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset.  Every video framebuffer
> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Stop this idiocy, Stephen. You're _this_ close to being the first person I
ever blacklist from my mailbox.

Network. Packets. Fragmentation. Or just non-page-sized MTUs.

It is _not_ a "series of contiguous pages". Never has been. Never will be.
So stop making excuses.

Also, think of protocols that may want to gather stuff from multiple
places, where the boundaries have little to do with pages but are
specified some other way. Imagine doing "writev()" style operations to
disk, gathering stuff from multiple sources into one operation.

Think of GART remappings - you can have multiple pages that show up as one
"linear" chunk to the graphics device behind the AGP bridge, but that are
_not_ contiguous in real memory.

There just is NO excuse for the "linear series of pages" view. And if you
cannot realize that, then I don't know what's wrong with you. Your
arguments are obviously crap, and the stuff you seem unable to argue
against (like networking) you decide to just ignore. Get your act
together.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 19:09                                 ` Linus Torvalds
@ 2001-02-05 19:16                                   ` Alan Cox
  2001-02-05 19:28                                     ` Linus Torvalds
  0 siblings, 1 reply; 186+ messages in thread
From: Alan Cox @ 2001-02-05 19:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Alan Cox, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

> Stop this idiocy, Stephen. You're _this_ close to being the first person I
> ever blacklist from my mailbox.

I think I've just figured out what the miscommunication is around here:

kiovecs can describe arbitrary scatter-gather;

it's just that they can also cleanly describe the common case of contiguous
pages in one entry.

After all, a sub-page block is simply a contiguous range within one page.

Alan


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 19:16                                   ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox
@ 2001-02-05 19:28                                     ` Linus Torvalds
  2001-02-05 20:54                                       ` Stephen C. Tweedie
  2001-02-06  0:31                                       ` Roman Zippel
  0 siblings, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-05 19:28 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel



On Mon, 5 Feb 2001, Alan Cox wrote:

> > Stop this idiocy, Stephen. You're _this_ close to being the first person I
> > ever blacklist from my mailbox.
> 
> I think I've just figured out what the miscommunication is around here
> 
> kiovecs can describe arbitrary scatter-gather

I know. But they are entirely useless for anything that requires low
latency handling. They are big, bloated, and slow. 

It is also an example of layering gone horribly horribly wrong.

The _vectors_ are needed at the very lowest levels: the levels that do not
necessarily have to worry at all about completion notification etc. You
want the arbitrary scatter-gather vectors passed down to the stuff that
sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
high-level semantics.

This all proves that the lowest level of layering should be pretty much
nothing but the vectors. No callbacks, no crap like that. That's already a
level of abstraction away, and should not get tacked on. Your lowest level
of abstraction should be just the "area". Something like

	struct buffer {
		struct page *page;
		u16 offset, length;
	};

	int nr_buffers;
	struct buffer *array;

should be the low-level abstraction. 

And on top of _that_ you build a more complex entity (so a "kiobuf" would
be defined not just by the memory area, but by the operation you want to
do on it, and the callback on completion etc).
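
(Sketching that layering out -- the names here are purely illustrative:

	/* lowest level: just the memory */
	struct buffer_vec {
		int nr_buffers;
		struct buffer *array;
	};

	/* higher level: the operation built on top of it */
	struct io_request {
		struct buffer_vec vec;
		int rw;
		void (*complete)(struct io_request *, int error);
		void *private;
	};

-- with the callback living only in the higher-level object.)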

Currently kiobufs do it the other way around: you can build up an array,
but only by having the overhead of passing kiovec's around - ie you have
to pass the _highest_ level of abstraction around just to get the lowest
level of details. That's wrong.

And that wrongness comes _exactly_ from Stephen's opinion that the
fundamental IO entity is an array of contiguous pages. 

And, btw, this is why the networking layer will never be able to use
kiobufs.

Which makes kiobufs as they stand now basically useless for anything but
some direct disk stuff. And I'd rather work on making the low-level disk
drivers use something saner.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 19:28                                     ` Linus Torvalds
@ 2001-02-05 20:54                                       ` Stephen C. Tweedie
  2001-02-05 21:08                                         ` David Lang
                                                           ` (2 more replies)
  2001-02-06  0:31                                       ` Roman Zippel
  1 sibling, 3 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote:

> The _vectors_ are needed at the very lowest levels: the levels that do not
> necessarily have to worry at all about completion notification etc. You
> want the arbitrary scatter-gather vectors passed down to the stuff that
> sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
> high-level semantics.

OK, this is exactly where we have a problem: I can see too many cases
where we *do* need to know about completion stuff at a fine
granularity when it comes to disk IO (unlike network IO, where we can
usually rely on a caller doing retransmit at some point in the stack).

If we are doing readahead, we want completion callbacks raised as soon
as possible on IO completions, no matter how many other IOs have been
merged with the current one.  More importantly though, when we are
merging multiple page or buffer_head IOs in a request, we want to know
exactly which buffer/page contents are valid and which are not once
the IO completes.

The current request struct's buffer_head list provides that quite
naturally, but is a hugely heavyweight way of performing large IOs.
What I'm really after is a way of sending IOs to make_request in such
a way that if the caller provides an array of buffer_heads, it gets
back completion information on each one, but if the IO is requested in
large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then
the request code can deal with it in those large chunks.

What worries me is things like the soft raid1/5 code: pretending that
we can skimp on the return information about which blocks were
transferred successfully and which were not sounds like a really bad
idea when you've got a driver which relies on that completion
information in order to do intelligent error recovery.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 20:54                                       ` Stephen C. Tweedie
@ 2001-02-05 21:08                                         ` David Lang
  2001-02-05 21:51                                         ` Alan Cox
  2001-02-06  0:07                                         ` Stephen C. Tweedie
  2 siblings, 0 replies; 186+ messages in thread
From: David Lang @ 2001-02-05 21:08 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

So you have two concepts in one here:

1. SG items that can be more than a single page

2. a container for #1 that includes details for completion callbacks, etc.

It looks like Linus is objecting to having both in the same structure and
then using that structure as your generic low-level bucket.

Define these as two separate structures: the #1 structure may now be
lightweight enough to be used for networking and other functions, and when
you go to use it with disk IO you then wrap it in the #2 structure. This
still lets you have the completion callbacks at as low a level as you
want; you just have to explicitly add this layer when it makes sense.
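
(A sketch of that split -- names illustrative only:

	/* #1: a bare SG item; may cover more than one page */
	struct sg_frag {
		struct page **pages;	/* joins fall on page boundaries */
		int nr_pages;
		int offset;		/* offset into the first page */
		int length;		/* total length of the fragment */
	};

	/* #2: the disk-IO container that wraps an array of #1 */
	struct disk_io {
		int nr_frags;
		struct sg_frag *frags;
		void (*end_io)(struct disk_io *, int uptodate);
		void *private;
	};

Networking could consume bare sg_frag arrays; block IO would allocate
the wrapper and get its completion callbacks there.)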

David Lang



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> Date: Mon, 5 Feb 2001 20:54:29 +0000
> From: Stephen C. Tweedie <sct@redhat.com>
> To: Linus Torvalds <torvalds@transmeta.com>
> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>, Stephen C. Tweedie <sct@redhat.com>,
>      Manfred Spraul <manfred@colorfullife.com>,
>      Christoph Hellwig <hch@caldera.de>, Steve Lord <lord@sgi.com>,
>      linux-kernel@vger.kernel.org, kiobuf-io-devel@lists.sourceforge.net
> Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
>
> Hi,
>
> On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote:
>
> > The _vectors_ are needed at the very lowest levels: the levels that do not
> > necessarily have to worry at all about completion notification etc. You
> > want the arbitrary scatter-gather vectors passed down to the stuff that
> > sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
> > high-level semantics.
>
> OK, this is exactly where we have a problem: I can see too many cases
> where we *do* need to know about completion stuff at a fine
> granularity when it comes to disk IO (unlike network IO, where we can
> usually rely on a caller doing retransmit at some point in the stack).
>
> If we are doing readahead, we want completion callbacks raised as soon
> as possible on IO completions, no matter how many other IOs have been
> merged with the current one.  More importantly though, when we are
> merging multiple page or buffer_head IOs in a request, we want to know
> exactly which buffer/page contents are valid and which are not once
> the IO completes.
>
> The current request struct's buffer_head list provides that quite
> naturally, but is a hugely heavyweight way of performing large IOs.
> What I'm really after is a way of sending IOs to make_request in such
> a way that if the caller provides an array of buffer_heads, it gets
> back completion information on each one, but if the IO is requested in
> large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then
> the request code can deal with it in those large chunks.
>
> What worries me is things like the soft raid1/5 code: pretending that
> we can skimp on the return information about which blocks were
> transferred successfully and which were not sounds like a really bad
> idea when you've got a driver which relies on that completion
> information in order to do intelligent error recovery.
>
> Cheers,
>  Stephen
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> Please read the FAQ at http://www.tux.org/lkml/
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 12:19                   ` Stephen C. Tweedie
@ 2001-02-05 21:28                     ` Ingo Molnar
  2001-02-05 22:58                       ` Stephen C. Tweedie
  0 siblings, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-05 21:28 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox, Linus Torvalds


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> And no, the IO success is *not* necessarily sequential from the start
> of the IO: if you are doing IO to raid0, for example, and the IO gets
> striped across two disks, you might find that the first disk gets an
> error so the start of the IO fails but the rest completes.  It's the
> completion code which notifies the caller of what worked and what did
> not.

It's exactly these 'compound' structures I'm vehemently against. I do
think it's a design nightmare. I can picture these monster kiobufs
complicating the whole code for no good reason - we couldn't even get the
bh-list code in block_device.c right - why do you think kiobufs *all
across the kernel* will be any better?

RAID0 is not an issue. Split it up, use separate kiobufs for every
different disk. We need simple constructs - I do not understand why nobody
sees that these big fat monster-trucks of IO workload are *trouble*. They
keep things localized, instead of putting workload components into the
system immediately. We'll have performance bugs nobody has seen before.
Bhs have one very nice property: they are simple, modularized. I think
this is like CISC vs. RISC: CISC designs ended up splitting 'fat
instructions' up into RISC-like instructions.

Fragmented skbs are a different matter: they are simply a slightly more
generic abstraction of 'memory buffer'. Clear goal, clear solution. I do not
think kiobufs have clear goals.

And I do not buy the performance arguments. In 2.4.1 we improved block-IO
performance dramatically by fixing high-load IO scheduling. Write
performance suddenly improved dramatically; there is a 30-40% improvement
in dbench performance. To put it another way: *we needed 5 years to fix a
serious IO-subsystem performance bug*. Block IO was already too complex -
and Alex & Andrea have done a nice job streamlining and cleaning it up for
2.4. We should simplify it further - and optimize the components, instead
of bringing in yet another *big* complication into the API.

And what is the goal of having multi-page kiobufs? To avoid having to do
multiple function calls via a simpler interface? Shouldn't we optimize that
codepath instead?

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 20:54                                       ` Stephen C. Tweedie
  2001-02-05 21:08                                         ` David Lang
@ 2001-02-05 21:51                                         ` Alan Cox
  2001-02-06  0:07                                         ` Stephen C. Tweedie
  2 siblings, 0 replies; 186+ messages in thread
From: Alan Cox @ 2001-02-05 21:51 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Stephen C. Tweedie, Manfred Spraul,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

> OK, this is exactly where we have a problem: I can see too many cases
> where we *do* need to know about completion stuff at a fine
> granularity when it comes to disk IO (unlike network IO, where we can
> usually rely on a caller doing retransmit at some point in the stack).

Ok, so what's wrong with embedding kiovecs into something bigger: one
kmalloc can allocate two arrays, one of buffers (shared with networking etc.)
followed by a second of block IO completion data.

Now you can also kind of cast from the bigger to the smaller object and get
the right result, since the kiovec array is the start of the combined
allocation.
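
(Roughly, and with error handling omitted:

	kiovec_t *vec;
	struct io_completion *comp;

	/* one allocation: kiovec entries first, completions after */
	vec = kmalloc(nr * (sizeof(*vec) + sizeof(*comp)), GFP_KERNEL);
	comp = (struct io_completion *)(vec + nr);

-- and since the kiovec array sits at offset zero, code that only
understands kiovecs can be handed the very same pointer.)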


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 15:03                       ` Stephen C. Tweedie
  2001-02-05 15:19                         ` Alan Cox
@ 2001-02-05 22:09                         ` Ingo Molnar
  1 sibling, 0 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-05 22:09 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Manfred Spraul, Linus Torvalds, Christoph Hellwig, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Alan Cox


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> > Obviously the disk access itself must be sector aligned and the total
> > length must be a multiple of the sector length, but there shouldn't be
> > any restrictions on the data buffers.
>
> But there are. Many controllers just break down and corrupt things
> silently if you don't align the data buffers (Jeff Merkey found this
> by accident when he started generating unaligned IOs within page
> boundaries in his NWFS code). And a lot of controllers simply cannot
> break a sector dma over a page boundary (at least not without some
> form of IOMMU remapping).

so we are putting workarounds for hardware bugs into the design?

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 21:28                     ` Ingo Molnar
@ 2001-02-05 22:58                       ` Stephen C. Tweedie
  2001-02-05 23:06                         ` Alan Cox
  2001-02-06  0:19                         ` Manfred Spraul
  0 siblings, 2 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 22:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox, Linus Torvalds

Hi,

On Mon, Feb 05, 2001 at 10:28:37PM +0100, Ingo Molnar wrote:
> 
> On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> 
> it's exactly these 'compound' structures i'm vehemently against. I do
> think it's a design nightmare. I can picture these monster kiobufs
> complicating the whole code for no good reason - we couldnt even get the
> bh-list code in block_device.c right - why do you think kiobufs *all
> across the kernel* will be any better?
> 
> RAID0 is not an issue. Split it up, use separate kiobufs for every
> different disk.

Umm, that's not the point --- of course you can use separate kiobufs
for the communication between raid0 and the underlying disks, but what
do you then tell the application _above_ raid0 if one of the
underlying IOs succeeds and the other fails halfway through?

And what about raid1?  Are you really saying that raid1 doesn't need
to know which blocks succeeded and which failed?  That's the level of
completion information I'm worrying about at the moment.

> fragmented skbs are a different matter: they are simply a bit more generic
> abstractions of 'memory buffer'. Clear goal, clear solution. I do not
> think kiobufs have clear goals.

The goal: allow arbitrary IOs to be pushed down through the stack in
such a way that the callers can get meaningful information back about
what worked and what did not.  If the write was a 128kB raw IO, then
you obviously get coarse granularity of completion callback.  If the
write was a series of independent pages which happened to be
contiguous on disk, you actually get told which pages hit disk and
which did not.

> and what is the goal of having multi-page kiobufs. To avoid having to do
> multiple function calls via a simpler interface? Shouldnt we optimize that
> codepath instead?

The original multi-page buffers came from the map_user_kiobuf
interface: they represented a user data buffer.  I'm not wedded to
that format --- we can happily replace it with a fine-grained sg list
--- but the reason they have been pushed so far down the IO stack is
the need for accurate completion information on the originally
requested IOs.

In other words, even if we expand the kiobuf into a sg vector list,
when it comes to merging requests in ll_rw_blk.c we still need to
track the callbacks on each independent source kiobufs.  

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 22:58                       ` Stephen C. Tweedie
@ 2001-02-05 23:06                         ` Alan Cox
  2001-02-05 23:16                           ` Stephen C. Tweedie
  2001-02-06  0:19                         ` Manfred Spraul
  1 sibling, 1 reply; 186+ messages in thread
From: Alan Cox @ 2001-02-05 23:06 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox, Linus Torvalds

> do you then tell the application _above_ raid0 if one of the
> underlying IOs succeeds and the other fails halfway through?

struct thingy
{
	u32 flags;	/* because everything needs flags */
	struct io_completion *completions;
	kiovec_t sglist[0];
};

Now kmalloc one object sized for the header, the sglist of the right size,
and the completion list. Shove the completion list on the end of it as
another array of objects, and what is the problem?
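
(i.e., with the flexible sglist[] at the end, roughly:

	struct thingy *t;

	t = kmalloc(sizeof(*t) + nr * sizeof(kiovec_t)
			       + nr * sizeof(struct io_completion),
		    GFP_KERNEL);
	t->completions = (struct io_completion *)&t->sglist[nr];

-- one allocation, with the completion array tacked on behind the sg
entries.)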

> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobufs.  

But that can be two arrays

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 23:06                         ` Alan Cox
@ 2001-02-05 23:16                           ` Stephen C. Tweedie
  0 siblings, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 23:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, Ingo Molnar, Steve Lord, linux-kernel,
	kiobuf-io-devel, Linus Torvalds

Hi,

On Mon, Feb 05, 2001 at 11:06:48PM +0000, Alan Cox wrote:
> > do you then tell the application _above_ raid0 if one of the
> > underlying IOs succeeds and the other fails halfway through?
> 
> struct thingy
> {
> 	u32 flags;	/* because everything needs flags */
> 	struct io_completion *completions;
> 	kiovec_t sglist[0];
> };
> 
> Now kmalloc one object sized for the header, the sglist of the right size,
> and the completion list. Shove the completion list on the end of it as
> another array of objects, and what is the problem?

XFS uses both small metadata items in the buffer cache and large
pagebufs.  You may have merged a 512-byte read with a large pagebuf
read: one completion callback is associated with a single sg fragment,
the next callback belongs to a dozen different fragments.  Associating
the two lists becomes non-trivial, although it could be done.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 20:54                                       ` Stephen C. Tweedie
  2001-02-05 21:08                                         ` David Lang
  2001-02-05 21:51                                         ` Alan Cox
@ 2001-02-06  0:07                                         ` Stephen C. Tweedie
  2001-02-06 17:00                                           ` Christoph Hellwig
  2 siblings, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-06  0:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise,
	Ingo Molnar

Hi,

OK, if we take a step back, what does this look like:

On Mon, Feb 05, 2001 at 08:54:29PM +0000, Stephen C. Tweedie wrote:
> 
> If we are doing readahead, we want completion callbacks raised as soon
> as possible on IO completions, no matter how many other IOs have been
> merged with the current one.  More importantly though, when we are
> merging multiple page or buffer_head IOs in a request, we want to know
> exactly which buffer/page contents are valid and which are not once
> the IO completes.

This is the current situation.  If the page cache submits a 64K IO to
the block layer, it does so in pieces, and then expects to be told on
return exactly which pages succeeded and which failed.

That's where the mess of having multiple completion objects in a
single IO request comes from.  Can we just forbid this case?

That's the short cut that SGI's kiobuf block dev patches do when they
get kiobufs: they currently deal with either buffer_heads or kiobufs
in struct requests, but they don't merge kiobuf requests.  (XFS
already clusters the IOs for them in that case.)

Is that a realistic basis for a cleaned-up ll_rw_blk.c?

It implies that the caller has to do IO merging.  For read, that's not
much pain, as the most important case --- readahead --- is already
done in a generic way which could submit larger IOs relatively easily.
It would be harder for writes, but high-level write clustering code
has already been started.

It implies that for any IO, on IO failure you don't get told which
part of the IO failed.  That adds code to the caller: the page cache
would have to retry per-page to work out which pages are readable and
which are not.  It means that for soft raid, you don't get told which
blocks are bad if a stripe has an error anywhere.  Ingo, is that a
potential problem?

But it gives very, very simple semantics to the request layer: single
IOs go in (with a completion callback and a single scatter-gather
list), and results go back with success or failure.

With that change, it becomes _much_ more natural to push a simple sg
list down through the disk layers.
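
(As an interface sketch -- purely illustrative, nothing like this
exists today:

	int submit_io(kdev_t dev, int rw, unsigned long sector,
		      struct buffer *vec, int nr_buffers,
		      void (*done)(void *cookie, int error),
		      void *cookie);

one scatter-gather list and one completion callback in, one
success-or-failure result out.)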

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify  + callback chains
  2001-02-05 22:58                       ` Stephen C. Tweedie
  2001-02-05 23:06                         ` Alan Cox
@ 2001-02-06  0:19                         ` Manfred Spraul
  1 sibling, 0 replies; 186+ messages in thread
From: Manfred Spraul @ 2001-02-06  0:19 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox,
	Linus Torvalds

"Stephen C. Tweedie" wrote:
> 
> The original multi-page buffers came from the map_user_kiobuf
> interface: they represented a user data buffer.  I'm not wedded to
> that format --- we can happily replace it with a fine-grained sg list
>
Could you change that interface?

<<< from Linus mail:

        struct buffer {
                struct page *page;
                u16 offset, length;
        };

>>>>>>

/* returns the number of used buffers, or <0 on error */
int map_user_buffer(struct buffer *ba, int max_bcount,
			void* addr, int len);
void unmap_buffer(struct buffer *ba, int bcount);

That's enough for the zero copy pipe code ;-)
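
(Usage would presumably look something like this -- illustrative only,
with ubuf/len being the user address and length to pin:

	struct buffer ba[16];
	int n;

	n = map_user_buffer(ba, 16, ubuf, len);
	if (n < 0)
		return n;
	/* ... hand ba[0..n-1] to the consumer ... */
	unmap_buffer(ba, n);

-- small, stack-allocated, and no kiobuf in sight.)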

Real hw drivers probably need a replacement for pci_map_single()
(pci_map_and_align_and_bounce_buffer_array())

The kiobuf structure could contain these 'struct buffer' instead of the
current 'struct page' pointers.

> 
> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobufs.
>
Probably.


--
	Manfred

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 19:28                                     ` Linus Torvalds
  2001-02-05 20:54                                       ` Stephen C. Tweedie
@ 2001-02-06  0:31                                       ` Roman Zippel
  2001-02-06  1:01                                         ` Linus Torvalds
  2001-02-06  1:08                                         ` David S. Miller
  1 sibling, 2 replies; 186+ messages in thread
From: Roman Zippel @ 2001-02-06  0:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> This all proves that the lowest level of layering should be pretty much
> noting but the vectors. No callbacks, no crap like that. That's already a
> level of abstraction away, and should not get tacked on. Your lowest level
> of abstraction should be just the "area". Something like
> 
> 	struct buffer {
> 		struct page *page;
> 		u16 offset, length;
> 	};
> 
> 	int nr_buffers;
> 	struct buffer *array;
> 
> should be the low-level abstraction. 

Does it have to be vectors? What about lists? I've been thinking about this
for some time now, and I think lists are more flexible. At a higher level we
can easily generate a list of pages, and at a lower level you can still split
them up as needed. It would be basically the same structure, but you could
use it everywhere with the same kind of operations.

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  0:31                                       ` Roman Zippel
@ 2001-02-06  1:01                                         ` Linus Torvalds
  2001-02-06  9:22                                           ` Roman Zippel
  2001-02-06  9:30                                           ` Ingo Molnar
  2001-02-06  1:08                                         ` David S. Miller
  1 sibling, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06  1:01 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel



On Tue, 6 Feb 2001, Roman Zippel wrote:
> > 
> > 	int nr_buffers;
> > 	struct buffer *array;
> > 
> > should be the low-level abstraction. 
> 
> Does it have to be vectors? What about lists?

I'd prefer to avoid lists unless there is some overriding concern, like a
real implementation issue. But I don't care much one way or the other -
what I care about is that the setup and usage time is as low as possible.
I suspect arrays are better for that.

I have this strong suspicion that networking is going to be the most
latency-critical and complex part of this, and the fact that the
networking code wanted arrays is what makes me think that arrays are the
right way to go. But talk to Davem and ank about why they wanted vectors.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  0:31                                       ` Roman Zippel
  2001-02-06  1:01                                         ` Linus Torvalds
@ 2001-02-06  1:08                                         ` David S. Miller
  1 sibling, 0 replies; 186+ messages in thread
From: David S. Miller @ 2001-02-06  1:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Zippel, Alan Cox, Stephen C. Tweedie, Manfred Spraul,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel


Linus Torvalds writes:
 > But talk to Davem and ank about why they wanted vectors.

SKB setup and free needs to be as light as possible.
Using vectors leads to code like:

skb_data_free(...)
{
...
	for (i = 0; i < MAX_SKB_FRAGS; i++)
		put_page(skb_shinfo(skb)->frags[i].page);
}

Currently, the ZC patches have a fixed frag vector size
(MAX_SKB_FRAGS).  But a part of me wants this to be
made dynamic (to handle HIPPI etc. properly) whereas
another part of me doesn't want to do it that way because
it would increase the complexity of paged SKB handling
and add yet another member to the SKB structure.
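
(For reference, the paged-skb layout in the ZC patches looks roughly
like this, abridged:

	typedef struct skb_frag_struct {
		struct page	*page;
		__u16		page_offset;
		__u16		size;
	} skb_frag_t;

	struct skb_shared_info {
		atomic_t	dataref;
		unsigned int	nr_frags;
		skb_frag_t	frags[MAX_SKB_FRAGS];
	};

-- a fixed-size frag array allocated along with the skb itself, so no
extra allocation is needed per packet.)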

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  1:01                                         ` Linus Torvalds
@ 2001-02-06  9:22                                           ` Roman Zippel
  2001-02-06  9:30                                           ` Ingo Molnar
  1 sibling, 0 replies; 186+ messages in thread
From: Roman Zippel @ 2001-02-06  9:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> > Does it have to be vectors? What about lists?
> 
> I'd prefer to avoid lists unless there is some overriding concern, like a
> real implementation issue. But I don't care much one way or the other -
> what I care about is that the setup and usage time is as low as possible.
> I suspect arrays are better for that.

I was thinking more about the higher layers. Here it's simpler to set up a
list of pages which can be sent to a lower layer. In the page cache we
already have per-address-space lists, so it would be very easy to use
that. A lower layer can of course generate anything it wants out of this,
e.g. it can generate sublists or vectors.

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  1:01                                         ` Linus Torvalds
  2001-02-06  9:22                                           ` Roman Zippel
@ 2001-02-06  9:30                                           ` Ingo Molnar
  1 sibling, 0 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06  9:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Zippel, Alan Cox, Stephen C. Tweedie, Manfred Spraul,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel


On Mon, 5 Feb 2001, Linus Torvalds wrote:

> [...] But talk to Davem and ank about why they wanted vectors.

One issue is allocation overhead. The fragment array is a natural and
constant-size part of an skb, thus we get all the control structures in
place while allocating a structure that we have to allocate anyway.

Another issue is that certain cards have (or can have) SG limits, so we
have to be prepared to have a 'limited' array of fragments anyway, and
have to be prepared to split/refragment packets. Whether there is a global
MAX_SKB_FRAGS limit or not makes no difference.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  0:07                                         ` Stephen C. Tweedie
@ 2001-02-06 17:00                                           ` Christoph Hellwig
  2001-02-06 17:05                                             ` Stephen C. Tweedie
  0 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-06 17:00 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar

On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote:
> This is the current situation.  If the page cache submits a 64K IO to
> the block layer, it does so in pieces, and then expects to be told on
> return exactly which pages succeeded and which failed.
> 
> That's where the mess of having multiple completion objects in a
> single IO request comes from.  Can we just forbid this case?
> 
> That's the short cut that SGI's kiobuf block dev patches do when they
> get kiobufs: they currently deal with either buffer_heads or kiobufs
> in struct requests, but they don't merge kiobuf requests.

IIRC Jens Axboe has done some work on merging kiobuf-based requests.

> (XFS already clusters the IOs for them in that case.)
> 
> Is that a realistic basis for a cleaned-up ll_rw_blk.c?

I don't think so.  If we minimize the state in the IO container object,
the lower levels could split them as they see fit, and the IO completion
function just has to handle the case that it might be called for a smaller
object.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:00                                           ` Christoph Hellwig
@ 2001-02-06 17:05                                             ` Stephen C. Tweedie
  2001-02-06 17:14                                               ` Jens Axboe
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-06 17:05 UTC (permalink / raw)
  To: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise,
	Ingo Molnar

Hi,

On Tue, Feb 06, 2001 at 06:00:58PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote:
> > 
> > Is that a realistic basis for a cleaned-up ll_rw_blk.c?
> 
> I don't think so.  If we minimize the state in the IO container object,
> the lower levels could split them as they see fit, and the IO completion
> function just has to handle the case that it might be called for a smaller
> object.

The whole point of the post was that it is merging, not splitting,
which is troublesome.  How are you going to merge requests without
having chains of scatter-gather entities each with their own
completion callbacks?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:05                                             ` Stephen C. Tweedie
@ 2001-02-06 17:14                                               ` Jens Axboe
  2001-02-06 17:22                                               ` Christoph Hellwig
  2001-02-06 17:37                                               ` Ben LaHaise
  2 siblings, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 17:14 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar

On Tue, Feb 06 2001, Stephen C. Tweedie wrote:
> > I don't think so.  If we minimize the state in the IO container object,
> > the lower levels could split them as they see fit, and the IO completion
> > function just has to handle the case that it might be called for a smaller
> > object.
> 
> The whole point of the post was that it is merging, not splitting,
> which is troublesome.  How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

You can't; the stuff I played with turned out to be horrible. At
least with the current kiobuf I/O stuff, merging will have to be
done before it's submitted. And IMO we don't want to lose the
ability to cluster buffers and requests in ll_rw_blk.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:05                                             ` Stephen C. Tweedie
  2001-02-06 17:14                                               ` Jens Axboe
@ 2001-02-06 17:22                                               ` Christoph Hellwig
  2001-02-06 18:26                                                 ` Stephen C. Tweedie
  2001-02-06 17:37                                               ` Ben LaHaise
  2 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-06 17:22 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar

On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote:
> The whole point of the post was that it is merging, not splitting,
> which is troublesome.  How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

The object passed down to the low-level driver just needs to be able
to contain multiple end-io callbacks.  The decision what to call when
some of the scatter-gather entities fail is of course not so easy to
handle and needs further discussion.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:05                                             ` Stephen C. Tweedie
  2001-02-06 17:14                                               ` Jens Axboe
  2001-02-06 17:22                                               ` Christoph Hellwig
@ 2001-02-06 17:37                                               ` Ben LaHaise
  2001-02-06 18:00                                                 ` Jens Axboe
                                                                   ` (2 more replies)
  2 siblings, 3 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 17:37 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ingo Molnar

Hey folks,

On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:

> The whole point of the post was that it is merging, not splitting,
> which is troublesome.  How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

Let me just emphasize what Stephen is pointing out: if requests are
properly merged at higher layers, then merging is neither required nor
desired.  Traditionally, ext2 has not done merging because the underlying
system doesn't support it.  This leads to rather convoluted code for
readahead which doesn't result in appropriately merged requests on
indirect block boundaries, and in fact leads to suboptimal performance.
The only case I see where merging of requests can improve things is when
dealing with lots of small files.  But we already know that small files
need to be treated differently (e.g. tail merging).  Besides, most of the
benefit of merging can be had by doing readaround for these small files.

As for IO completion, can't we just issue separate requests for the
critical data and the readahead?  That way for SCSI disks, the important
IO should be finished while the readahead can continue.  Thoughts?

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:37                                               ` Ben LaHaise
@ 2001-02-06 18:00                                                 ` Jens Axboe
  2001-02-06 18:09                                                   ` Ben LaHaise
  2001-02-06 18:14                                                 ` Linus Torvalds
  2001-02-06 18:18                                                 ` Ingo Molnar
  2 siblings, 1 reply; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 18:00 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

On Tue, Feb 06 2001, Ben LaHaise wrote:
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome.  How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
> 
> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired.  Traditionally, ext2 has not done merging because the underlying
> system doesn't support it.  This leads to rather convoluted code for
> readahead which doesn't result in appropriately merged requests on
> indirect block boundaries, and in fact leads to suboptimal performance.
> The only case I see where merging of requests can improve things is when
> dealing with lots of small files.  But we already know that small files
> need to be treated differently (e.g. tail merging).  Besides, most of the
> benefit of merging can be had by doing readaround for these small files.

Stephen already covered this point, the merging is not a problem
to deal with for read-ahead. The underlying system can easily
queue that in nice big chunks. Delayed allocation makes it
easier to flush big chunks as well. I seem to recall the xfs people
having problems with the lack of merging causing a performance hit
on smaller I/O.

Of course merging doesn't have to happen in ll_rw_blk.

> As for IO completion, can't we just issue separate requests for the
> critical data and the readahead?  That way for SCSI disks, the important
> IO should be finished while the readahead can continue.  Thoughts?

Priorities?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:00                                                 ` Jens Axboe
@ 2001-02-06 18:09                                                   ` Ben LaHaise
  2001-02-06 19:35                                                     ` Jens Axboe
  0 siblings, 1 reply; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 18:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Jens Axboe wrote:

> Stephen already covered this point, the merging is not a problem
> to deal with for read-ahead. The underlying system can easily

I just wanted to make sure that was clear =)

> queue that in nice big chunks. Delayed allocation makes it
> easier to flush big chunks as well. I seem to recall the xfs people
> having problems with the lack of merging causing a performance hit
> on smaller I/O.

That's where readaround buffers come into play.  If we have a fixed number
of readaround buffers that are used when small ios are issued, they should
provide a low overhead means of substantially improving things like find
(which reads many nearby inodes out of order but sequentially).  I need to
implement this and get cache hit rates for various workloads. ;-)

> Of course merging doesn't have to happen in ll_rw_blk.
>
> > As for io completion, can't we just issue separate requests for the
> > critical data and the readahead?  That way for SCSI disks, the important
> > io should be finished while the readahead can continue.  Thoughts?
>
> Priorities?

Definitely.  I'd like to be able to issue readaheads with a "don't bother
executing this request unless the cost is low" bit set.  It might also
be helpful for heavy multiuser loads (or even a single user with multiple
processes) to ensure progress is made for others.

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:37                                               ` Ben LaHaise
  2001-02-06 18:00                                                 ` Jens Axboe
@ 2001-02-06 18:14                                                 ` Linus Torvalds
  2001-02-08 11:21                                                   ` Andi Kleen
  2001-02-08 14:11                                                   ` Martin Dalecki
  2001-02-06 18:18                                                 ` Ingo Molnar
  2 siblings, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 18:14 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ben LaHaise wrote:
> 
> On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> 
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome.  How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
> 
> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired.

I will claim that you CANNOT merge at higher levels and get good
performance.

Sure, you can do read-ahead, and try to get big merges that way at a high
level. Good for you.

But you'll have a bitch of a time trying to merge multiple
threads/processes reading from the same area on disk at roughly the same
time. Your higher levels won't even _know_ that there is merging to be
done until the IO requests hit the wall in waiting for the disk.

Quite frankly, this whole discussion sounds worthless. We have solved this
problem already: it's called a "buffer head". Deceptively simple at higher
levels, and lower levels can easily merge them together into chains and do
fancy scatter-gather structures of them that can be dynamically extended
at any time.

The buffer heads together with "struct request" do a hell of a lot more
than just a simple scatter-gather: it's able to create ordered lists of
independent sg-events, together with full call-backs etc. They are
low-cost, fairly efficient, and they have worked beautifully for years. 
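
For concreteness, an abridged sketch of the fields that matter on that
path - from memory of the 2.4-era declaration in include/linux/fs.h,
with the field order and the caching-side members elided, so treat it
as a sketch rather than the real header:

struct buffer_head {
	unsigned long b_blocknr;	/* logical block number */
	unsigned short b_size;		/* length of the one memory area */
	kdev_t b_rdev;			/* real device the IO goes to */
	unsigned long b_state;		/* BH_Lock, BH_Mapped, BH_Dirty, ... */
	struct buffer_head *b_reqnext;	/* chain of bhs inside one struct request */
	char *b_data;			/* start of the memory area */
	void (*b_end_io)(struct buffer_head *bh, int uptodate);
					/* per-bh completion callback */
	void *b_private;		/* cookie for b_end_io */
	unsigned long b_rsector;	/* sector on the real device */
	/* ... hash, LRU and page-cache linkage omitted ... */
};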

The fact that kiobufs can't be made to do the same thing is somebody else's
problem. I _know_ that merging has to happen late, and if others are
hitting their heads against this issue until they turn silly, then that's
their problem. You'll eventually learn, or you'll hit your heads into a
pulp. 

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:37                                               ` Ben LaHaise
  2001-02-06 18:00                                                 ` Jens Axboe
  2001-02-06 18:14                                                 ` Linus Torvalds
@ 2001-02-06 18:18                                                 ` Ingo Molnar
  2001-02-06 18:25                                                   ` Ben LaHaise
  2 siblings, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 18:18 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired. [...]

this is just so incorrect that it's not funny anymore.

- higher levels just do not have the kind of knowledge lower levels have.

- merging decisions are often not even *deterministic*.

- higher levels do not have the kind of state to eg. merge requests done
  by different users. The only chance for merging is often the lowest
  level, where we already know what disk, which sector.

- merging is not even *required* for some devices - and chances are high
  that we'll get away from this inefficient and unreliable 'rotating array
  of disks' business of storing bulk data in this century. (solid state
  disks, holographic storage, whatever.)

i'm truly shocked that you and Stephen are both saying this.

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:18                                                 ` Ingo Molnar
@ 2001-02-06 18:25                                                   ` Ben LaHaise
  2001-02-06 18:35                                                     ` Ingo Molnar
  0 siblings, 1 reply; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 18:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> - higher levels do not have the kind of state to eg. merge requests done
>   by different users. The only chance for merging is often the lowest
>   level, where we already know what disk, which sector.

That's what a readaround buffer is for, and I suspect that readaround will
give us a big performance boost.

> - merging is not even *required* for some devices - and chances are high
>   that we'll get away from this inefficient and unreliable 'rotating array
>   of disks' business of storing bulk data in this century. (solid state
>   disks, holographic storage, whatever.)

Interesting that you've brought up this point, as it's an example

> i'm truly shocked that you and Stephen are both saying this.

Merging != sorting.  Sorting of requests has to be carried out at the
lower layers, and the specific block device should be able to choose the
Right Thing To Do for the next item in a chain of sequential requests.

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:22                                               ` Christoph Hellwig
@ 2001-02-06 18:26                                                 ` Stephen C. Tweedie
  0 siblings, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-06 18:26 UTC (permalink / raw)
  To: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise,
	Ingo Molnar

Hi,

On Tue, Feb 06, 2001 at 06:22:58PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote:
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome.  How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
> 
> The object passed down to the low-level driver just needs to be able
> to contain multiple end-io callbacks.  The decision what to call when
> some of the scatter-gather entities fail is of course not so easy to
> handle and needs further discussion.

Umm, and if you want the separate higher-level IOs to be told which
IOs succeeded and which ones failed on error, you need to associate
each of the multiple completion callbacks with its particular
scatter-gather fragment or fragments.  So you end up with the same
sort of kiobuf/kiovec concept where you have chains of sg chunks, each
chunk with its own completion information.

This is *precisely* what I've been trying to get people to address.
Forget whether the individual sg fragments are based on pages or not:
if you want to have IO merging and accurate completion callbacks, you
need not just one sg list but multiple lists each with a separate
callback header.
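
To make that concrete, here is roughly the shape being described - a
sketch with hypothetical names, not an existing kernel interface:

struct sg_frag {
	struct page *page;		/* one contiguous piece of memory */
	unsigned int offset, length;
};

struct sg_chunk {
	struct sg_chunk *next;		/* chain built up by merging */
	int nr_frags;
	struct sg_frag *frags;		/* flat sg list for this higher-level IO */
	void (*end_io)(struct sg_chunk *chunk, int uptodate);
					/* per-chunk completion callback */
	void *private;			/* owner's cookie, e.g. the original kiobuf */
};

Each higher-level IO keeps its own end_io, so a failure in the middle of
a merged request can be reported to exactly the submitter it belongs to.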

Abandon the merging of sg-list requests (by moving that functionality
into the higher-level layers) and that problem disappears: flat
sg-lists will then work quite happily at the request layer.

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:25                                                   ` Ben LaHaise
@ 2001-02-06 18:35                                                     ` Ingo Molnar
  2001-02-06 18:54                                                       ` Ben LaHaise
  0 siblings, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 18:35 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > - higher levels do not have the kind of state to eg. merge requests done
> >   by different users. The only chance for merging is often the lowest
> >   level, where we already know what disk, which sector.
>
> That's what a readaround buffer is for, [...]

If you are merging based on (device, offset) values, then that's lowlevel
- and this is what we have been doing for years.

If you are merging based on (inode, offset), then it has flaws like not
being able to merge through a loopback or stacked filesystem.

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:35                                                     ` Ingo Molnar
@ 2001-02-06 18:54                                                       ` Ben LaHaise
  2001-02-06 18:58                                                         ` Ingo Molnar
  2001-02-06 19:20                                                         ` Linus Torvalds
  0 siblings, 2 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 18:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> If you are merging based on (device, offset) values, then that's lowlevel
> - and this is what we have been doing for years.
>
> If you are merging based on (inode, offset), then it has flaws like not
> being able to merge through a loopback or stacked filesystem.

I disagree.  Loopback filesystems typically have their data contiguously
on disk and won't split up incoming requests any further.

Here are the points I'm trying to address:

	- reduce the overhead in submitting block ios, especially for
	  large ios. Look at the %CPU usage differences between 512-byte
	  blocks and 4KB blocks; this can be better.
	- make asynchronous io possible in the block layer.  This is
	  impossible with the current ll_rw_block scheme and io request
	  plugging.
	- provide a generic mechanism for reordering io requests for
	  devices which will benefit from this.  Make it a library for
	  drivers to call into.  IDE for example will probably make use of
	  it, but some high end devices do this on the controller.  This
	  is the important point: Make it OPTIONAL.

You mentioned non-spindle-based io devices in your last message.  Take
something like a big RAM disk.  Now compare kiobuf-based io to buffer head
based io.  Tell me which one is going to perform better.

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:54                                                       ` Ben LaHaise
@ 2001-02-06 18:58                                                         ` Ingo Molnar
  2001-02-06 19:11                                                           ` Ben LaHaise
  2001-02-06 19:20                                                         ` Linus Torvalds
  1 sibling, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 18:58 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> 	- reduce the overhead in submitting block ios, especially for
> 	  large ios. Look at the %CPU usage differences between 512-byte
> 	  blocks and 4KB blocks; this can be better.

my system is already submitting 4KB bhs. If anyone's raw-IO setup submits
512-byte bhs, that's a problem of the raw IO code ...

> 	- make asynchronous io possible in the block layer.  This is
> 	  impossible with the current ll_rw_block scheme and io request
> 	  plugging.

why is it impossible?

> You mentioned non-spindle-based io devices in your last message.  Take
> something like a big RAM disk. Now compare kiobuf-based io to buffer
> head based io. Tell me which one is going to perform better.

roughly equal performance when using 4K bhs. And a hell of a lot more
complex and volatile code in the kiobuf case.

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:35                                                                 ` Ingo Molnar
@ 2001-02-06 19:05                                                                   ` Marcelo Tosatti
  2001-02-06 20:59                                                                     ` Ingo Molnar
  2001-02-07 18:27                                                                   ` Christoph Hellwig
  1 sibling, 1 reply; 186+ messages in thread
From: Marcelo Tosatti @ 2001-02-06 19:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Linus Torvalds, Ben LaHaise,
	Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ingo Molnar wrote:

> 
> On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> 
> > The second is that bh's are two things:
> >
> >  - a caching object
> >  - an io buffer
> >
> > This is not really a clean approach, and I would really like to get
> > away from it.
> 
> caching bmap() blocks was a recent addition around 2.3.20, and i suggested
> some time ago to cache pagecache blocks via explicit entries in struct
> page. That would be one solution - but it creates overhead.

Think about a given number of pages which are physically contiguous on
disk -- you don't need to cache the block number for each page, you just
need to cache the physical block number of the first page of the
"cluster".

SGI's pagebuf does that, and it would be great if we had something similar
in 2.5. 

It allows us to have fast IO clustering. 
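
A minimal sketch of the idea (hypothetical names and layout - SGI's
pagebuf itself is more involved):

/* Per-inode cluster cache: for pages that are physically contiguous
 * on disk, remember only the on-disk block of the first one.
 */
struct cluster_map {
	unsigned long start_index;	/* page index where the cluster begins */
	unsigned long start_block;	/* on-disk block of that first page */
	unsigned int nr_pages;		/* length of the contiguous run */
};

/* Block number for page 'index', or -1 when it misses the cached
 * cluster and a real bmap()/get_block() lookup is needed.
 */
static long cluster_bmap(struct cluster_map *cm, unsigned long index)
{
	if (index >= cm->start_index && index - cm->start_index < cm->nr_pages)
		return cm->start_block + (index - cm->start_index);
	return -1;
}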

> but there isnt anything wrong with having the bhs around to cache blocks -
> think of it as a 'cached and recycled IO buffer entry, with the block
> information cached'.

Usually we need to cache only block information (for clustering), and not
all the other stuff which buffer_head holds.

> frankly, my quick (and limited) hack to abuse bhs to cache blocks just
> cannot be a reason to replace bhs ...



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:58                                                         ` Ingo Molnar
@ 2001-02-06 19:11                                                           ` Ben LaHaise
  2001-02-06 19:32                                                             ` Jens Axboe
                                                                               ` (3 more replies)
  0 siblings, 4 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 19:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > 	- reduce the overhead in submitting block ios, especially for
> > 	  large ios. Look at the %CPU usage differences between 512-byte
> > 	  blocks and 4KB blocks; this can be better.
>
> my system is already submitting 4KB bhs. If anyone's raw-IO setup submits
> 512-byte bhs, that's a problem of the raw IO code ...
>
> > 	- make asynchronous io possible in the block layer.  This is
> > 	  impossible with the current ll_rw_block scheme and io request
> > 	  plugging.
>
> why is it impossible?

s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
a non-blocking variant that does all of the setup in the caller's context.
Yes, I know that we can do it with a kernel thread, but that isn't as
clean and it significantly penalises small ios (hint: databases issue
*lots* of small random ios and a good chunk of large ios).

> > You mentioned non-spindle-based io devices in your last message.  Take
> > something like a big RAM disk. Now compare kiobuf-based io to buffer
> > head based io. Tell me which one is going to perform better.
>
> roughly equal performance when using 4K bhs. And a hell of a lot more
> complex and volatile code in the kiobuf case.

I'm willing to benchmark you on this.

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:54                                                       ` Ben LaHaise
  2001-02-06 18:58                                                         ` Ingo Molnar
@ 2001-02-06 19:20                                                         ` Linus Torvalds
  1 sibling, 0 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 19:20 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ben LaHaise wrote:
> On Tue, 6 Feb 2001, Ingo Molnar wrote:
> 
> > If you are merging based on (device, offset) values, then that's lowlevel
> > - and this is what we have been doing for years.
> >
> > If you are merging based on (inode, offset), then it has flaws like not
> > being able to merge through a loopback or stacked filesystem.
> 
> I disagree.  Loopback filesystems typically have their data contiguously
> on disk and won't split up incoming requests any further.

Face it.

You NEED to merge and sort late. You _cannot_ do a good job early. Early
on, you don't have any concept of what the final IO pattern will be: you
will only have that once you've seen which requests are still pending etc,
something that the higher level layers CANNOT do.

Do you really want the higher levels to know about per-controller request
locking etc? I don't think so. 

Trust me. You HAVE to do the final decisions late in the game. You
absolutely _cannot_ get the best performance except for trivial and
uninteresting cases (ie one process that wants to read gigabytes of data
in one single stream) otherwise.

(It should be pointed out, btw, that SGI etc were often interested exactly
in the trivial and uninteresting cases. When you have the DoD asking you
to stream satellite pictures over the net as fast as you can, money being
no object, you get a rather twisted picture of what is important and what
is not)

And I will turn your own argument against you: if you do merging at a low
level anyway, there's little point in trying to do it at a higher level. 

Higher levels should do high-level sequencing. They can (and should) do
some amount of sorting - the lower levels will still do their own sort as
part of the merging anyway, and the lower level sorting may actually end
up being _different_ from a high-level sort because the lower levels know
about the topology of the device, but higher levels giving data with
"patterns" to it only make it easier for the lower levels to do a good
job. So high-level sorting is not _necessary_, but it's probably a good
idea.

High-level merging is almost certainly not even a good idea - higher
levels should try to _batch_ the requests, but that's a different issue,
and is again all about giving lower levels "patterns". It can also be about
simple issues like cache locality - batching things tends to make for
better icache (and possibly dcache) behaviour.

So you should separate out the issue of batching and merging. And you
absolutely should realize that you should NOT ignore Ingo's arguments
about loopback etc just because they don't fit the model you WANT them to
fit. The fact is that higher levels should NOT know about things like RAID
striping etc, yet that has a HUGE impact on the issue of merging (you do
_not_ want to merge requests to separate disks - you'll just have to split
them up again).

> Here are the points I'm trying to address:
> 
> 	- reduce the overhead in submitting block ios, especially for
> 	  large ios. Look at the %CPU usage differences between 512-byte
> 	  blocks and 4KB blocks; this can be better.

This is often a filesystem layer issue. Design your filesystem well, and
you get a lot of batching for free.

You can also batch the requests - this is basically what "readahead" is.
That helps a lot. But that is NOT the same thing as merging. Not at all.
The "batched" read-ahead requests may actually be split up among many
different disks - and they will each then get separately merged with
_other_ requests to those disks. See?

And trust me, THAT is how you get good performance. Not by merging early.
By merging late, and letting the disk layers do their own thing.

> 	- make asynchronous io possible in the block layer.  This is
> 	  impossible with the current ll_rw_block scheme and io request
> 	  plugging.

I'm surprised you say that. It's not only possible, but we do it all the
time. What do you think the swapout and writing is? How do you think that
read-ahead is actually _implemented_? Right. Read-ahead is NOT done as a
"merge" operation. It's done as several asynchronous IO operations that
the low-level stuff can choose (or not) to merge.

What do you think happens if you do a "submit_bh()"? It's a _purely_
asynchronous operation. It turns synchronous when you wait for the bh, not
before.

Your argument is nonsense.

> 	- provide a generic mechanism for reordering io requests for
> 	  devices which will benefit from this.  Make it a library for
> 	  drivers to call into.  IDE for example will probably make use of
> 	  it, but some high end devices do this on the controller.  This
> 	  is the important point: Make it OPTIONAL.

Ehh. You've just described exactly what we have.

This is what the whole elevator thing _is_. It's a library of routines.
You don't have to use them, and in fact many things DO NOT use them. The
loopback driver, for example, doesn't bother with sorting or merging at
all, because it knows that it's only supposed to pass the request on to
somebody else - who will do a hell of a lot better job of it.

Some high-end drivers have their own merging stuff, exactly because they
don't need the overhead - you're better off just feeding the request to
the controller as soon as you can, as the controller itself will do all
the merging and sorting anyway.

> You mentioned non-spindle-based io devices in your last message.  Take
> something like a big RAM disk.  Now compare kiobuf-based io to buffer head
> based io.  Tell me which one is going to perform better.

Buffer heads? 

Go and read the code.

Sure, it has some historical baggage still, but the fact is that it works
a hell of a lot better than kiobufs and it _does_ know about merging
multiple requests and handling errors in the middle of one request etc.
You can get the full advantage of streaming megabytes of data in one
request, AND still get proper error handling if it turns out that one
sector in the middle was bad.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11                                                           ` Ben LaHaise
@ 2001-02-06 19:32                                                             ` Jens Axboe
  2001-02-06 19:32                                                             ` Ingo Molnar
                                                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 19:32 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Linus Torvalds, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06 2001, Ben LaHaise wrote:
> > > 	- make asynchronous io possible in the block layer.  This is
> > > 	  impossible with the current ll_rw_block scheme and io request
> > > 	  plugging.
> >
> > why is it impossible?
> 
> s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
> a non-blocking variant that does all of the setup in the caller's context.
> Yes, I know that we can do it with a kernel thread, but that isn't as
> clean and it significantly penalises small ios (hint: databases issue
> *lots* of small random ios and a good chunk of large ios).

So make a non-blocking variant, not a big deal. Users of async I/O
know how to deal with resource limits anyway.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11                                                           ` Ben LaHaise
  2001-02-06 19:32                                                             ` Jens Axboe
@ 2001-02-06 19:32                                                             ` Ingo Molnar
  2001-02-06 19:32                                                             ` Linus Torvalds
  2001-02-06 19:46                                                             ` Ingo Molnar
  3 siblings, 0 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:32 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > > 	- make asynchronous io possible in the block layer.  This is
> > > 	  impossible with the current ll_rw_block scheme and io request
> > > 	  plugging.
> >
> > why is it impossible?
>
> s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to
> have a non-blocking variant that does all of the setup in the caller's
> context. [...]

sorry, but exactly what code are you comparing this to? The aio code you
sent a few days ago does not do this either. (And you did not answer my
questions regarding this issue.) What i saw is some scheme that at a point
relies on keventd (a kernel thread) to do the blocking stuff. [or, unless
i have misread the code, does the ->bmap() synchronously.]

indeed an asynchronous ll_rw_block() is possible and desirable (and not hard
at all - all structures are interrupt-safe already, as opposed to the kiovec
code), but this is only half of the story. The big issue for me is
an async ->bmap(). And we won't access ext2fs data structures from IRQ
handlers anytime soon - so true async IO right now is damn near
impossible. No matter what the IO-submission interface is: kiobufs/kiovecs
or bhs/requests.

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11                                                           ` Ben LaHaise
  2001-02-06 19:32                                                             ` Jens Axboe
  2001-02-06 19:32                                                             ` Ingo Molnar
@ 2001-02-06 19:32                                                             ` Linus Torvalds
  2001-02-06 19:44                                                               ` Ingo Molnar
                                                                                 ` (2 more replies)
  2001-02-06 19:46                                                             ` Ingo Molnar
  3 siblings, 3 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 19:32 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ben LaHaise wrote:
> 
> s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
> a non blocking variant that does all of the setup in the caller's context.
> Yes, I know that we can do it with a kernel thread, but that isn't as
> clean and it significantly penalises small ios (hint: databases issue
> *lots* of small random ios and a good chunk of large ios).

Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block() does
NOT block. Never has. Never will.

(Small correction: it doesn't block on anything else than allocating a
request structure if needed, and quite frankly, you have to block
SOMETIME. You can't just try to throw stuff at the device faster than it
can take it. Think of it as a "there can only be this many IO's in
flight")

If you want to use kiobuf's because you think they are asynchronous and
bh's aren't, then somebody has been feeding you a lot of crap. The kiobuf
PR department seems to have been working overtime on some FUD strategy.

The fact is that bh's can do MORE than kiobuf's. They have all the
callbacks in place etc. They merge and sort correctly. Oh, they have
limitations: one "bh" always describes just one memory area with a
"start,len" kind of thing. That's fine - scatter-gather is pushed
downwards, and the upper layers do not even need to know about it. Which
is what layering is all about, after all.

Traditionally, a "bh" is only _used_ for small areas, but that's not a
"bh" issue, that's a memory management issue. The code should pretty much
handle the issue of a single 64kB bh pretty much as-is, but nothing
creates them: the VM layer only creates bh's in sizes ranging from 512
bytes to a single page.

The IO layer could do more, but there has yet to be anybody who needed
more (because once you hit a page-size, you tend to get into
scatter-gather, so you want to have one bh per area - and let the
low-level IO level handle the actual merging etc).

Right now, on many normal setups, the thing that limits our ability to do
big IO requests is actually the fact that IDE cannot do more than 128kB
per request, for example (256 sectors). It's not the bh's or the VM layer.

If you want to make a "raw disk device", you can do so TODAY with bh's.
How? Don't use "bread()" (which allocates the backing store and creates
the cache). Allocate a separate anonymous bh (or multiple), and set them
up to point to whatever data source/sink you have, and let it rip. All
asynchronous. All with nice completion callbacks. All with existing code,
no kiobuf's in sight.
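
The pattern looks roughly like this - a sketch from memory of the 2.4
interfaces, with error handling and the highmem/b_page setup elided:

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/locks.h>

static void anon_end_io(struct buffer_head *bh, int uptodate)
{
	/* runs at IO completion time, possibly from interrupt context */
	mark_buffer_uptodate(bh, uptodate);
	unlock_buffer(bh);	/* clears BH_Lock, wakes anyone on b_wait */
	/* chain the next stage of a state machine here, or let a waiter run */
}

static void submit_anon_read(kdev_t dev, unsigned long block,
			     unsigned short size, char *data)
{
	struct buffer_head *bh = kmalloc(sizeof(*bh), GFP_KERNEL);

	memset(bh, 0, sizeof(*bh));
	init_waitqueue_head(&bh->b_wait);
	bh->b_dev = dev;		/* target device */
	bh->b_blocknr = block;		/* which block on it */
	bh->b_size = size;		/* one contiguous area ... */
	bh->b_data = data;		/* ... starting here */
	bh->b_state = (1 << BH_Mapped) | (1 << BH_Lock);
	atomic_set(&bh->b_count, 1);
	bh->b_end_io = anon_end_io;
	submit_bh(READ, bh);		/* purely asynchronous from here on */
}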

What more do you think your kiobuf's should be able to do?

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:09                                                   ` Ben LaHaise
@ 2001-02-06 19:35                                                     ` Jens Axboe
  0 siblings, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 19:35 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

On Tue, Feb 06 2001, Ben LaHaise wrote:
> > > As for io completion, can't we just issue separate requests for the
> > > critical data and the readahead?  That way for SCSI disks, the important
> > > io should be finished while the readahead can continue.  Thoughts?
> >
> > Priorities?
> 
> Definitely.  I'd like to be able to issue readaheads with a "don't bother
> executing this request unless the cost is low" bit set.  It might also
> be helpful for heavy multiuser loads (or even a single user with multiple
> processes) to ensure progress is made for others.

And in other contexts too it might be handy to assign priorities to
requests as well. I don't know how sgi plan on handling grio (or already
handle it in irix), maybe Steve can fill us in on that :)

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:32                                                             ` Linus Torvalds
@ 2001-02-06 19:44                                                               ` Ingo Molnar
  2001-02-06 19:49                                                               ` Ben LaHaise
  2001-02-06 20:25                                                               ` Christoph Hellwig
  2 siblings, 0 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Linus Torvalds wrote:

> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than
> it can take it. Think of it as a "there can only be this many IO's in
> flight")

yep. The/my goal would be to get some sort of async IO capability that is
able to read the pagecache without holding up the process. And just
because i've already implemented the helper-kernel-thread async IO variant
[in fact what TUX does is that there are per-CPU async IO helper threads,
and we always pick the 'localized' thread, to avoid unnecessary cross-CPU
traffic], i'd like to explore the possibility of getting this done via a
pure, IRQ-driven state-machine - which arguably has the lowest overhead.

but i just cannot find any robust way to do this with ext2fs (or any other
disk-based FS for that matter). The horror scenario: the inode block is
not cached yet, and the block resides in a triple-indirected block which
triggers 3 other block reads, before the actual data block can be read.
And i definitely do not see why kiobufs would help make this any easier.
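
schematically, the chain looks like this (read_block() is a hypothetical
stand-in for a blocking metadata read - this sketches the dependency, it
is not real ext2 code):

/* each step needs the previous block's contents before the next read
 * can even be issued: up to four dependent, serial disk reads before
 * the data block. this is what makes a pure IRQ-driven state machine
 * so hard to drive here.
 */
static u32 *read_block(u32 blk);	/* blocks until 'blk' is in memory */

static u32 *read_tind_data(u32 *inode_blocks, int a, int b, int c)
{
	u32 blk = inode_blocks[EXT2_TIND_BLOCK]; /* inode block may need a read too */

	blk = read_block(blk)[a];	/* triple-indirect block */
	blk = read_block(blk)[b];	/* double-indirect block */
	blk = read_block(blk)[c];	/* single-indirect block */
	return read_block(blk);		/* finally, the data */
}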

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11                                                           ` Ben LaHaise
                                                                               ` (2 preceding siblings ...)
  2001-02-06 19:32                                                             ` Linus Torvalds
@ 2001-02-06 19:46                                                             ` Ingo Molnar
  2001-02-06 20:16                                                               ` Ben LaHaise
  3 siblings, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:46 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > > You mentioned non-spindle-based io devices in your last message.  Take
> > > something like a big RAM disk. Now compare kiobuf-based io to buffer
> > > head based io. Tell me which one is going to perform better.
> >
> > roughly equal performance when using 4K bhs. And a hell of a lot more
> > complex and volatile code in the kiobuf case.
>
> I'm willing to benchmark you on this.

sure. Could you specify the actual workload, and desired test-setups?

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:32                                                             ` Linus Torvalds
  2001-02-06 19:44                                                               ` Ingo Molnar
@ 2001-02-06 19:49                                                               ` Ben LaHaise
  2001-02-06 19:57                                                                 ` Ingo Molnar
  2001-02-06 20:26                                                                 ` Linus Torvalds
  2001-02-06 20:25                                                               ` Christoph Hellwig
  2 siblings, 2 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Linus Torvalds wrote:

>
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
> > a non-blocking variant that does all of the setup in the caller's context.
> > Yes, I know that we can do it with a kernel thread, but that isn't as
> > clean and it significantly penalises small ios (hint: databases issue
> > *lots* of small random ios and a good chunk of large ios).
>
> Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block() does
> NOT block. Never has. Never will.
>
> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than it
> can take it. Think of it as a "there can only be this many IO's in
> flight")

This small correction is the crux of the problem: if it blocks, it takes
away from the ability of the process to continue doing useful work.  If it
returns -EAGAIN, then that's okay, the io will be resubmitted later when
other disk io has completed.  But, it should be possible to continue
servicing network requests or user io while disk io is underway.

> If you want to use kiobuf's because you think they are asycnrhonous and
> bh's aren't, then somebody has been feeding you a lot of crap. The kiobuf
> PR department seems to have been working overtime on some FUD strategy.

I'm using bh's to refer to what is currently being done, and kiobuf when
talking about what could be done.  It's probably the wrong thing to do,
and if bh's are extended to operate on arbitrary sized blocks then there
is no difference between the two.

> If you want to make a "raw disk device", you can do so TODAY with bh's.
> How? Don't use "bread()" (which allocates the backing store and creates
> the cache). Allocate a separate anonymous bh (or multiple), and set them
> up to point to whatever data source/sink you have, and let it rip. All
> asynchronous. All with nice completion callbacks. All with existing code,
> no kiobuf's in sight.

> What more do you think your kiobuf's should be able to do?

That's what my code is doing today.  There are a ton of bh's set up for a
single kiobuf request that is issued.  For something like a single 256kb
io, this is the difference between the batched io requests being passed
into submit_bh fitting in L1 cache and overflowing it.  Resizable bh's
would certainly improve this.

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:49                                                               ` Ben LaHaise
@ 2001-02-06 19:57                                                                 ` Ingo Molnar
  2001-02-06 20:07                                                                   ` Jens Axboe
                                                                                     ` (2 more replies)
  2001-02-06 20:26                                                                 ` Linus Torvalds
  1 sibling, 3 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:57 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Linus Torvalds, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> This small correction is the crux of the problem: if it blocks, it
> takes away from the ability of the process to continue doing useful
> work.  If it returns -EAGAIN, then that's okay, the io will be
> resubmitted later when other disk io has completed.  But, it should be
> possible to continue servicing network requests or user io while disk
> io is underway.

typical blocking point is waiting for page completion, not
__wait_request(). But, this is really not an issue, NR_REQUESTS can be
increased anytime. If NR_REQUESTS is large enough then think of it as the
'absolute upper limit of doing IO', and think of the blocking as 'the
kernel pulling the brakes'.

[overhead of 512-byte bhs in the raw IO code is an artificial problem of
the raw IO code.]

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:57                                                                 ` Ingo Molnar
@ 2001-02-06 20:07                                                                   ` Jens Axboe
  2001-02-06 20:25                                                                   ` Ben LaHaise
  2001-02-07  0:21                                                                   ` Stephen C. Tweedie
  2 siblings, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 20:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ben LaHaise, Linus Torvalds, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06 2001, Ingo Molnar wrote:
> > This small correction is the crux of the problem: if it blocks, it
> > takes away from the ability of the process to continue doing useful
> > work.  If it returns -EAGAIN, then that's okay, the io will be
> > resubmitted later when other disk io has completed.  But, it should be
> > possible to continue servicing network requests or user io while disk
> > io is underway.
> 
> typical blocking point is waiting for page completion, not
> __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> increased anytime. If NR_REQUESTS is large enough then think of it as the
> 'absolute upper limit of doing IO', and think of the blocking as 'the
> kernel pulling the brakes'.

Not just __get_request_wait, but also the limit on max locked buffers
in ll_rw_block. Serves the same purpose though, brake effect.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:42                                                                           ` Linus Torvalds
@ 2001-02-06 20:16                                                                             ` Marcelo Tosatti
  2001-02-06 22:09                                                                               ` Jens Axboe
  2001-02-06 21:57                                                                             ` Manfred Spraul
  1 sibling, 1 reply; 186+ messages in thread
From: Marcelo Tosatti @ 2001-02-06 20:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Manfred Spraul, Jens Axboe, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Linus Torvalds wrote:

> 
> 
> On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > Jens Axboe wrote:
> > > 
> > > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > > a waitqueue address, or a tq_struct pointer).
> > > 
> > > We don't even need that, non-blocking is implicitly applied with READA.
> > >
> > READA just returns - I doubt that the aio functions should poll until
> > there are free entries in the request queue.
> 
> The aio functions should NOT use READA/WRITEA. They should just use the
> normal operations, waiting for requests. The things that makes them
> asycnhronous is not waiting for the requests to _complete_. Which you can
> already do, trivially enough.

Reading write(2): 

       EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and there was
              no room in the pipe or socket connected to fd to  write  the data
              immediately.

I see no reason why "aio functions have to block waiting for requests".

_Why_ do they ?


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:46                                                             ` Ingo Molnar
@ 2001-02-06 20:16                                                               ` Ben LaHaise
  2001-02-06 20:22                                                                 ` Ingo Molnar
  0 siblings, 1 reply; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 20:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > > > You mentioned non-spindle-based io devices in your last message.  Take
> > > > something like a big RAM disk. Now compare kiobuf-based io to buffer
> > > > head based io. Tell me which one is going to perform better.
> > >
> > > roughly equal performance when using 4K bhs. And a hell of a lot more
> > > complex and volatile code in the kiobuf case.
> >
> > I'm willing to benchmark you on this.
>
> sure. Could you specify the actual workload, and desired test-setups?

Sure.  General parameters will be as follows (since I think we both have
access to these machines):

	- 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a
	  base install plus data files).
	- data to/from the ram block device must be copied within the ram
	  block driver.
	- the filesystem used must be ext2.  optimisations to ext2 for
	  tweaks to the interface are permitted & encouraged.

The main item I'm interested in is read (page cache cold)/synchronous
write performance for blocks from 256 bytes to 16MB in powers of two, much
like what I've done in testing the aio patches that shows where
improvement in latency is needed.  Including a few other items on disk
like the timings of find/make -s dep/bonnie/dbench would probably show
changes in throughput.  Sound fair?

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:16                                                               ` Ben LaHaise
@ 2001-02-06 20:22                                                                 ` Ingo Molnar
  0 siblings, 0 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 20:22 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> Sure.  General parameters will be as follows (since I think we both have
> access to these machines):
>
> 	- 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a
> 	  base install plus data files.
> 	- data to/from the ram block device must be copied within the ram
> 	  block driver.
> 	- the filesystem used must be ext2.  optimisations to ext2 for
> 	  tweaks to the interface are permitted & encouraged.
>
> The main item I'm interested in is read (page cache cold)/synchronous
> write performance for blocks from 256 bytes to 16MB in powers of two,
> much like what I've done in testing the aio patches that shows where
> improvement in latency is needed. Including a few other items on disk
> like the timings of find/make -s dep/bonnie/dbench would probably show
> changes in throughput. Sound fair?

yep, sounds fair.

	Ingo



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:57                                                                 ` Ingo Molnar
  2001-02-06 20:07                                                                   ` Jens Axboe
@ 2001-02-06 20:25                                                                   ` Ben LaHaise
  2001-02-06 20:41                                                                     ` Manfred Spraul
  2001-02-06 20:49                                                                     ` Jens Axboe
  2001-02-07  0:21                                                                   ` Stephen C. Tweedie
  2 siblings, 2 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-06 20:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > This small correction is the crux of the problem: if it blocks, it
> > takes away from the ability of the process to continue doing useful
> > work.  If it returns -EAGAIN, then that's okay, the io will be
> > resubmitted later when other disk io has completed.  But, it should be
> > possible to continue servicing network requests or user io while disk
> > io is underway.
>
> typical blocking point is waiting for page completion, not
> __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> increased anytime. If NR_REQUESTS is large enough then think of it as the
> 'absolute upper limit of doing IO', and think of the blocking as 'the
> kernel pulling the brakes'.

=)  This is what I'm seeing: lots of processes waiting with wchan ==
__get_request_wait.  With async io and a database flushing lots of io
asynchronously spread out across the disk, the NR_REQUESTS limit is hit
very quickly.

> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

True, and in the tests I've run, raw io is using 2KB blocks (same as the
database).

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:32                                                             ` Linus Torvalds
  2001-02-06 19:44                                                               ` Ingo Molnar
  2001-02-06 19:49                                                               ` Ben LaHaise
@ 2001-02-06 20:25                                                               ` Christoph Hellwig
  2001-02-06 20:35                                                                 ` Ingo Molnar
  2001-02-06 20:59                                                                 ` Linus Torvalds
  2 siblings, 2 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-06 20:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06, 2001 at 11:32:43AM -0800, Linus Torvalds wrote:
> Traditionally, a "bh" is only _used_ for small areas, but that's not a
> "bh" issue, that's a memory management issue. The code should pretty much
> handle the issue of a single 64kB bh pretty much as-is, but nothing
> creates them: the VM layer only creates bh's in sizes ranging from 512
> bytes to a single page.
> 
> The IO layer could do more, but there has yet to be anybody who needed
> more (becase once you hit a page-size, you tend to get into
> scatter-gather, so you want to have one bh per area - and let the
> low-level IO level handle the actual merging etc).

Yes.  That's one disadvantage blown away.

The second is that bh's are two things:

 - a caching object
 - an io buffer

This is not really a clean approach, and I would really like to
get away from it.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:49                                                               ` Ben LaHaise
  2001-02-06 19:57                                                                 ` Ingo Molnar
@ 2001-02-06 20:26                                                                 ` Linus Torvalds
  1 sibling, 0 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 20:26 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ben LaHaise wrote:
> 
> This small correction is the crux of the problem: if it blocks, it takes
> away from the ability of the process to continue doing useful work.  If it
> returns -EAGAIN, then that's okay, the io will be resubmitted later when
> other disk io has completed.  But, it should be possible to continue
> servicing network requests or user io while disk io is underway.

Ehh..  The support for this is actually all there already. It's just not
used, because nobody asked for it.

Check the "rw_ahead" variable in __make_request(). Notice how it does
everything you ask for.

So remind me again why we should need a whole new interface for something
that already exists but isn't exported because nobody needed it? It got
created for READA, but that isn't used any more.

You could absolutely _trivially_ re-introduce it (along with WRITEA), but
you should probably change the semantics of what happens when it doesn't
get a request. Something like making "submit_bh()" return an error value
for the case, instead of doing "bh->b_end_io(0..)" which is what I think
it does right now. That would make it easier for the submitter to say "oh,
the queue is full".

This is probably all of 5 lines of code.
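For concreteness, here is roughly what the caller side would look like under
those changed semantics. This is a hypothetical sketch: in stock 2.4,
submit_bh() returns void and a failed READA surfaces as bh->b_end_io(bh, 0)
instead, and my_requeue_later() is made-up bookkeeping.

	/* Hypothetical: assumes submit_bh() has been changed to return
	 * nonzero when a READA/WRITEA finds the request queue full,
	 * as proposed above. */
	if (submit_bh(READA, bh) != 0) {
		/* "oh, the queue is full" -- do other useful work now
		 * and resubmit later, possibly as a blocking READ */
		my_requeue_later(bh);	/* hypothetical bookkeeping */
	}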

I really think that people don't give the block device layer enough
credit. Some of it is quite ugly due to 10 years of history, and there is
certainly a lack of some interesting capabilities (there is no "barrier"
operation right now to enforce ordering, for example, and it really would
be sensible to support a wider set of ops than just read/write and
let the ioctl's use it to pass commands too).

These issues are things that I've been discussing with Jens for the last
few months, and are things that he has already been toying with to some
degree, and we already decided to try to do this during 2.5.x.

It's already been a _lot_ of clean-up with the per-queue request lists
etc, and there's more to be done in the cleanup section too. But the fact
is that too many people seem to have ignored the support that IS there,
and that actually works very well indeed - and is very generic.

> > What more do you think your kiobuf's should be able to do?
> 
> That's what my code is doing today.  There are a ton of bh's setup for a
> single kiobuf request that is issued.  For something like a single 256kb
> io, this is the difference between the batched io requests being passed
> into submit_bh fitting in L1 cache and overflowing it.  Resizable bh's
> would certainly improve this.

bh's _are_ resizeable. You just change bh->b_size, and you're done.

Of course, you'll need to do your own memory management for the backing
store. The generic bread() etc layer makes memory management simpler by
having just one size per page and making "struct page" their native mm
entity, but that's not really a bh issue - it's a MM issue and stems from
the fact that this is how all traditional block filesystems tend to want
to work.

NOTE! If you do start to resize the buffer heads, please give me a ping.
The code has never actually been _tested_ with anything but 512, 1024,
2048, 4096 and 8192-byte blocks. I would not be surprised at all if some
low-level drivers actually have asserts that the sizes are ones they
"recognize". The generic layer should be happy with anything that is a
multiple of 512, but as with all things, you'll probably find some gotchas
when you actually try something new.
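As a minimal illustration of the "just change bh->b_size" point (a sketch
only: big_buf, my_end_io and the surrounding bh setup are assumed to exist,
and per the NOTE above such sizes are largely untested in drivers):

	/* One bh covering a 32kB disk-contiguous span (64 sectors)
	 * instead of eight 4kB bhs.  big_buf is caller-provided
	 * backing store; b_size only needs to be a multiple of 512. */
	bh->b_rdev    = dev;
	bh->b_rsector = sector;		/* in 512-byte units */
	bh->b_size    = 32 * 1024;	/* the resizing: just set it */
	bh->b_data    = big_buf;
	bh->b_end_io  = my_end_io;	/* reclaim big_buf in here */
	generic_make_request(rw, bh);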

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:25                                                               ` Christoph Hellwig
@ 2001-02-06 20:35                                                                 ` Ingo Molnar
  2001-02-06 19:05                                                                   ` Marcelo Tosatti
  2001-02-07 18:27                                                                   ` Christoph Hellwig
  2001-02-06 20:59                                                                 ` Linus Torvalds
  1 sibling, 2 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 20:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Ben LaHaise, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar


On Tue, 6 Feb 2001, Christoph Hellwig wrote:

> The second is that bh's are two things:
>
>  - a cacheing object
>  - an io buffer
>
> This is not really a clean approach, and I would really like to get
> away from it.

caching bmap() blocks was a recent addition around 2.3.20, and i suggested
some time ago to cache pagecache blocks via explicit entries in struct
page. That would be one solution - but it creates overhead.

but there isnt anything wrong with having the bhs around to cache blocks -
think of it as a 'cached and recycled IO buffer entry, with the block
information cached'.

frankly, my quick (and limited) hack to abuse bhs to cache blocks just
cannot be a reason to replace bhs ...

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:25                                                                   ` Ben LaHaise
@ 2001-02-06 20:41                                                                     ` Manfred Spraul
  2001-02-06 20:50                                                                       ` Jens Axboe
  2001-02-06 20:49                                                                     ` Jens Axboe
  1 sibling, 1 reply; 186+ messages in thread
From: Manfred Spraul @ 2001-02-06 20:41 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Linus Torvalds, Stephen C. Tweedie, Alan Cox,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Ben LaHaise wrote:
> 
> On Tue, 6 Feb 2001, Ingo Molnar wrote:
> 
> >
> > On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > > This small correction is the crux of the problem: if it blocks, it
> > > takes away from the ability of the process to continue doing useful
> > > work.  If it returns -EAGAIN, then that's okay, the io will be
> > > resubmitted later when other disk io has completed.  But, it should be
> > > possible to continue servicing network requests or user io while disk
> > > io is underway.
> >
> > typical blocking point is waiting for page completion, not
> > __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> > increased anytime. If NR_REQUESTS is large enough then think of it as the
> > 'absolute upper limit of doing IO', and think of the blocking as 'the
> > kernel pulling the brakes'.
> 
> =)  This is what I'm seeing: lots of processes waiting with wchan ==
> __get_request_wait.  With async io and a database flushing lots of io
> asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> very quickly.
>
Does that have anything to do with kiobufs or buffer heads?

Several kernel functions need a "dontblock" parameter (or a callback, or
a waitqueue address, or a tq_struct pointer). 

--
	Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:25                                                                   ` Ben LaHaise
  2001-02-06 20:41                                                                     ` Manfred Spraul
@ 2001-02-06 20:49                                                                     ` Jens Axboe
  1 sibling, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 20:49 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Linus Torvalds, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06 2001, Ben LaHaise wrote:
> =)  This is what I'm seeing: lots of processes waiting with wchan ==
> __get_request_wait.  With async io and a database flushing lots of io
> asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> very quickly.

You can't do async I/O this way! Going by what Linus said, make submit_bh
return an int telling you whether it failed to queue the buffer, and use
READA/WRITEA to submit it.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:41                                                                     ` Manfred Spraul
@ 2001-02-06 20:50                                                                       ` Jens Axboe
  2001-02-06 21:26                                                                         ` Manfred Spraul
  0 siblings, 1 reply; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 20:50 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Ben LaHaise, Ingo Molnar, Linus Torvalds, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06 2001, Manfred Spraul wrote:
> > =)  This is what I'm seeing: lots of processes waiting with wchan ==
> > __get_request_wait.  With async io and a database flushing lots of io
> > asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> > very quickly.
> >
> Has that anything to do with kiobuf or buffer head?

Nothing

> Several kernel functions need a "dontblock" parameter (or a callback, or
> a waitqueue address, or a tq_struct pointer). 

We don't even need that, non-blocking is implicitly applied with READA.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:05                                                                   ` Marcelo Tosatti
@ 2001-02-06 20:59                                                                     ` Ingo Molnar
  2001-02-06 21:20                                                                       ` Steve Lord
  0 siblings, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-06 20:59 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Hellwig, Linus Torvalds, Ben LaHaise,
	Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Marcelo Tosatti wrote:

> Think about a given number of pages which are physically contiguous on
> disk -- you dont need to cache the block number for each page, you
> just need to cache the physical block number of the first page of the
> "cluster".

ranges are a hell of a lot more trouble to get right than page or
block-sized objects - and typical access patterns are rarely 'ranged'. As
long as the basic unit is not 'too small' (ie. not 512 byte, but something
more sane, like 4096 bytes), i dont think ranging done in higher levels
buys us anything valuable. And we do ranging at the request layer already
... Guess why most CPUs ended up having pages, and not "memory ranges"?
It's simpler, thus faster in the common case and easier to debug.

> Usually we need to cache only block information (for clustering), and
> not all the other stuff which buffer_head holds.

well, the other issue is that buffer_heads hold buffer-cache details as
well. But i think it's too small right now to justify any splitup - and
those issues are related enough to have significant allocation-merging
effects.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:25                                                               ` Christoph Hellwig
  2001-02-06 20:35                                                                 ` Ingo Molnar
@ 2001-02-06 20:59                                                                 ` Linus Torvalds
  2001-02-07 18:26                                                                   ` Christoph Hellwig
  1 sibling, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 20:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar



On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> 
> The second is that bh's are two things:
> 
>  - a cacheing object
>  - an io buffer

Actually, they really aren't.

They kind of _used_ to be, but more and more they've moved away from that
historical use. Check in particular the page cache, and as a really
extreme case the swap cache version of the page cache.

It certainly _used_ to be true that "bh"s were actually first-class memory
management citizens, and actually had a data buffer and a cache associated
with them. And because of that historical baggage, that's how many people
still think of them.

These days, it's really not true any more. A "bh" doesn't really have an
IO buffer intrinsically associated with it any more - all memory management
is done on a _page_ level, and it really works the other way around, ie a
page can have one or more bh's associated with it as the IO entity.

This _does_ show up in the bh itself: you find that bh's end up having the
bh->b_page pointer in it, which is really a layering violation these days,
but you'll notice that it's actually not used very much, and it could
probably be largely removed.

The most fundamental use of it (from an IO standpoint) is actually to
handle high memory issues, because high-memory handling is very
fundamentally based on "struct page", and in order to be able to have
high-memory IO buffers you absolutely have to have the "struct page" the
way things are done now.

(all the other uses tend to not be IO-related at all: they are stuff like
the callbacks that want to find the page that should be free'd up)

The other part of "struct bh" is that it _does_ have support for fast
lookups, and the bh hashing. Again, from a pure IO standpoint you can
easily choose to just ignore this. It's often not used at all (in fact,
_most_ bh's aren't hashed, because the only way to find them are through
the page cache).

> This is not really an clean appropeach, and I would really like to
> get away from it.

Trust me, you really _can_ get away from it. It's not designed into the
bh's at all. You can already just allocate a single (or multiple) "struct
buffer_head" and just use them as IO objects, and give them your _own_
pointers to the IO buffer etc.

In fact, if you look at how the page cache is organized, this is what the
page cache already does. The page cache has its own IO buffer (the page
itself), and it just uses "struct buffer_head" to allocate temporary IO
entities. It _also_ uses the "struct buffer_head" to cache the meta-data
in the sense of having the buffer head also contain the physical address
on disk so that the page cache doesn't have to ask the low-level
filesystem all the time, so in that sense it actually has a double use for
it.

But you can (and _should_) think of that as a "we got the meta-data
address caching for free, and it fit with our historical use, so why not
use it?".

So you can easily do the equivalent of

 - maintain your own buffers (possibly by looking up pages directly from
   user space, if you want to do zero-copy kind of things)

 - allocate a private buffer head ("get_unused_buffer_head()")

 - make that buffer head point into your buffer

 - submit the IO by just calling "submit_bh()", using the b_end_io()
   callback as your way to maintain _your_ IO buffer ownership.

In particular, think of the things that you do NOT have to do:

 - you do NOT have to allocate a bh-private buffer. Just point the bh at
   your own buffer.
 - you do NOT have to "give" your buffer to the bh. You do, of course,
   want to know when the bh is done with _your_ buffer, but that's what
   the b_end_io callback is all about.

 - you do NOT have to hash the bh you allocated and thus expose it to
   anybody else. It is YOUR private bh, and it does not show up on ANY
   other lists. There are various helper functions to insert the bh on
   various global lists ("mark_bh_dirty()" to put it on the dirty list,
   "buffer_insert_inode_queue()" to put it on the inode lists etc, but
   there is nothing in the thing that _forces_ you to expose your bh.

So don't think of "bh->b_data" as being something that the bh owns. It's
just a pointer. Think of "bh->b_data" and "bh->b_size" as _nothing_ more
than a data range in memory. 
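To make that recipe concrete, a minimal sketch against the 2.4-era
interfaces named above. It is illustrative only: the bh is kmalloc'ed here
rather than taken from get_unused_buffer_head(), my_end_io()/my_submit() are
invented names, and error handling is pared down.

	static void my_end_io(struct buffer_head *bh, int uptodate)
	{
		if (uptodate)
			set_bit(BH_Uptodate, &bh->b_state);
		/* from here on, _your_ buffer is yours again */
		unlock_buffer(bh);
	}

	static int my_submit(int rw, kdev_t dev, unsigned long sector,
			     char *buf, unsigned short size)
	{
		struct buffer_head *bh;

		bh = kmalloc(sizeof(*bh), GFP_KERNEL);
		if (!bh)
			return -ENOMEM;
		memset(bh, 0, sizeof(*bh));
		init_waitqueue_head(&bh->b_wait);

		bh->b_rdev    = dev;
		bh->b_rsector = sector;		/* 512-byte units */
		bh->b_size    = size;		/* multiple of 512 */
		bh->b_data    = buf;		/* YOUR buffer, not bh-owned */
		bh->b_end_io  = my_end_io;
		bh->b_state   = (1 << BH_Lock) | (1 << BH_Mapped);

		generic_make_request(rw, bh);	/* no wait for completion */
		return 0;
	}

A caller that wants synchronous behaviour can then wait_on_buffer(bh) and
check buffer_uptodate(bh), much as the page cache does with its own
temporary IO bhs.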

In short, you can, and often should, think of "struct buffer_head" as
nothing but an IO entity. It has some support for being more than that,
but that's secondary. That can validly be seen as another layer, that is
just so common that there is little point in splitting it up (and a lot of
purely historical reasons for not splitting it).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 22:26                                                                                 ` Linus Torvalds
@ 2001-02-06 21:13                                                                                   ` Marcelo Tosatti
  2001-02-06 23:26                                                                                     ` Linus Torvalds
  2001-02-07 23:15                                                                                   ` Pavel Machek
  1 sibling, 1 reply; 186+ messages in thread
From: Marcelo Tosatti @ 2001-02-06 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Linus Torvalds wrote:

> Remember: in the end you HAVE to wait somewhere. You're always going to be
> able to generate data faster than the disk can take it. SOMETHING has to
> throttle - if you don't allow generic_make_request() to throttle, you have
> to do it on your own at some point. It is stupid and counter-productive to
> argue against throttling. The only argument can be _where_ that throttling
> is done, and READA/WRITEA leaves the possibility open of doing it
> somewhere else (or just delaying it and letting a future call with
> READ/WRITE do the throttling).

Its not "arguing against throttling". 

Its arguing against making a smart application block on the disk while its
able to use the CPU for other work.
 
An application which sets non blocking behavior and busy waits for a
request (which seems to be your argument) is just stupid, of course.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:59                                                                     ` Ingo Molnar
@ 2001-02-06 21:20                                                                       ` Steve Lord
  0 siblings, 0 replies; 186+ messages in thread
From: Steve Lord @ 2001-02-06 21:20 UTC (permalink / raw)
  To: mingo
  Cc: Marcelo Tosatti, Christoph Hellwig, Linus Torvalds, Ben LaHaise,
	Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

> 
> On Tue, 6 Feb 2001, Marcelo Tosatti wrote:
> 
> > Think about a given number of pages which are physically contiguous on
> > disk -- you dont need to cache the block number for each page, you
> > just need to cache the physical block number of the first page of the
> > "cluster".
> 
> ranges are a hell of a lot more trouble to get right than page or
> block-sized objects - and typical access patterns are rarely 'ranged'. As
> long as the basic unit is not 'too small' (ie. not 512 byte, but something
> more sane, like 4096 bytes), i dont think ranging done in higher levels
> buys us anything valuable. And we do ranging at the request layer already
> ... Guess why most CPUs ended up having pages, and not "memory ranges"?
> It's simpler, thus faster in the common case and easier to debug.
> 
> > Usually we need to cache only block information (for clustering), and
> > not all the other stuff which buffer_head holds.
> 
> well, the other issue is that buffer_heads hold buffer-cache details as
> well. But i think it's too small right now to justify any splitup - and
> those issues are related enough to have significant allocation-merging
> effects.
> 
> 	Ingo

Think about it from the point of view of being able to reduce the number of
times you need to talk to the allocator in a filesystem. You can talk to
the allocator about all of your readahead pages in one go, or you can do
things like allocate on flush rather than allocating a page at a time (that is
a bit more complex, but not too much).

Having to talk to the allocator on a page by page basis is my pet peeve about
the current mechanisms.

Steve



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:50                                                                       ` Jens Axboe
@ 2001-02-06 21:26                                                                         ` Manfred Spraul
  2001-02-06 21:42                                                                           ` Linus Torvalds
  0 siblings, 1 reply; 186+ messages in thread
From: Manfred Spraul @ 2001-02-06 21:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ben LaHaise, Ingo Molnar, Linus Torvalds, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Jens Axboe wrote:
> 
> > Several kernel functions need a "dontblock" parameter (or a callback, or
> > a waitqueue address, or a tq_struct pointer).
> 
> We don't even need that, non-blocking is implicitly applied with READA.
>
READA just returns - I doubt that the aio functions should poll until
there are free entries in the request queue.

The pending aio requests should be "included" into the wait_for_requests
waitqueue (ok, they don't have a process context, thus a wait queue
entry doesn't help, but these requests belong in that wait queue)

--
	Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:26                                                                         ` Manfred Spraul
@ 2001-02-06 21:42                                                                           ` Linus Torvalds
  2001-02-06 20:16                                                                             ` Marcelo Tosatti
  2001-02-06 21:57                                                                             ` Manfred Spraul
  0 siblings, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 21:42 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar



On Tue, 6 Feb 2001, Manfred Spraul wrote:
> Jens Axboe wrote:
> > 
> > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > a waitqueue address, or a tq_struct pointer).
> > 
> > We don't even need that, non-blocking is implicitly applied with READA.
> >
> READA just returns - I doubt that the aio functions should poll until
> there are free entries in the request queue.

The aio functions should NOT use READA/WRITEA. They should just use the
normal operations, waiting for requests. The things that makes them
asynchronous is not waiting for the requests to _complete_. Which you can
already do, trivially enough.

The case for using READA/WRITEA is not that you want to do asynchronous
IO (all Linux IO is asynchronous unless you do extra work), but because
you have a case where you _might_ want to start IO, but if you don't have
a free request slot (ie there's already tons of pending IO happening), you
want the option of doing something else. This is not about aio - with aio
you _need_ to start the IO, you're just not willing to wait for it. 

An example of READA/WRITEA is if you want to do opportunistic dirty page
cleaning - you might not _have_ to clean it up, but you say

 "Hmm.. if you can do this simply without having to wait for other
  requests, start doing the writeout in the background. If not, I'll come
  back to you later after I've done more real work.."

And the Linux block device layer supports both of these kinds of "delayed
IO" already. It's all there. Today.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:42                                                                           ` Linus Torvalds
  2001-02-06 20:16                                                                             ` Marcelo Tosatti
@ 2001-02-06 21:57                                                                             ` Manfred Spraul
  2001-02-06 22:13                                                                               ` Linus Torvalds
  1 sibling, 1 reply; 186+ messages in thread
From: Manfred Spraul @ 2001-02-06 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Linus Torvalds wrote:
> 
> On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > Jens Axboe wrote:
> > >
> > > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > > a waitqueue address, or a tq_struct pointer).
> > >
> > > We don't even need that, non-blocking is implicitly applied with READA.
> > >
> > READA just returns - I doubt that the aio functions should poll until
> > there are free entries in the request queue.
> 
> The aio functions should NOT use READA/WRITEA. They should just use the
> normal operations, waiting for requests.

But then you end up with lots of threads blocking in get_request()

Quoting Ben's mail:
<<<<<<<<<
> 
> =)  This is what I'm seeing: lots of processes waiting with wchan ==
> __get_request_wait.  With async io and a database flushing lots of io
> asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> very quickly.
> 
>>>>>>>>>

On an io-bound server the request queue is always full - waiting for the
next request might take longer than the actual io.

--
	Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:16                                                                             ` Marcelo Tosatti
@ 2001-02-06 22:09                                                                               ` Jens Axboe
  2001-02-06 22:26                                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 186+ messages in thread
From: Jens Axboe @ 2001-02-06 22:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linus Torvalds, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Tue, Feb 06 2001, Marcelo Tosatti wrote:
> > > > We don't even need that, non-blocking is implicitly applied with READA.
> > > >
> > > READA just returns - I doubt that the aio functions should poll until
> > > there are free entries in the request queue.
> > 
> > The aio functions should NOT use READA/WRITEA. They should just use the
> > normal operations, waiting for requests. The things that makes them
> > asycnhronous is not waiting for the requests to _complete_. Which you can
> > already do, trivially enough.
> 
> Reading write(2): 
> 
>        EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and there was
>               no room in the pipe or socket connected to fd to  write  the data
>               immediately.
> 
> I see no reason why "aio functions have to block waiting for requests". 

That was my reasoning too with READA etc, but Linus seems to want us to be
able to block while submitting the I/O (as throttling, Linus?), just not
until completion.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:57                                                                             ` Manfred Spraul
@ 2001-02-06 22:13                                                                               ` Linus Torvalds
  2001-02-06 22:26                                                                                 ` Andre Hedrick
  0 siblings, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 22:13 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar



On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > 
> > The aio functions should NOT use READA/WRITEA. They should just use the
> > normal operations, waiting for requests.
> 
> But then you end up with lots of threads blocking in get_request()

So?

What the HELL do you expect to happen if somebody writes faster than the
disk can take?

You don't like busy-waiting. Fair enough.

So maybe blocking on a wait-queue is the right thing? Just MAYBE?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 22:13                                                                               ` Linus Torvalds
@ 2001-02-06 22:26                                                                                 ` Andre Hedrick
  0 siblings, 0 replies; 186+ messages in thread
From: Andre Hedrick @ 2001-02-06 22:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Manfred Spraul, Jens Axboe, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Linus Torvalds wrote:

> 
> 
> On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > > 
> > > The aio functions should NOT use READA/WRITEA. They should just use the
> > > normal operations, waiting for requests.
> > 
> > But then you end up with lots of threads blocking in get_request()
> 
> So?
> 
> What the HELL do you expect to happen if somebody writes faster than the
> disk can take?
> 
> You don't like busy-waiting. Fair enough.
> 
> So maybe blocking on a wait-queue is the right thing? Just MAYBE?

Did I miss a portion of the thread?
Is the block layer ignoring the status of a device?

--Andre

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 22:09                                                                               ` Jens Axboe
@ 2001-02-06 22:26                                                                                 ` Linus Torvalds
  2001-02-06 21:13                                                                                   ` Marcelo Tosatti
  2001-02-07 23:15                                                                                   ` Pavel Machek
  0 siblings, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 22:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Jens Axboe wrote:

> On Tue, Feb 06 2001, Marcelo Tosatti wrote:
> > 
> > Reading write(2): 
> > 
> >        EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and there was
> >               no room in the pipe or socket connected to fd to  write  the data
> >               immediately.
> > 
> > I see no reason why "aio function have to block waiting for requests". 
> 
> That was my reasoning too with READA etc, but Linus seems to want us to be
> able to block while submitting the I/O (as throttling, Linus?), just not
> until completion.

Note the "in the pipe or socket" part.
                 ^^^^    ^^^^^^

EAGAIN is _not_ a valid return value for block devices or for regular
files. And in fact it _cannot_ be, because select() is defined to always
return 1 on them - so if a write() were to return EAGAIN, user space would
have nothing to wait on. Busy waiting is evil.

So READA/WRITEA are only useful inside the kernel, and when the caller has
some data structures of its own that it can use to gracefully handle the
case of a failure - it will try to do the IO later for some reason, maybe
deciding to do it with blocking because it has nothing better to do at the
later date, or because it decides that it can have only so many
outstanding requests.

Remember: in the end you HAVE to wait somewhere. You're always going to be
able to generate data faster than the disk can take it. SOMETHING has to
throttle - if you don't allow generic_make_request() to throttle, you have
to do it on your own at some point. It is stupid and counter-productive to
argue against throttling. The only argument can be _where_ that throttling
is done, and READA/WRITEA leaves the possibility open of doing it
somewhere else (or just delaying it and letting a future call with
READ/WRITE do the throttling).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:13                                                                                   ` Marcelo Tosatti
@ 2001-02-06 23:26                                                                                     ` Linus Torvalds
  2001-02-07 23:17                                                                                       ` select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] Pavel Machek
  2001-02-08 15:06                                                                                       ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Ben LaHaise
  0 siblings, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-06 23:26 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Marcelo Tosatti wrote:
> 
> Its arguing against making a smart application block on the disk while its
> able to use the CPU for other work.

There are currently no other alternatives in user space. You'd have to
create whole new interfaces for aio_read/write, and ways for the kernel to
inform user space that "now you can re-try submitting your IO".

Could be done. But that's a big thing.

> An application which sets non blocking behavior and busy waits for a
> request (which seems to be your argument) is just stupid, of course.

Tell me what else it could do at some point? You need something like
select() to wait on it. There are no such interfaces right now...

(besides, latency would suck. I bet you're better off waiting for the
requests if they are all used up. It takes too long to get deep into the
kernel from user space, and you cannot use the exclusive waiters with their
anti-herd behaviour etc).

Simple rule: if you want to optimize concurrency and avoid waiting - use
several processes or threads instead. At which point you can get real work
done on multiple CPU's, instead of worrying about what happens when you
have to wait on the disk.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:57                                                                 ` Ingo Molnar
  2001-02-06 20:07                                                                   ` Jens Axboe
  2001-02-06 20:25                                                                   ` Ben LaHaise
@ 2001-02-07  0:21                                                                   ` Stephen C. Tweedie
  2001-02-07  0:25                                                                     ` Ingo Molnar
                                                                                       ` (2 more replies)
  2 siblings, 3 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07  0:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ben LaHaise, Linus Torvalds, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Hi,

On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
> 
> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

No, it is a problem of the ll_rw_block interface: buffer_heads need to
be aligned on disk at a multiple of their buffer size.  Under the Unix
raw IO interface it is perfectly legal to begin a 128kB IO at offset
512 bytes into a device.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:21                                                                   ` Stephen C. Tweedie
@ 2001-02-07  0:25                                                                     ` Ingo Molnar
  2001-02-07  0:36                                                                       ` Stephen C. Tweedie
  2001-02-07  0:42                                                                       ` Linus Torvalds
  2001-02-07  0:35                                                                     ` Jens Axboe
  2001-02-07  0:41                                                                     ` Linus Torvalds
  2 siblings, 2 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-07  0:25 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel


On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:

> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size.  Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

then we should either fix this limitation, or the raw IO code should split
the request up into several, variable-size bhs, so that the range is
filled out optimally with aligned bhs.
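A sketch of that kind of split (purely illustrative; submit_chunk() is a
hypothetical hook that would build and submit one bh per chunk): carve a
sector range into power-of-two chunks whose size always divides their
starting sector, which is what the ll_rw_block-style alignment rule demands.

	/* Split [sector, sector + nr) (512-byte sectors) into chunks
	 * aligned to their own size, capped at 4kB (8 sectors). */
	static void split_aligned(unsigned long sector, unsigned long nr,
				  void (*submit_chunk)(unsigned long s,
						       unsigned long len))
	{
		while (nr) {
			unsigned long len = 8;	/* start at the 4kB cap */

			/* shrink until len divides sector and fits */
			while (len > 1 &&
			       ((sector & (len - 1)) || len > nr))
				len >>= 1;

			submit_chunk(sector, len);
			sector += len;
			nr -= len;
		}
	}

For the 128kB-at-offset-512 case above, this emits a 512-byte, a 1kB and a
2kB chunk to reach 4kB alignment, then 4kB chunks for the bulk of the range
(plus a 512-byte tail).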

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:21                                                                   ` Stephen C. Tweedie
  2001-02-07  0:25                                                                     ` Ingo Molnar
@ 2001-02-07  0:35                                                                     ` Jens Axboe
  2001-02-07  0:41                                                                     ` Linus Torvalds
  2 siblings, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-07  0:35 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Wed, Feb 07 2001, Stephen C. Tweedie wrote:
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
> 
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size.  Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

Submitting buffers to lower layers that are not hw sector aligned
can't be supported below ll_rw_blk anyway (it can be done, but look at the
problems this has always created), and I would much rather see stuff
like this handled outside of there.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:25                                                                     ` Ingo Molnar
@ 2001-02-07  0:36                                                                       ` Stephen C. Tweedie
  2001-02-07  0:50                                                                         ` Linus Torvalds
  2001-02-07  1:42                                                                         ` Jeff V. Merkey
  2001-02-07  0:42                                                                       ` Linus Torvalds
  1 sibling, 2 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07  0:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Linus Torvalds,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

Hi,

On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> 
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.  Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
> 
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
enforces a single blocksize on all requests, but relaxing that
requirement is no big deal).  Buffer_heads can't deal with data which
spans more than a page right now.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:21                                                                   ` Stephen C. Tweedie
  2001-02-07  0:25                                                                     ` Ingo Molnar
  2001-02-07  0:35                                                                     ` Jens Axboe
@ 2001-02-07  0:41                                                                     ` Linus Torvalds
  2001-02-07  1:27                                                                       ` Stephen C. Tweedie
  2 siblings, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-07  0:41 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> 
> On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
> > 
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
> 
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size.

Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
traditional block device setup, where "b_blocknr" is the "virtual
blocknumber" and that indeed is tied in to the block size.

That's the whole _point_ of ll_rw_block() and friends - they show the
device at a different "virtual blocking" level than the low-level physical
accesses necessarily are. Which very much means that if you have a 4kB
"view", of the device, you get a stream of 4kB blocks. Not 4kB sized
blocks at 512-byte offsets (or whatebver the hardware blocking size is).

This way the interfaces are independent of the hardware blocksize. Which
is logical and what you'd expect. You need to go to a lower level to see
those kinds of blocking issues.

But it is _not_ true of "generic_make_request()" and the block IO layer in
general. It obviously _cannot_ be true, because the block I/O layer has
always had the notion of merging consecutive blocks together - regardless
of whether the end result is even a power of two or anything like that in
size. You can make an IO request for pretty much any size, as long as it's
a multiple of the hardware blocksize (normally 512 bytes, but there are
certainly devices out there with other blocksizes).

The fact is, if you have problems like the above, then you don't
understand the interfaces. And it sounds like you designed kiobuf support
around the wrong set of interfaces.

If you want to get at the _sector_ level, then you do

	lock_buffer(bh);	/* take BH_Lock before submission */
	bh->b_rdev = device;
	bh->b_rsector = sector-number (where linux defines "sector" to be 512 bytes)
	bh->b_size = size in bytes (must be a multiple of 512);
	bh->b_data = pointer;
	bh->b_end_io = callback;
	generic_make_request(rw, bh);

which doesn't look all that complicated to me. What's the problem?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:25                                                                     ` Ingo Molnar
  2001-02-07  0:36                                                                       ` Stephen C. Tweedie
@ 2001-02-07  0:42                                                                       ` Linus Torvalds
  1 sibling, 0 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-07  0:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel



On Tue, 6 Feb 2001, Ingo Molnar wrote:
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> 
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.  Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
> 
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

As mentioned, no such limitation exists if you just use the right
interfaces.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:36                                                                       ` Stephen C. Tweedie
@ 2001-02-07  0:50                                                                         ` Linus Torvalds
  2001-02-07  1:49                                                                           ` Stephen C. Tweedie
  2001-02-07  1:51                                                                           ` Jeff V. Merkey
  2001-02-07  1:42                                                                         ` Jeff V. Merkey
  1 sibling, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-07  0:50 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> 
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests but that relaxing that
> requirement is no big deal).  Buffer_heads can't deal with data which
> spans more than a page right now.

Stephen, you're so full of shit lately that it's unbelievable. You're
batting a clear 0.000 so far.

"struct buffer_head" can deal with pretty much any size: the only thing it
cares about is bh->b_size.

It so happens that if you have highmem support, then "create_bounce()"
will work on a per-page thing, but that just means that you'd better have
done your bouncing into low memory before you call generic_make_request().

Have you ever spent even just 5 minutes actually _looking_ at the block
device layer, before you decided that you think it needs to be completely
re-done some other way? It appears that you never bothered to.

Sure, I would not be surprised if some device driver ends up being
surpised if you start passing it different request sizes than it is used
to. But that's a driver and testing issue, nothing more.

(Which is not to say that "driver and testing" issues aren't important as
hell: it's one of the more scary things in fact, and it can take a long
time to get right if you start doing something that historically has never
been done and thus has historically never gotten any testing. So I'm not
saying that it should work out-of-the-box. But I _am_ saying that there's
no point in trying to re-design upper layers that already do ALL of this
with no problems at all).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:51                                                                           ` Jeff V. Merkey
@ 2001-02-07  1:01                                                                             ` Ingo Molnar
  2001-02-07  1:59                                                                               ` Jeff V. Merkey
  2001-02-07  1:02                                                                             ` Jens Axboe
  1 sibling, 1 reply; 186+ messages in thread
From: Ingo Molnar @ 2001-02-07  1:01 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel


On Tue, 6 Feb 2001, Jeff V. Merkey wrote:

> I remember Linus asking to try this variable buffer head chaining
> thing 512-1024-512 kind of stuff several months back, and mixing them
> to see what would happen -- result: about half the drivers break with
> it. [...]

time to fix them then - instead of rewriting the rest of the kernel ;-)

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:51                                                                           ` Jeff V. Merkey
  2001-02-07  1:01                                                                             ` Ingo Molnar
@ 2001-02-07  1:02                                                                             ` Jens Axboe
  2001-02-07  1:19                                                                               ` Linus Torvalds
  2001-02-07  2:00                                                                               ` Jeff V. Merkey
  1 sibling, 2 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-07  1:02 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> I remember Linus asking to try this variable buffer head chaining 
> thing 512-1024-512 kind of stuff several months back, and mixing them to 
> see what would happen -- result: about half the drivers break with it.
> The interface allows you to do it, and I've tried it (works on Andre's
> drivers, but a lot of SCSI drivers break), but a lot of drivers seem to
> have assumptions about these things all being the same size in a 
> buffer head chain. 

I don't see anything that would break doing this; in fact you can
do this as long as the buffers are all at least a multiple of the
block size. All the drivers I've inspected handle this fine; no one
assumes that rq->bh->b_size is the same in all the buffers attached
to the request. This includes SCSI (scsi_lib.c builds sg tables),
IDE, and the Compaq array + Mylex driver. This mostly leaves the
"old-style" drivers using CURRENT etc, the kernel helpers for these
handle it as well.
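To illustrate the old-style case, a sketch of the classic request-loop
shape (standard MAJOR_NR/blk.h boilerplate assumed; do_transfer() is a
made-up hardware hook): each iteration consumes one bh-sized segment, so
mixed b_size values just show up as differently sized segments.

	static void do_mydev_request(request_queue_t *q)
	{
		while (1) {
			INIT_REQUEST;	/* returns when queue is empty */

			/* one segment == one bh: its size is
			 * CURRENT->current_nr_sectors * 512, whatever
			 * b_size that particular bh happens to have */
			do_transfer(CURRENT->sector,
				    CURRENT->current_nr_sectors,
				    CURRENT->buffer);

			end_request(1);	/* complete segment, advance */
		}
	}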

So I would appreciate pointers to these devices that break so we
can inspect them.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  2:00                                                                               ` Jeff V. Merkey
@ 2001-02-07  1:06                                                                                 ` Ingo Molnar
  2001-02-07  1:09                                                                                   ` Jens Axboe
                                                                                                     ` (2 more replies)
  2001-02-07  1:08                                                                                 ` Jens Axboe
  1 sibling, 3 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-07  1:06 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Jens Axboe, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel


On Tue, 6 Feb 2001, Jeff V. Merkey wrote:

> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine; no one
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > handle it as well.
> >
> > So I would appreciate pointers to these devices that break so we
> > can inspect them.
> >
> > --
> > Jens Axboe
>
> Adaptec drivers had an oops; AIC7XXX also oopsed with it.

most likely some coding error on your side. buffer-size mismatches should
show up as filesystem corruption or random DMA scribble, not in-driver
oopses.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  2:00                                                                               ` Jeff V. Merkey
  2001-02-07  1:06                                                                                 ` Ingo Molnar
@ 2001-02-07  1:08                                                                                 ` Jens Axboe
  2001-02-07  2:08                                                                                   ` Jeff V. Merkey
  1 sibling, 1 reply; 186+ messages in thread
From: Jens Axboe @ 2001-02-07  1:08 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> Adaptec drivers had an oops; AIC7XXX also oopsed with it.

Do you still have this oops?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:06                                                                                 ` Ingo Molnar
@ 2001-02-07  1:09                                                                                   ` Jens Axboe
  2001-02-07  1:11                                                                                     ` Ingo Molnar
  2001-02-07  1:26                                                                                   ` Linus Torvalds
  2001-02-07  2:07                                                                                   ` Jeff V. Merkey
  2 siblings, 1 reply; 186+ messages in thread
From: Jens Axboe @ 2001-02-07  1:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeff V. Merkey, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Wed, Feb 07 2001, Ingo Molnar wrote:
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops; AIC7XXX also oopsed with it.
> 
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I would suspect so, aic7xxx shouldn't care about anything except the
sg entries and I would seriously doubt that it makes any such
assumptions on them :-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:09                                                                                   ` Jens Axboe
@ 2001-02-07  1:11                                                                                     ` Ingo Molnar
  0 siblings, 0 replies; 186+ messages in thread
From: Ingo Molnar @ 2001-02-07  1:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff V. Merkey, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel


On Wed, 7 Feb 2001, Jens Axboe wrote:

> > > Adaptec drivers had an oops; AIC7XXX also oopsed with it.
> >
> > most likely some coding error on your side. buffer-size mismatches should
> > show up as filesystem corruption or random DMA scribble, not in-driver
> > oopses.
>
> I would suspect so, aic7xxx shouldn't care about anything except the
> sg entries and I would seriously doubt that it makes any such
> assumptions on them :-)

yep - and not a single reference to b_size in aic7xxx.c.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:02                                                                             ` Jens Axboe
@ 2001-02-07  1:19                                                                               ` Linus Torvalds
  2001-02-07  1:39                                                                                 ` Jens Axboe
  2001-02-07  2:00                                                                               ` Jeff V. Merkey
  1 sibling, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-07  1:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel



On Wed, 7 Feb 2001, Jens Axboe wrote:
> 
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine; no one
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request.

It's really easy to get this wrong when going forward in the request list:
you need to make sure that you update "request->current_nr_sectors" each
time you move on to the next bh.
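
Something like this illustrative helper (the field names are the real
2.4 ones, but the function is a sketch, not code from any driver) is
the pattern that has to be right:

	static void advance_to_next_bh(struct request *req)
	{
		struct buffer_head *bh = req->bh;

		bh->b_end_io(bh, 1);		/* complete current bh */
		req->bh = bh->b_reqnext;
		if (req->bh) {
			/* the easily-forgotten part: the next bh may
			 * have a different b_size */
			req->current_nr_sectors = req->bh->b_size >> 9;
			req->buffer = req->bh->b_data;
		}
	}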

I would not be surprised if some of them have been seriously buggered. 

On the other hand, I would _also_ not be surprised if we've actually fixed
a lot of them: one of the things that the RAID code and the loopback test
exercise is exactly getting these kinds of issues right (not this exact
one, but similar ones).

And let's remember things like the old ultrastor driver that was totally
unable to handle anything but 1kB devices etc. I would not be _totally_
surprised if it turns out that there are still drivers out there that
remember the time when Linux only ever had 1kB buffers. Even if it is 7
years ago or so ;)

(Also, there might be drivers that are "optimized" - they set the IO
length once per request, and just never set it again as they do partial
end_io() calls. None of those kinds of issues would ever be found under
normal load, so I would be _really_ nervous about just turning it on
silently.) This is all very much a 2.5.x-kind of thing ;)

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:06                                                                                 ` Ingo Molnar
  2001-02-07  1:09                                                                                   ` Jens Axboe
@ 2001-02-07  1:26                                                                                   ` Linus Torvalds
  2001-02-07  2:07                                                                                   ` Jeff V. Merkey
  2 siblings, 0 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-07  1:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeff V. Merkey, Jens Axboe, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel



On Wed, 7 Feb 2001, Ingo Molnar wrote:
> 
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I'm not sure. If I was a driver writer (and I'm happy those days are
mostly behind me ;), I would not be totally dis-inclined to check for
various limits and things.

There can be hardware out there that simply has trouble with non-native
alignment, ie be unhappy about getting a 1kB request that is aligned in
memory at a 512-byte boundary. So there are real reasons why drivers might
need updating. Don't dismiss the concerns out-of-hand.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:41                                                                     ` Linus Torvalds
@ 2001-02-07  1:27                                                                       ` Stephen C. Tweedie
  2001-02-07  1:40                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07  1:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Hi,

On Tue, Feb 06, 2001 at 04:41:21PM -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.
> 
> Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
> traditional block device setup, where "b_blocknr" is the "virtual
> blocknumber" and that indeed is tied in to the block size.
> 
> The fact is, if you have problems like the above, then you don't
> understand the interfaces. And it sounds like you designed kiobuf support
> around the wrong set of interfaces.

They used the only interfaces available at the time...

> If you want to get at the _sector_ level, then you do
...
> which doesn't look all that complicated to me. What's the problem?

Doesn't this break nastily as soon as the IO hits an LVM or soft raid
device?  I don't think we are safe if we create a larger-sized
buffer_head which spans a raid stripe: the raid mapping is only
applied once per buffer_head.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:19                                                                               ` Linus Torvalds
@ 2001-02-07  1:39                                                                                 ` Jens Axboe
  2001-02-07  1:45                                                                                   ` Linus Torvalds
  0 siblings, 1 reply; 186+ messages in thread
From: Jens Axboe @ 2001-02-07  1:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Tue, Feb 06 2001, Linus Torvalds wrote:
> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine; no one
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request.
> 
> It's really easy to get this wrong when going forward in the request list:
> you need to make sure that you update "request->current_nr_sectors" each
> time you move on to the next bh.
> 
> I would not be surprised if some of them have been seriously buggered. 

Maybe have been, but it looks good at least with the general drivers
that I mentioned.

> [...] so I would be _really_ nervous about just turning it on
> silently. This is all very much a 2.5.x-kind of thing ;)

Then you might want to apply this :-)

--- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
+++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
@@ -1048,7 +1048,7 @@
 	/* Verify requested block sizes. */
 	for (i = 0; i < nr; i++) {
 		struct buffer_head *bh = bhs[i];
-		if (bh->b_size % correct_size) {
+		if (bh->b_size != correct_size) {
 			printk(KERN_NOTICE "ll_rw_block: device %s: "
 			       "only %d-char blocks implemented (%u)\n",
 			       kdevname(bhs[0]->b_dev),

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:27                                                                       ` Stephen C. Tweedie
@ 2001-02-07  1:40                                                                         ` Linus Torvalds
  2001-02-12 10:07                                                                           ` Jamie Lokier
  0 siblings, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-07  1:40 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > 
> > The fact is, if you have problems like the above, then you don't
> > understand the interfaces. And it sounds like you designed kiobuf support
> > around the wrong set of interfaces.
> 
> They used the only interfaces available at the time...

Ehh.. "generic_make_request()" goes back a _loong_ time. It used to be
called just "make_request()", but all my points still stand.

It's even exported to modules. As far as I know, the raid code has always
used this interface exactly because raid needed to feed back the remapped
stuff and get around the blocksizing in ll_rw_block().

This really isn't anything new. I _know_ it's there in 2.2.x, and I
would not be surprised if it was there even in 2.0.x.

> > If you want to get at the _sector_ level, then you do
> ...
> > which doesn't look all that complicated to me. What's the problem?
> 
> Doesn't this break nastily as soon as the IO hits an LVM or soft raid
> device?  I don't think we are safe if we create a larger-sized
> buffer_head which spans a raid stripe: the raid mapping is only
> applied once per buffer_head.

Absolutely. This is exactly what I mean by saying that low-level drivers
may not actually be able to handle new cases that they've never been asked
to do before - they just never saw anything like a 64kB request before or
something that crossed its own alignment.

But the _higher_ levels are there. And there's absolutely nothing in the
design that is a real problem. But there's no question that you might need
to fix up more than one or two low-level drivers.

(The only drivers I know better are the IDE ones, and as far as I can tell
they'd have no trouble at all with any of this. Most other normal drivers
are likely to be in this same situation. But because I've not had a reason
to test, I certainly won't guarantee even that).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:36                                                                       ` Stephen C. Tweedie
  2001-02-07  0:50                                                                         ` Linus Torvalds
@ 2001-02-07  1:42                                                                         ` Jeff V. Merkey
  1 sibling, 0 replies; 186+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  1:42 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

On Wed, Feb 07, 2001 at 12:36:29AM +0000, Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
> > 
> > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > 
> > > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > > be aligned on disk at a multiple of their buffer size.  Under the Unix
> > > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > > 512 bytes into a device.
> > 
> > then we should either fix this limitation, or the raw IO code should split
> > the request up into several, variable-size bhs, so that the range is
> > filled out optimally with aligned bhs.
> 
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests, but relaxing that
> requirement is no big deal).  Buffer_heads can't deal with data which
> spans more than a page right now.


I can handle requests larger than a page (64K), but I am not using
the buffer cache in Linux.  We really need an NT/NetWare-like model
to support the non-Unix FS's properly.

i.e.   

a disk request should be 

<disk> <lba> <length> <buffer> and get rid of this fixed block 
stuff with buffer heads. :-)
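
Something like this purely hypothetical descriptor (none of these
names exist in the tree):

	struct disk_request {
		kdev_t		dev;		/* <disk> */
		unsigned long	lba;		/* <lba>: start sector */
		unsigned long	nsectors;	/* <length> */
		char		*buffer;	/* <buffer>: one flat buffer */
	};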

I understand that the way the elevator is implemented in Linux makes
this very hard to support at this point, since it's very troublesome
to handle requests that overlap sector boundaries.

Jeff


> 
> --Stephen
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:39                                                                                 ` Jens Axboe
@ 2001-02-07  1:45                                                                                   ` Linus Torvalds
  2001-02-07  1:55                                                                                     ` Jens Axboe
  2001-02-07  9:10                                                                                     ` David Howells
  0 siblings, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-07  1:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel



On Wed, 7 Feb 2001, Jens Axboe wrote:
> 
> > [...] so I would be _really_ nervous about just turning it on
> > silently. This is all very much a 2.5.x-kind of thing ;)
> 
> Then you might want to apply this :-)
> 
> --- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
> +++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
> @@ -1048,7 +1048,7 @@
>  	/* Verify requested block sizes. */
>  	for (i = 0; i < nr; i++) {
>  		struct buffer_head *bh = bhs[i];
> -		if (bh->b_size % correct_size) {
> +		if (bh->b_size != correct_size) {
>  			printk(KERN_NOTICE "ll_rw_block: device %s: "
>  			       "only %d-char blocks implemented (%u)\n",
>  			       kdevname(bhs[0]->b_dev),

Actually, I'd rather leave it in, but speed it up with the saner and
faster

	if (bh->b_size & (correct_size-1)) {
		...

That way people who _want_ to test the odd-size thing can do so. And
normal code (that never generates requests on any other size than the
"native" size) won't ever notice either way.

(Oh, we'll eventually need to move to "correct_size == hardware
blocksize", not the "virtual blocksize" that it is now. As it is, a
tester needs to set the soft-blk size by hand now.)
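
(The two tests agree whenever correct_size is a power of two, since
x % n == x & (n-1) then: e.g. with correct_size == 1024 the mask is
1023, so a stray 1536-byte bh gives 1536 & 1023 == 512 and is caught,
while 2048 & 1023 == 0 passes.)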

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:50                                                                         ` Linus Torvalds
@ 2001-02-07  1:49                                                                           ` Stephen C. Tweedie
  2001-02-07  2:37                                                                             ` Linus Torvalds
  2001-02-07  1:51                                                                           ` Jeff V. Merkey
  1 sibling, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07  1:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

Hi,

On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > 
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests, but relaxing that
> > requirement is no big deal).  Buffer_heads can't deal with data which
> > spans more than a page right now.
> 
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.

Right now, anything larger than a page is physically non-contiguous,
and sorry if I didn't make that explicit, but I thought that was
obvious enough that I didn't need to.  We were talking about raw IO,
and as long as we're doing IO out of user anonymous data allocated
from individual pages, buffer_heads are limited to that page size in
this context.

> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.

Yes.  We still have this fundamental property: if a user sends in a
128kB IO, we end up having to split it up into buffer_heads and doing
a separate submit_bh() on each single one.  Given our VM, PAGE_SIZE
(*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
this case.

THAT is the overhead that I'm talking about: having to split a large
IO into small chunks, each of which just ends up having to be merged
back again into a single struct request by the *make_request code.
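
(Concretely: with 4 kB pages, a single 128 kB raw IO becomes 32
separate submit_bh() calls, all of which the *make_request code then
has to merge back into one request.)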

A constructed IO request basically doesn't care about anything in the
buffer_head except for the data pointer and size, and the completion
status info and callback.  All of the physical IO description is in
the struct request by this point.  The chain of buffer_heads is
carrying around a huge amount of information which isn't used by the
IO, and if the caller is something like the raw IO driver which isn't
using the buffer cache, that extra buffer_head data is just overhead. 

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:50                                                                         ` Linus Torvalds
  2001-02-07  1:49                                                                           ` Stephen C. Tweedie
@ 2001-02-07  1:51                                                                           ` Jeff V. Merkey
  2001-02-07  1:01                                                                             ` Ingo Molnar
  2001-02-07  1:02                                                                             ` Jens Axboe
  1 sibling, 2 replies; 186+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  1:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > 
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests, but relaxing that
> > requirement is no big deal).  Buffer_heads can't deal with data which
> > spans more than a page right now.
> 
> Stephen, you're so full of shit lately that it's unbelievable. You're
> batting a clear 0.000 so far.
> 
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.
> 
> It so happens that if you have highmem support, then "create_bounce()"
> will work on a per-page thing, but that just means that you'd better have
> done your bouncing into low memory before you call generic_make_request().
> 
> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.
> 
> Sure, I would not be surprised if some device driver ends up being
> surprised if you start passing it different request sizes than it is used
> to. But that's a driver and testing issue, nothing more.
> 
> (Which is not to say that "driver and testing" issues aren't important as
> hell: it's one of the more scary things in fact, and it can take a long
> time to get right if you start doing something that historically has never
> been done and thus has historically never gotten any testing. So I'm not
> saying that it should work out-of-the-box. But I _am_ saying that there's
> no point in trying to re-design upper layers that already do ALL of this
> with no problems at all).
> 
> 		Linus
> 

I remember Linus asking to try this variable buffer head chaining 
thing 512-1024-512 kind of stuff several months back, and mixing them to 
see what would happen -- result: about half the drivers break with it.
The interface allows you to do it, and I've tried it (works on Andre's
drivers, but a lot of SCSI drivers break), but a lot of drivers seem to
have assumptions about these things all being the same size in a
buffer head chain.

:-)

Jeff


> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:45                                                                                   ` Linus Torvalds
@ 2001-02-07  1:55                                                                                     ` Jens Axboe
  2001-02-07  9:10                                                                                     ` David Howells
  1 sibling, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-07  1:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Tue, Feb 06 2001, Linus Torvalds wrote:
> > > [...] so I would be _really_ nervous about just turning it on
> > > silently. This is all very much a 2.5.x-kind of thing ;)
> > 
> > Then you might want to apply this :-)
> > 
> > --- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
> > +++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
> > @@ -1048,7 +1048,7 @@
> >  	/* Verify requested block sizes. */
> >  	for (i = 0; i < nr; i++) {
> >  		struct buffer_head *bh = bhs[i];
> > -		if (bh->b_size % correct_size) {
> > +		if (bh->b_size != correct_size) {
> >  			printk(KERN_NOTICE "ll_rw_block: device %s: "
> >  			       "only %d-char blocks implemented (%u)\n",
> >  			       kdevname(bhs[0]->b_dev),
> 
> Actually, I'd rather leave it in, but speed it up with the saner and
> faster
> 
> 	if (bh->b_size & (correct_size-1)) {
> 		...
> 
> That way people who _want_ to test the odd-size thing can do so. And
> normal code (that never generates requests on any other size than the
> "native" size) won't ever notice either way.

Fine; as I said, I didn't spot anything bad, which is why it was changed.

> (Oh, we'll eventually need to move to "correct_size == hardware
> blocksize", not the "virtual blocksize" that it is now. As it is, a tester
> needs to set the soft-blk size by hand now.)

Exactly, wrt earlier mail about submitting < hw block size requests to
the lower levels.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:01                                                                             ` Ingo Molnar
@ 2001-02-07  1:59                                                                               ` Jeff V. Merkey
  0 siblings, 0 replies; 186+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  1:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

On Wed, Feb 07, 2001 at 02:01:54AM +0100, Ingo Molnar wrote:
> 
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
> 
> > I remember Linus asking to try this variable buffer head chaining
> > thing 512-1024-512 kind of stuff several months back, and mixing them
> > to see what would happen -- result: about half the drivers break with
> > it. [...]
> 
> time to fix them then - instead of rewriting the rest of the kernel ;-)
> 
> 	Ingo

I agree.  

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:02                                                                             ` Jens Axboe
  2001-02-07  1:19                                                                               ` Linus Torvalds
@ 2001-02-07  2:00                                                                               ` Jeff V. Merkey
  2001-02-07  1:06                                                                                 ` Ingo Molnar
  2001-02-07  1:08                                                                                 ` Jens Axboe
  1 sibling, 2 replies; 186+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  2:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Wed, Feb 07, 2001 at 02:02:21AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > I remember Linus asking to try this variable buffer head chaining 
> > thing 512-1024-512 kind of stuff several months back, and mixing them to 
> > see what would happen -- result: about half the drivers break with it.
> > The interface allows you to do it, and I've tried it (works on Andre's
> > drivers, but a lot of SCSI drivers break), but a lot of drivers seem to
> > have assumptions about these things all being the same size in a
> > buffer head chain.
> 
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine; no one
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request. This includes SCSI (scsi_lib.c builds sg tables),
> IDE, and the Compaq array + Mylex driver. This mostly leaves the
> "old-style" drivers using CURRENT etc, the kernel helpers for these
> handle it as well.
> 
> So I would appreciate pointers to these devices that break so we
> can inspect them.
> 
> -- 
> Jens Axboe

Adaptec drivers had an oops; AIC7XXX also oopsed with it.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:06                                                                                 ` Ingo Molnar
  2001-02-07  1:09                                                                                   ` Jens Axboe
  2001-02-07  1:26                                                                                   ` Linus Torvalds
@ 2001-02-07  2:07                                                                                   ` Jeff V. Merkey
  2 siblings, 0 replies; 186+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  2:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Wed, Feb 07, 2001 at 02:06:27AM +0100, Ingo Molnar wrote:
> 
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
> 
> > > I don't see anything that would break doing this, in fact you can
> > > do this as long as the buffers are all at least a multiple of the
> > > block size. All the drivers I've inspected handle this fine; no one
> > > assumes that rq->bh->b_size is the same in all the buffers attached
> > > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > > handle it as well.
> > >
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops; AIC7XXX also oopsed with it.
> 
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.
> 
> 	Ingo

The oops was in my code, but was caused by these drivers.  The Adaptec
driver did have an oops at its own code address; AIC7XXX
crashed in my code.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:08                                                                                 ` Jens Axboe
@ 2001-02-07  2:08                                                                                   ` Jeff V. Merkey
  0 siblings, 0 replies; 186+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  2:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Wed, Feb 07, 2001 at 02:08:53AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > Adaptec drivers had an oops; AIC7XXX also oopsed with it.
> 
> Do you still have this oops?
> 

I can recreate it.  Will work on it tomorrow.  SCI testing today.

Jeff

> -- 
> Jens Axboe
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:49                                                                           ` Stephen C. Tweedie
@ 2001-02-07  2:37                                                                             ` Linus Torvalds
  2001-02-07 14:52                                                                               ` Stephen C. Tweedie
  2001-02-07 19:12                                                                               ` Richard Gooch
  0 siblings, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-07  2:37 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > "struct buffer_head" can deal with pretty much any size: the only thing it
> > cares about is bh->b_size.
> 
> Right now, anything larger than a page is physically non-contiguous,
> and sorry if I didn't make that explicit, but I thought that was
> obvious enough that I didn't need to.  We were talking about raw IO,
> and as long as we're doing IO out of user anonymous data allocated
> from individual pages, buffer_heads are limited to that page size in
> this context.

Sure. That's obviously also one of the reasons why the IO layer has never
seen bigger requests anyway - the data _does_ tend to be fundamentally
broken up into page-size entities, if for no other reason than that is how
user-space sees memory.

However, I really _do_ want to have the page cache have a bigger
granularity than the smallest memory mapping size, and there are always
special cases that might be able to generate IO in bigger chunks (ie
in-kernel services etc)

> Yes.  We still have this fundamental property: if a user sends in a
> 128kB IO, we end up having to split it up into buffer_heads and doing
> a separate submit_bh() on each single one.  Given our VM, PAGE_SIZE
> (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
> this case.

Absolutely. And this is independent of what kind of interface we end up
using, whether it be kiobuf of just plain "struct buffer_head". In that
respect they are equivalent.

> THAT is the overhead that I'm talking about: having to split a large
> IO into small chunks, each of which just ends up having to be merged
> back again into a single struct request by the *make_request code.

You could easily just generate the bh then and there, if you wanted to.

Your overhead comes from the fact that you want to gather the IO together. 

And I'm saying that you _shouldn't_ gather the IO. There's no point. The
gathering is sufficiently done by the low-level code anyway, and I've
tried to explain why the low-level code _has_ to do that work regardless
of what upper layers do.

You need to generate a separate sg entry for each page anyway. So why not
just use the existing one? The "struct buffer_head". Which already
_handles_ all the issues that you have complained are hard to handle.
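
As a minimal sketch (illustrative only; error handling is omitted, and
dev, sector, data, size and my_end_io are the caller's), using a
private bh purely as an IO descriptor looks like:

	struct buffer_head *bh = kmalloc(sizeof(*bh), GFP_KERNEL);

	memset(bh, 0, sizeof(*bh));
	atomic_set(&bh->b_count, 1);
	bh->b_dev     = bh->b_rdev = dev;	/* target device */
	bh->b_rsector = sector;			/* disk location */
	bh->b_data    = data;			/* caller's buffer */
	bh->b_size    = size;
	bh->b_end_io  = my_end_io;		/* completion callback */
	bh->b_state   = (1 << BH_Lock) | (1 << BH_Mapped);
	generic_make_request(rw, bh);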

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:45                                                                                   ` Linus Torvalds
  2001-02-07  1:55                                                                                     ` Jens Axboe
@ 2001-02-07  9:10                                                                                     ` David Howells
  2001-02-07 12:16                                                                                       ` Stephen C. Tweedie
  1 sibling, 1 reply; 186+ messages in thread
From: David Howells @ 2001-02-07  9:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jens Axboe, linux-kernel, kiobuf-io-devel


Linus Torvalds <torvalds@transmeta.com> wrote:
> Actually, I'd rather leave it in, but speed it up with the saner and
> faster
>
>	if (bh->b_size & (correct_size-1)) {

I presume that correct_size will always be a power of 2...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  9:10                                                                                     ` David Howells
@ 2001-02-07 12:16                                                                                       ` Stephen C. Tweedie
  0 siblings, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07 12:16 UTC (permalink / raw)
  To: David Howells; +Cc: Linus Torvalds, Jens Axboe, linux-kernel, kiobuf-io-devel

Hi,

On Wed, Feb 07, 2001 at 09:10:32AM +0000, David Howells wrote:
> 
> I presume that correct_size will always be a power of 2...

Yes.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  2:37                                                                             ` Linus Torvalds
@ 2001-02-07 14:52                                                                               ` Stephen C. Tweedie
  2001-02-07 19:12                                                                               ` Richard Gooch
  1 sibling, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

Hi,

On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote:
> >
> However, I really _do_ want to have the page cache have a bigger
> granularity than the smallest memory mapping size, and there are always
> special cases that might be able to generate IO in bigger chunks (ie
> in-kernel services etc)

No argument there.

> > Yes.  We still have this fundamental property: if a user sends in a
> > 128kB IO, we end up having to split it up into buffer_heads and doing
> > a separate submit_bh() on each single one.  Given our VM, PAGE_SIZE
> > (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
> > this case.
> 
> Absolutely. And this is independent of what kind of interface we end up
> using, whether it be kiobuf of just plain "struct buffer_head". In that
> respect they are equivalent.

Sorry?  I'm not sure where communication is breaking down here, but
we really don't seem to be talking about the same things.  SGI's
kiobuf request patches already let us pass a large IO through the
request layer in a single unit without having to split it up to
squeeze it through the API.

> > THAT is the overhead that I'm talking about: having to split a large
> > IO into small chunks, each of which just ends up having to be merged
> > back again into a single struct request by the *make_request code.
> 
> You could easily just generate the bh then and there, if you wanted to.

In the current 2.4 tree, we already do: brw_kiovec creates the
temporary buffer_heads on demand to feed them to the IO layers.

> Your overhead comes from the fact that you want to gather the IO together. 

> And I'm saying that you _shouldn't_ gather the IO. There's no point.

I don't --- the underlying layer does.  And that is where the overhead
is: for every single large IO being created by the higher layers,
make_request is doing a dozen or more merges because I can only feed
the IO through make_request in tiny pieces.

> The
> gathering is sufficiently done by the low-level code anyway, and I've
> tried to explain why the low-level code _has_ to do that work regardless
> of what upper layers do.

I know.  The problem is the low-level code doing it a hundred times
for a single injected IO.

> You need to generate a separate sg entry for each page anyway. So why not
> just use the existing one? The "struct buffer_head". Which already
> _handles_ all the issues that you have complained are hard to handle.

Two issues here.  First is that the buffer_head is an enormously
heavyweight object for a sg-list fragment.  It contains a ton of
fields of interest only to the buffer cache.  We could mitigate this
to some extent by ensuring that the relevant fields for IO (rsector,
size, req_next, state, data, page etc) were in a single cache line.
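
Illustratively (this is not the actual 2.4 layout, just the grouping
being suggested):

	struct buffer_head {
		/* fields the IO path touches: one cache line if possible */
		struct buffer_head *b_reqnext;
		unsigned long	b_rsector;
		unsigned long	b_state;
		unsigned short	b_size;
		char		*b_data;
		struct page	*b_page;
		void		(*b_end_io)(struct buffer_head *, int);
		/* ... buffer-cache-only fields (hash, LRU, ...) follow ... */
	};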

Secondly, the cost of adding each single buffer_head to the request
list is O(n) in the number of requests already on the list.  We end up
walking potentially the entire request queue before finding the
request to merge against, and we do that again and again, once for
every single buffer_head in the list.  We do this even if the caller
went in via a multi-bh ll_rw_block() call in which case we know in
advance that all of the buffer_heads are contiguous on disk.
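
(In other words: with n requests already queued and m buffer_heads in
the injected IO, that is an O(n) queue scan repeated m times, O(n*m)
in total, for pieces that were known to be contiguous before the
first scan.)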


There is a side problem: right now, things like raid remapping occur
during generic_make_request, before we have a request built.  That
means that all of the raid0 remapping or raid1/5 request expanding is
being done on a per-buffer_head, not per-request, basis, so again
we're doing a whole lot of unnecessary duplicate work when an IO
larger than a buffer_head is submitted.


If you really don't mind the size of the buffer_head as a sg fragment
header, then at least I'd like us to be able to submit a pre-built
chain of bh's all at once without having to go through the remap/merge
cost for each single bh.
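
Hypothetically, even something as small as (no such function exists
today):

	/* submit a pre-built, disk-contiguous bh chain in one call, so
	 * remapping and merging are done once per chain, not per bh */
	void submit_bh_chain(int rw, struct buffer_head *chain);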

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:59                                                                 ` Linus Torvalds
@ 2001-02-07 18:26                                                                   ` Christoph Hellwig
  2001-02-07 18:36                                                                     ` Linus Torvalds
  0 siblings, 1 reply; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-07 18:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> > 
> > The second is that bh's are two things:
> > 
> >  - a caching object
> >  - an io buffer
> 
> Actually, they really aren't.
> 
> They kind of _used_ to be, but more and more they've moved away from that
> historical use. Check in particular the page cache, and as a really
> extreme case the swap cache version of the page cache.

Yes.  And that is exactly why I think it's ugly to have the left-over
caching stuff in the same data structure as the IO buffer.

> It certainly _used_ to be true that "bh"s were actually first-class memory
> management citizens, and actually had a data buffer and a cache associated
> with them. And because of that historical baggage, that's how many people
> still think of them.

I do know that the pagecache is our primary cache now :)
Anyway, having that caching cruft still in is ugly.

> > This is not really a clean approach, and I would really like to
> > get away from it.
> 
> Trust me, you really _can_ get away from it. It's not designed into the
> bh's at all. You can already just allocate a single (or multiple) "struct
> buffer_head" and just use them as IO objects, and give them your _own_
> pointers to the IO buffer etc.

So true.  Exactly because of that, the data structures should be
separated as well.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:35                                                                 ` Ingo Molnar
  2001-02-06 19:05                                                                   ` Marcelo Tosatti
@ 2001-02-07 18:27                                                                   ` Christoph Hellwig
  1 sibling, 0 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-07 18:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Ben LaHaise, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06, 2001 at 09:35:58PM +0100, Ingo Molnar wrote:
> Caching bmap() blocks was a recent addition around 2.3.20, and I suggested
> some time ago caching pagecache blocks via explicit entries in struct
> page. That would be one solution - but it creates overhead.
> 
> but there isn't anything wrong with having the bhs around to cache blocks -
> think of it as a 'cached and recycled IO buffer entry, with the block
> information cached'.

I was not talking about caching physical blocks, but about the
remaining buffer-cache support stuff.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 18:26                                                                   ` Christoph Hellwig
@ 2001-02-07 18:36                                                                     ` Linus Torvalds
  2001-02-07 18:44                                                                       ` Christoph Hellwig
  2001-02-08  0:34                                                                       ` Neil Brown
  0 siblings, 2 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-07 18:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar



On Wed, 7 Feb 2001, Christoph Hellwig wrote:

> On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> > 
> > Actually, they really aren't.
> > 
> > They kind of _used_ to be, but more and more they've moved away from that
> > historical use. Check in particular the page cache, and as a really
> > extreme case the swap cache version of the page cache.
> 
> Yes.  And that is exactly why I think it's ugly to have the left-over
> caching stuff in the same data structure as the IO buffer.

I do agree.

I would not be opposed to factoring out the "pure block IO" part from the
bh struct. It should not even be very hard. You'd do something like

	struct block_io {
		.. here is the stuff needed for block IO ..
	};

	struct buffer_head {
		struct block_io io;
		.. here is the stuff needed for hashing etc ..
	}

and then you make "generic_make_request()" and everything lower down take
just the "struct block_io".

You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's,
because they know about bh semantics (ie things like scaling the sector
number to the bh size etc). Which means that pretty much all the code
outside the block layer wouldn't even _notice_. Which is a sign of good
layering.
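
For instance (illustrative only, like the structs above; neither
struct exists in the tree):

	void submit_bh(int rw, struct buffer_head *bh)
	{
		/* the bh-specific part: scale the virtual block number
		 * to a sector before handing down the pure IO part */
		bh->io.b_rsector = bh->b_blocknr * (bh->b_size >> 9);
		generic_make_request(rw, &bh->io);
	}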

If you want to do this, please do go ahead.

But do realize that this is not exactly a 2.4.x thing ;)

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 18:36                                                                     ` Linus Torvalds
@ 2001-02-07 18:44                                                                       ` Christoph Hellwig
  2001-02-08  0:34                                                                       ` Neil Brown
  1 sibling, 0 replies; 186+ messages in thread
From: Christoph Hellwig @ 2001-02-07 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Wed, Feb 07, 2001 at 10:36:47AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 7 Feb 2001, Christoph Hellwig wrote:
> 
> > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> > > 
> > > Actually, they really aren't.
> > > 
> > > They kind of _used_ to be, but more and more they've moved away from that
> > > historical use. Check in particular the page cache, and as a really
> > > extreme case the swap cache version of the page cache.
> > 
> > Yes.  And that is exactly why I think it's ugly to have the left-over
> > caching stuff in the same data structure as the IO buffer.
> 
> I do agree.
> 
> I would not be opposed to factoring out the "pure block IO" part from the
> bh struct. It should not even be very hard. You'd do something like
> 
> 	struct block_io {
> 		.. here is the stuff needed for block IO ..
> 	};
> 
> 	struct buffer_head {
> 		struct block_io io;
> 		.. here is the stuff needed for hashing etc ..
> 	}
> 
> and then you make "generic_make_request()" and everything lower down take
> just the "struct block_io".

Yep.  (Besides, the name block_io sucks :))

> You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's,
> because they know about bh semantics (ie things like scaling the sector
> number to the bh size etc). Which means that pretty much all the code
> outside the block layer wouldn't even _notice_. Which is a sign of good
> layering.

Yep.

> If you want to do this, please do go ahead.

I'll take a look at it.

> But do realize that this is not exactly a 2.4.x thing ;)

Sure.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  2:37                                                                             ` Linus Torvalds
  2001-02-07 14:52                                                                               ` Stephen C. Tweedie
@ 2001-02-07 19:12                                                                               ` Richard Gooch
  2001-02-07 20:03                                                                                 ` Stephen C. Tweedie
  1 sibling, 1 reply; 186+ messages in thread
From: Richard Gooch @ 2001-02-07 19:12 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

Stephen C. Tweedie writes:
> Hi,
> 
> On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote:
> > Absolutely. And this is independent of what kind of interface we end up
> > using, whether it be kiobuf of just plain "struct buffer_head". In that
> > respect they are equivalent.
> 
> Sorry?  I'm not sure where communication is breaking down here, but
> we really don't seem to be talking about the same things.  SGI's
> kiobuf request patches already let us pass a large IO through the
> request layer in a single unit without having to split it up to
> squeeze it through the API.

Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you
don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB
buffer_head is effectively the same thing.

> If you really don't mind the size of the buffer_head as a sg fragment
> header, then at least I'd like us to be able to submit a pre-built
> chain of bh's all at once without having to go through the remap/merge
> cost for each single bh.

Even if you are limited to feeding one buffer_head at a time, the
merge costs should be somewhat mitigated, since you'll decrease your
calls into the API by a factor of 8 or 16.
But an API extension to allow passing a pre-built chain would be even
better.

Hopefully I haven't missed the point. I've got the flu so I'm not
running on all 4 cylinders :-(

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 19:12                                                                               ` Richard Gooch
@ 2001-02-07 20:03                                                                                 ` Stephen C. Tweedie
  0 siblings, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07 20:03 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Stephen C. Tweedie, Linus Torvalds, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

Hi,

On Wed, Feb 07, 2001 at 12:12:44PM -0700, Richard Gooch wrote:
> Stephen C. Tweedie writes:
> > 
> > Sorry?  I'm not sure where communication is breaking down here, but
> > we really don't seem to be talking about the same things.  SGI's
> > kiobuf request patches already let us pass a large IO through the
> > request layer in a single unit without having to split it up to
> > squeeze it through the API.
> 
> Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you
> don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB
> buffer_head is effectively the same thing.

kiobufs let you encode _any_ contiguous region of user VA or of an
inode's page cache contents in one kiobuf, no matter how many pages
there are in it.  A write of a megabyte to a raw device can be encoded
as a single kiobuf if we want to pass the entire 1MB IO down to the
block layers untouched.  That's what the page vector in the kiobuf is
for.

Doing the same thing with buffer_heads would still require a couple of
hundred of them, and you'd have to submit each such buffer_head to the
IO subsystem independently.  And then the IO layer will just have to
reassemble them on the other side (and it may have to scan the
device's entire request queue once for every single buffer_head to do
so).
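
For reference, the rough shape of what is being described here (a
trimmed-down sketch of the 2.4 kiobuf, field names from memory and
possibly inexact):

	struct kiobuf {
		int		nr_pages;	/* pages in the vector */
		int		offset;		/* byte offset into the first page */
		int		length;		/* total bytes in this IO */
		struct page	**maplist;	/* the page vector itself */
		void		(*end_io)(struct kiobuf *);	/* one completion for the lot */
	};

On 4k pages, a 1MB write is then 256 maplist entries but still a single
end_io, where the buffer_head version is hundreds of separate
submissions, each with its own completion.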

> But an API extension to allow passing a pre-built chain would be even
> better.

Yep.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 22:26                                                                                 ` Linus Torvalds
  2001-02-06 21:13                                                                                   ` Marcelo Tosatti
@ 2001-02-07 23:15                                                                                   ` Pavel Machek
  2001-02-08 13:22                                                                                     ` Stephen C. Tweedie
  2001-02-08 14:52                                                                                     ` Mikulas Patocka
  1 sibling, 2 replies; 186+ messages in thread
From: Pavel Machek @ 2001-02-07 23:15 UTC (permalink / raw)
  To: Linus Torvalds, Jens Axboe
  Cc: Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

Hi!

> > > Reading write(2): 
> > > 
> > >        EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and there was
> > >               no room in the pipe or socket connected to fd to  write  the data
> > >               immediately.
> > > 
> > > I see no reason why "aio function have to block waiting for requests". 
> > 
> > That was my reasoning too with READA etc, but Linus seems to want us to be
> > able to block while submitting the I/O (as throttling, Linus?), just not
> > until completion.
> 
> Note the "in the pipe or socket" part.
>                  ^^^^    ^^^^^^
> 
> EAGAIN is _not_ a valid return value for block devices or for regular
> files. And in fact it _cannot_ be, because select() is defined to always
> return 1 on them - so if a write() were to return EAGAIN, user space would
> have nothing to wait on. Busy waiting is evil.

So you consider the inability to select() on regular files a _feature_?

It can be a pretty serious problem with slow block devices
(floppy). It also hurts when you are trying to do high-performance
reads/writes. [I know it hurt in the userspace Sherlock search engine --
a kind of small AltaVista.]

How do you write a high-performance ftp server without threads if select
on a regular file always returns "ready"?
 

> Remember: in the end you HAVE to wait somewhere. You're always going to be
> able to generate data faster than the disk can take it. SOMETHING

Userspace wants to _know_ when to stop. It asks politely using
"select()".
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
  2001-02-06 23:26                                                                                     ` Linus Torvalds
@ 2001-02-07 23:17                                                                                       ` Pavel Machek
  2001-02-08 13:57                                                                                         ` Ben LaHaise
  2001-02-08 17:52                                                                                         ` Linus Torvalds
  2001-02-08 15:06                                                                                       ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Ben LaHaise
  1 sibling, 2 replies; 186+ messages in thread
From: Pavel Machek @ 2001-02-07 23:17 UTC (permalink / raw)
  To: Linus Torvalds, Marcelo Tosatti
  Cc: Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

Hi!

> > It's arguing against making a smart application block on the disk while it's
> > able to use the CPU for other work.
> 
> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel to
> inform user space that "now you can re-try submitting your IO".

Why is the current select() interface not good enough?

Defining that select may say a regular file is not ready should be
enough. Okay, maybe you'd want a new fcntl() flag saying "I _really_
want this regular file to be non-blocking". No need for new
interfaces.
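
In user-space terms the proposal amounts to this (hypothetical semantics --
today select() on a regular file always reports ready):

	fd = open("bigfile", O_RDONLY);
	fcntl(fd, F_SETFL, O_NONBLOCK);	/* "_really_ non-blocking" */
	FD_SET(fd, &rfds);
	select(fd + 1, &rfds, NULL, NULL, NULL);	/* would block until readable */
	n = read(fd, buf, sizeof(buf));	/* would return EAGAIN instead of sleeping */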
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 18:36                                                                     ` Linus Torvalds
  2001-02-07 18:44                                                                       ` Christoph Hellwig
@ 2001-02-08  0:34                                                                       ` Neil Brown
  1 sibling, 0 replies; 186+ messages in thread
From: Neil Brown @ 2001-02-08  0:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Wednesday February 7, torvalds@transmeta.com wrote:
> 
> 
> On Wed, 7 Feb 2001, Christoph Hellwig wrote:
> 
> > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> > > 
> > > Actually, they really aren't.
> > > 
> > > They kind of _used_ to be, but more and more they've moved away from that
> > > historical use. Check in particular the page cache, and as a really
> > > extreme case the swap cache version of the page cache.
> > 
> > Yes.  And that's exactly why I think it's ugly to have the left-over
> > caching stuff in the same data structure as the IO buffer.
> 
> I do agree.
> 
> I would not be opposed to factoring out the "pure block IO" part from the
> bh struct. It should not even be very hard. You'd do something like
> 
> 	struct block_io {
> 		.. here is the stuff needed for block IO ..
> 	};
> 
> 	struct buffer_head {
> 		struct block_io io;
> 		.. here is the stuff needed for hashing etc ..
> 	}
> 
> and then you make "generic_make_request()" and everything lower down take
> just the "struct block_io".
> 

I was just thinking the same, or a similar thing.
I wanted to do

    struct io_head {
         stuff
    };
    struct buffer_head {
         struct io_head;
         more stuff;
    }

so that, as an unnamed substructure, the content of the struct io_head
would automagically be promoted to appear to be content of
buffer_head.
However I then remembered (when it didn't work) that unnamed
substructures are a feature of the Plan-9 C compiler, not the GNU
Compiler Collection. (Any gcc coders out there think this would be a
good thing to add?
  http://plan9.bell-labs.com/sys/doc/compiler.html
)

Anyway, I produced the same result in a rather ugly way with #defines
and modified raid5 to use 32-byte block_io structures instead of the
80+ byte buffer_heads, and it ... doesn't quite work :-( it boots
fine, but raid5 dies and the Oops message is a few kilometers away.
Anyway, I think the concept is fine.

Patch is below for your inspection.

It occurs to me that Stephen's desire to pass lots of requests through
make_request all at once isn't a bad idea and could be done by simply
linking the io_heads together with b_reqnext.
This would require:
  1/ all callers of generic_make_request (there are 3) to initialise
     b_reqnext
  2/ all registered make_request_fn functions (there are again 3 I
     think)  to cope with following b_reqnext

It shouldn't be too hard to make the elevator code take advantage of
any ordering that it finds in the list.

I don't have a patch which does this.
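
As a sketch of that calling convention (the trivial form only -- the
interesting version would hand the whole chain to make_request_fn intact,
so the elevator can use the ordering; the function name is made up):

	void generic_make_request_chain(int rw, struct buffer_head *bh)
	{
		while (bh) {
			struct buffer_head *next = bh->b_reqnext;
			bh->b_reqnext = NULL;
			generic_make_request(rw, bh);
			bh = next;
		}
	}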

NeilBrown


--- ./include/linux/fs.h	2001/02/07 22:45:37	1.1
+++ ./include/linux/fs.h	2001/02/07 23:09:05
@@ -207,6 +207,7 @@
 #define BH_Protected	6	/* 1 if the buffer is protected */
 
 /*
+ * THIS COMMENT IS NO LONGER CORRECT.
  * Try to keep the most commonly used fields in single cache lines (16
  * bytes) to improve performance.  This ordering should be
  * particularly beneficial on 32-bit processors.
@@ -217,31 +218,43 @@
  * The second 16 bytes we use for lru buffer scans, as used by
  * sync_buffers() and refill_freelist().  -- sct
  */
+
+/* 
+ * io_head is all that is needed by device drivers.
+ */
+#define io_head_fields \
+	unsigned long b_state;		/* buffer state bitmap (see above) */	\
+	struct buffer_head *b_reqnext;	/* request queue */			\
+	unsigned short b_size;		/* block size */			\
+	kdev_t b_rdev;			/* Real device */			\
+	unsigned long b_rsector;	/* Real buffer location on disk */	\
+	char * b_data;			/* pointer to data block (512 byte) */	\
+	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */ \
+ 	void *b_private;		/* reserved for b_end_io */		\
+	struct page *b_page;		/* the page this bh is mapped to */	\
+     /* this line intentionally left blank */
+struct io_head {
+	io_head_fields
+};
+
+/* buffer_head adds all the stuff needed by the buffer cache */
 struct buffer_head {
-	/* First cache line: */
+	io_head_fields
+
 	struct buffer_head *b_next;	/* Hash queue list */
 	unsigned long b_blocknr;	/* block number */
-	unsigned short b_size;		/* block size */
 	unsigned short b_list;		/* List that this buffer appears */
 	kdev_t b_dev;			/* device (B_FREE = free) */
 
 	atomic_t b_count;		/* users using this block */
-	kdev_t b_rdev;			/* Real device */
-	unsigned long b_state;		/* buffer state bitmap (see above) */
 	unsigned long b_flushtime;	/* Time when (dirty) buffer should be written */
 
 	struct buffer_head *b_next_free;/* lru/free list linkage */
 	struct buffer_head *b_prev_free;/* doubly linked list of buffers */
 	struct buffer_head *b_this_page;/* circular list of buffers in one page */
-	struct buffer_head *b_reqnext;	/* request queue */
 
 	struct buffer_head **b_pprev;	/* doubly linked list of hash-queue */
-	char * b_data;			/* pointer to data block (512 byte) */
-	struct page *b_page;		/* the page this bh is mapped to */
-	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
- 	void *b_private;		/* reserved for b_end_io */
 
-	unsigned long b_rsector;	/* Real buffer location on disk */
 	wait_queue_head_t b_wait;
 
 	struct inode *	     b_inode;
--- ./drivers/md/raid5.c	2001/02/06 05:43:31	1.2
+++ ./drivers/md/raid5.c	2001/02/07 23:15:36
@@ -151,18 +151,16 @@
 
 	for (i=0; i<num; i++) {
 		struct page *page;
-		bh = kmalloc(sizeof(struct buffer_head), priority);
+		bh = kmalloc(sizeof(struct io_head), priority);
 		if (!bh)
 			return 1;
-		memset(bh, 0, sizeof (struct buffer_head));
-		init_waitqueue_head(&bh->b_wait);
+		memset(bh, 0, sizeof (struct io_head));
 		page = alloc_page(priority);
 		bh->b_data = page_address(page);
 		if (!bh->b_data) {
 			kfree(bh);
 			return 1;
 		}
-		atomic_set(&bh->b_count, 0);
 		bh->b_page = page;
 		sh->bh_cache[i] = bh;
 
@@ -412,7 +410,7 @@
 			spin_lock_irqsave(&conf->device_lock, flags);
 		}
 	} else {
-		md_error(mddev_to_kdev(conf->mddev), bh->b_dev);
+		md_error(mddev_to_kdev(conf->mddev), conf->disks[i].dev);
 		clear_bit(BH_Uptodate, &bh->b_state);
 	}
 	clear_bit(BH_Lock, &bh->b_state);
@@ -440,7 +438,7 @@
 
 	md_spin_lock_irqsave(&conf->device_lock, flags);
 	if (!uptodate)
-		md_error(mddev_to_kdev(conf->mddev), bh->b_dev);
+		md_error(mddev_to_kdev(conf->mddev), conf->disks[i].dev);
 	clear_bit(BH_Lock, &bh->b_state);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	__release_stripe(conf, sh);
@@ -456,12 +454,10 @@
 	unsigned long block = sh->sector / (sh->size >> 9);
 
 	init_buffer(bh, raid5_end_read_request, sh);
-	bh->b_dev       = conf->disks[i].dev;
 	bh->b_blocknr   = block;
 
 	bh->b_state	= (1 << BH_Req) | (1 << BH_Mapped);
 	bh->b_size	= sh->size;
-	bh->b_list	= BUF_LOCKED;
 	return bh;
 }
 
@@ -1085,15 +1081,14 @@
 			else
 				bh->b_end_io = raid5_end_write_request;
 			if (conf->disks[i].operational)
-				bh->b_dev = conf->disks[i].dev;
+				bh->b_rdev = conf->disks[i].dev;
 			else if (conf->spare && action[i] == WRITE+1)
-				bh->b_dev = conf->spare->dev;
+				bh->b_rdev = conf->spare->dev;
 			else skip=1;
 			if (!skip) {
 				PRINTK("for %ld schedule op %d on disc %d\n", sh->sector, action[i]-1, i);
 				atomic_inc(&sh->count);
-				bh->b_rdev = bh->b_dev;
-				bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
+				bh->b_rsector = sh->sector;
 				generic_make_request(action[i]-1, bh);
 			} else {
 				PRINTK("skip op %d on disc %d for sector %ld\n", action[i]-1, i, sh->sector);
@@ -1502,7 +1497,7 @@
 	}
 
 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
-		 conf->raid_disks * ((sizeof(struct buffer_head) + PAGE_SIZE))) / 1024;
+		 conf->raid_disks * ((sizeof(struct io_head) + PAGE_SIZE))) / 1024;
 	if (grow_stripes(conf, conf->max_nr_stripes, GFP_KERNEL)) {
 		printk(KERN_ERR "raid5: couldn't allocate %dkB for buffers\n", memory);
 		shrink_stripes(conf, conf->max_nr_stripes);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:14                                                 ` Linus Torvalds
@ 2001-02-08 11:21                                                   ` Andi Kleen
  2001-02-08 14:11                                                   ` Martin Dalecki
  1 sibling, 0 replies; 186+ messages in thread
From: Andi Kleen @ 2001-02-08 11:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

On Tue, Feb 06, 2001 at 10:14:21AM -0800, Linus Torvalds wrote:
> I will claim that you CANNOT merge at higher levels and get good
> performance.
> 
> Sure, you can do read-ahead, and try to get big merges that way at a high
> level. Good for you.
> 
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Hi,

I've tried to experimentally check this statement.

I instrumented a kernel with the following patch. It keeps a counter
for every merge between unrelated requests; requests count as
unrelated if they were allocated by different currents.
I did various tests and surprisingly I was not able to trigger a
single unrelated merge on my IDE system with various IO loads (dbench,
news expire, news sort, kernel compile, swapping ...)

So either my patch is wrong (if yes, what is wrong?), or they simply do not
happen in usual IO loads. I know that it has a few holes (like it doesn't
count unrelated merges that happen from the same process, or if a process
quits and another one gets its kernel stack and IO of both is merged it'll
be counted as a related merge), but if unrelated merges were relevant,
more should still show up, no?

My pet theory is that the page and buffer caches filter most unrelated merges
out. I haven't tried to use raw IO to avoid this problem, but I expect that
anything that does raw IO will do some intelligent IO scheduling on its
own anyways.

If anyone is interested: it would be useful to know whether other people
are able to trigger unrelated merges in real loads.
Here is a patch. Display statistics using:

(echo print unrelated_merge ; echo print related_merge) | gdb vmlinux /proc/kcore


--- linux/drivers/block/ll_rw_blk.c-REQSTAT	Tue Jan 30 13:33:25 2001
+++ linux/drivers/block/ll_rw_blk.c	Thu Feb  8 01:13:57 2001
@@ -31,6 +31,9 @@
 
 #include <linux/module.h>
 
+int unrelated_merge; 
+int related_merge;
+
 /*
  * MAC Floppy IWM hooks
  */
@@ -478,6 +481,7 @@
 		rq->rq_status = RQ_ACTIVE;
 		rq->special = NULL;
 		rq->q = q;
+		rq->originator = current;
 	}
 
 	return rq;
@@ -668,6 +672,11 @@
 	if (!q->merge_requests_fn(q, req, next, max_segments))
 		return;
 
+	if (next->originator != req->originator)
+		unrelated_merge++; 
+	else
+		related_merge++; 
+
 	q->elevator.elevator_merge_req_fn(req, next);
 	req->bhtail->b_reqnext = next->bh;
 	req->bhtail = next->bhtail;
--- linux/include/linux/blkdev.h-REQSTAT	Tue Jan 30 17:17:01 2001
+++ linux/include/linux/blkdev.h	Wed Feb  7 23:33:35 2001
@@ -45,6 +45,8 @@
 	struct buffer_head * bh;
 	struct buffer_head * bhtail;
 	request_queue_t *q;
+
+	struct task_struct *originator;
 };
 
 #include <linux/elevator.h>




-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 13:22                                                                                     ` Stephen C. Tweedie
@ 2001-02-08 12:03                                                                                       ` Marcelo Tosatti
  2001-02-08 15:46                                                                                         ` Mikulas Patocka
  2001-02-08 18:09                                                                                         ` Linus Torvalds
  0 siblings, 2 replies; 186+ messages in thread
From: Marcelo Tosatti @ 2001-02-08 12:03 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:

<snip>

> > How do you write a high-performance ftp server without threads if select
> > on a regular file always returns "ready"?
> 
> Select can work if the access is sequential, but async IO is a more
> general solution.

Even async IO (ie aio_read/aio_write) should block on the request queue if
it's full, in Linus's mind.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 23:15                                                                                   ` Pavel Machek
@ 2001-02-08 13:22                                                                                     ` Stephen C. Tweedie
  2001-02-08 12:03                                                                                       ` Marcelo Tosatti
  2001-02-08 14:52                                                                                     ` Mikulas Patocka
  1 sibling, 1 reply; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-08 13:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Hi,

On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote:
> 
> > EAGAIN is _not_ a valid return value for block devices or for regular
> > files. And in fact it _cannot_ be, because select() is defined to always
> > return 1 on them - so if a write() were to return EAGAIN, user space would
> > have nothing to wait on. Busy waiting is evil.
> 
> So you consider the inability to select() on regular files a _feature_?

Select might make some sort of sense for sequential access to files,
and for random access via lseek/read but it makes no sense at all for
pread and pwrite where select() has no idea _which_ part of the file
the user is going to want to access next.

> How do you write a high-performance ftp server without threads if select
> on a regular file always returns "ready"?

Select can work if the access is sequential, but async IO is a more
general solution.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 15:06                                                                                       ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Ben LaHaise
@ 2001-02-08 13:44                                                                                         ` Marcelo Tosatti
  2001-02-08 13:45                                                                                           ` Marcelo Tosatti
  0 siblings, 1 reply; 186+ messages in thread
From: Marcelo Tosatti @ 2001-02-08 13:44 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Linus Torvalds, Jens Axboe, Manfred Spraul, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar


On Thu, 8 Feb 2001, Ben LaHaise wrote:

<snip>

> > (besides, latency would suck. I bet you're better off waiting for the
> > requests if they are all used up. It takes too long to get deep into the
> > kernel from user space, and you cannot use the exclusive waiters with its
> > anti-herd behaviour etc).
> 
> > Ah, but no.  In fact for some things, the wait queue extensions I'm using
> > will be more efficient as things like test_and_set_bit for obtaining a
> > lock get executed without waking up a task.

The latency argument is somewhat bogus because there is no problem with
checking the request queue in the aio syscalls and simply failing if it's full.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 13:44                                                                                         ` Marcelo Tosatti
@ 2001-02-08 13:45                                                                                           ` Marcelo Tosatti
  0 siblings, 0 replies; 186+ messages in thread
From: Marcelo Tosatti @ 2001-02-08 13:45 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Linus Torvalds, Jens Axboe, Manfred Spraul, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar


On Thu, 8 Feb 2001, Marcelo Tosatti wrote:

> 
> On Thu, 8 Feb 2001, Ben LaHaise wrote:
> 
> <snip>
> 
> > > (besides, latency would suck. I bet you're better off waiting for the
> > > requests if they are all used up. It takes too long to get deep into the
> > > kernel from user space, and you cannot use the exclusive waiters with its
> > > anti-herd behaviour etc).
> > 
> > Ah, but no.  In fact for some things, the wait queue extensions I'm using
> > will be more efficient as things like test_and_set_bit for obtaining a
> > lock get executed without waking up a task.
> 
> The latency argument is somewhat bogus because there is no problem with
> checking the request queue in the aio syscalls and simply failing if it's full.

Ugh, I forgot to say: check the request queue before doing any filesystem
work.
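
That is, something like this at the very top of the submission path (a
sketch; struct aio_req, queue_is_full() and do_aio_write() are all made-up
names):

	long sys_aio_write(struct aio_req *req)
	{
		request_queue_t *q = blk_get_queue(req->dev);
		if (queue_is_full(q))
			return -EAGAIN;		/* fail before any fs work */
		return do_aio_write(req);
	}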

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
  2001-02-07 23:17                                                                                       ` select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] Pavel Machek
@ 2001-02-08 13:57                                                                                         ` Ben LaHaise
  2001-02-08 17:52                                                                                         ` Linus Torvalds
  1 sibling, 0 replies; 186+ messages in thread
From: Ben LaHaise @ 2001-02-08 13:57 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Marcelo Tosatti, Jens Axboe, Manfred Spraul,
	Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Thu, 8 Feb 2001, Pavel Machek wrote:

> Hi!
>
> > > It's arguing against making a smart application block on the disk while it's
> > > able to use the CPU for other work.
> >
> > There are currently no other alternatives in user space. You'd have to
> > create whole new interfaces for aio_read/write, and ways for the kernel to
> > inform user space that "now you can re-try submitting your IO".
>
> Why is the current select() interface not good enough?

Think of random disk io scattered across the disk.  Think about aio_write
providing a means to perform zero-copy io without needing to resort to
playing mm tricks like write-protecting pages in the user's page tables.  It's
also a means of dealing efficiently with thousands of outstanding
requests for network io.  Using a select-based interface is going to be an
ugly kludge that still has all the overhead of select/poll.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 15:46                                                                                         ` Mikulas Patocka
@ 2001-02-08 14:05                                                                                           ` Marcelo Tosatti
  2001-02-08 16:11                                                                                             ` Mikulas Patocka
  2001-02-08 15:55                                                                                           ` Jens Axboe
  1 sibling, 1 reply; 186+ messages in thread
From: Marcelo Tosatti @ 2001-02-08 14:05 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Mikulas Patocka wrote:

> > > > How do you write a high-performance ftp server without threads if select
> > > > on a regular file always returns "ready"?
> > > 
> > > Select can work if the access is sequential, but async IO is a more
> > > general solution.
> > 
> > Even async IO (ie aio_read/aio_write) should block on the request queue if
> > it's full, in Linus's mind.
> 
> This is not a problem (you can create a queue big enough to handle the load).

The point is that you want to be able to not block if the queue is full (and
the queue size has nothing to do with that).

> The problem is that aio_read and aio_write are pretty useless for an ftp or
> http server. You need aio_open.

Could you explain this? 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:14                                                 ` Linus Torvalds
  2001-02-08 11:21                                                   ` Andi Kleen
@ 2001-02-08 14:11                                                   ` Martin Dalecki
  2001-02-08 17:59                                                     ` Linus Torvalds
  1 sibling, 1 reply; 186+ messages in thread
From: Martin Dalecki @ 2001-02-08 14:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

Linus Torvalds wrote:
> 
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> >
> > > The whole point of the post was that it is merging, not splitting,
> > > which is troublesome.  How are you going to merge requests without
> > > having chains of scatter-gather entities each with their own
> > > completion callbacks?
> >
> > Let me just emphasize what Stephen is pointing out: if requests are
> > properly merged at higher layers, then merging is neither required nor
> > desired.
> 
> I will claim that you CANNOT merge at higher levels and get good
> performance.
> 
> Sure, you can do read-ahead, and try to get big merges that way at a high
> level. Good for you.
> 
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Merging is a hardware-tied optimization, so it should happen where you
really have full "knowledge" and control of the hardware -> namely the
device driver.

> Qutie frankly, this whole discussion sounds worthless. We have solved this
> problem already: it's called a "buffer head". Deceptively simple at higher
> levels, and lower levels can easily merge them together into chains and do
> fancy scatter-gather structures of them that can be dynamically extended
> at any time.
> 
> The buffer heads together with "struct request" do a hell of a lot more
> than just a simple scatter-gather: it's able to create ordered lists of
> independent sg-events, together with full call-backs etc. They are
> low-cost, fairly efficient, and they have worked beautifully for years.
> 
> The fact that kiobufs can't be made to do the same thing is somebody elses
> problem. I _know_ that merging has to happen late, and if others are
> hitting their heads against this issue until they turn silly, then that's
> their problem. You'll eventually learn, or you'll hit your heads into a
> pulp.

Amen.

-- 
- phone: +49 214 8656 283
- job:   STOCK-WORLD Media AG, LEV .de (MY OPPINNIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 16:11                                                                                             ` Mikulas Patocka
@ 2001-02-08 14:44                                                                                               ` Marcelo Tosatti
  2001-02-08 16:57                                                                                               ` Rik van Riel
  1 sibling, 0 replies; 186+ messages in thread
From: Marcelo Tosatti @ 2001-02-08 14:44 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Mikulas Patocka wrote:

> > > The problem is that aio_read and aio_write are pretty useless for ftp or
> > > http server. You need aio_open.
> > 
> > Could you explain this? 
> 
> If the server is sending many small files, the disk spends a huge amount of
> time walking the directory tree and seeking to inodes. Maybe opening the
> file is even slower than reading it - reads are usually sequential but open
> needs to seek to a few areas of the disk.
> 
> And if you have a one-threaded server using open, close, aio_read and
> aio_write, you actually block the whole server while it is opening a
> single file. This is not how async io is supposed to work.

Ok, but this is not the point of the discussion.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 23:15                                                                                   ` Pavel Machek
  2001-02-08 13:22                                                                                     ` Stephen C. Tweedie
@ 2001-02-08 14:52                                                                                     ` Mikulas Patocka
  2001-02-08 19:50                                                                                       ` Stephen C. Tweedie
  2001-02-11 21:30                                                                                       ` Pavel Machek
  1 sibling, 2 replies; 186+ messages in thread
From: Mikulas Patocka @ 2001-02-08 14:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Hi!

> So you consider the inability to select() on regular files a _feature_?

select on files is unimplementable. You can't do background file IO the
same way you do background receiving of packets on a socket. The filesystem
is synchronous. It can block.

> It can be a pretty serious problem with slow block devices
> (floppy). It also hurts when you are trying to do high-performance
> reads/writes. [I know it hurt in the userspace Sherlock search engine --
> a kind of small AltaVista.]
> 
> How do you write a high-performance ftp server without threads if select
> on a regular file always returns "ready"?

No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

You can emulate asynchronous IO with kernel threads like FreeBSD and some
commercial Unices do, but you still need as many (possibly kernel) threads
as requests you are servicing.
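
That emulation is basically this, in user space (a sketch; struct req and
notify_completion() are placeholders for whatever the library uses):

	/* one worker thread per in-flight request eats the blocking */
	void *aio_worker(void *arg)
	{
		struct req *r = arg;
		r->result = pread(r->fd, r->buf, r->len, r->off);
		notify_completion(r);	/* pipe write, signal, ... */
		return NULL;
	}

	/* submission: pthread_create(&tid, NULL, aio_worker, r); */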

> > Remember: in the end you HAVE to wait somewhere. You're always going to be
> > able to generate data faster than the disk can take it. SOMETHING
> 
> Userspace wants to _know_ when to stop. It asks politely using
> "select()".

And how do you want to wait for other select()ed events if you are blocked
in wait_for_buffer in get_block (former bmap)?

Making real async IO would require rewriting all filesystems and the whole
VFS _from_scratch_. It won't happen.

Mikulas

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 23:26                                                                                     ` Linus Torvalds
  2001-02-07 23:17                                                                                       ` select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] Pavel Machek
@ 2001-02-08 15:06                                                                                       ` Ben LaHaise
  2001-02-08 13:44                                                                                         ` Marcelo Tosatti
  1 sibling, 1 reply; 186+ messages in thread
From: Ben LaHaise @ 2001-02-08 15:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marcelo Tosatti, Jens Axboe, Manfred Spraul, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Linus Torvalds wrote:

> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel to
> inform user space that "now you can re-try submitting your IO".
>
> Could be done. But that's a big thing.

Has been done.  Still needs some work, but it works pretty well.  As for
throttling io, having ios submitted does not have to correspond to them
being queued in the lower layers.  The main issue with async io is
limiting the amount of pinned memory for ios; if that's taken care of, I
don't think it matters how many ios are in flight.

> > An application which sets non blocking behavior and busy waits for a
> > request (which seems to be your argument) is just stupid, of course.
>
> Tell me what else it could do at some point? You need something like
> select() to wait on it. There are no such interfaces right now...
>
> (besides, latency would suck. I bet you're better off waiting for the
> requests if they are all used up. It takes too long to get deep into the
> kernel from user space, and you cannot use the exclusive waiters with its
> anti-herd behaviour etc).

Ah, but no.  In fact for some things, the wait queue extensions I'm using
will be more efficient as things like test_and_set_bit for obtaining a
lock get executed without waking up a task.
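
Roughly the mechanism being alluded to (a sketch of the idea only -- stock
2.4 wait queues have no per-waiter callback, which is what the extensions
add; all names here are hypothetical):

	struct locked_wait {
		wait_queue_t	wait;
		unsigned long	*lock_word;
	};

	/* run from the wakeup path: take the lock on behalf of the
	 * sleeper, and only wake the task if it actually got it */
	static int wake_try_lock(struct locked_wait *lw)
	{
		if (test_and_set_bit(0, lw->lock_word))
			return 0;			/* still busy: stay asleep */
		wake_up_process(lw->wait.task);
		return 1;
	}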

> Simple rule: if you want to optimize concurrency and avoid waiting - use
> several processes or threads instead. At which point you can get real work
> done on multiple CPU's, instead of worrying about what happens when you
> have to wait on the disk.

There do exist plenty of cases where threads are not efficient enough.
Just the stack overhead alone with 8000 threads makes things really suck.
Event based io completion means that server processes don't need to have
the overhead of select/poll.  Add in NT style completion ports for waking
up the right number of worker threads off of the completion queue, and

That said, I don't expect all devices to support async io.  But given
support for files, raw and sockets all the important cases are covered.
The remainder can be supported via userspace helpers.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 12:03                                                                                       ` Marcelo Tosatti
@ 2001-02-08 15:46                                                                                         ` Mikulas Patocka
  2001-02-08 14:05                                                                                           ` Marcelo Tosatti
  2001-02-08 15:55                                                                                           ` Jens Axboe
  2001-02-08 18:09                                                                                         ` Linus Torvalds
  1 sibling, 2 replies; 186+ messages in thread
From: Mikulas Patocka @ 2001-02-08 15:46 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

> > > How do you write a high-performance ftp server without threads if select
> > > on a regular file always returns "ready"?
> > 
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
> 
> Even async IO (ie aio_read/aio_write) should block on the request queue if
> it's full, in Linus's mind.

This is not a problem (you can create a queue big enough to handle the load).

The problem is that aio_read and aio_write are pretty useless for an ftp or
http server. You need aio_open.

Mikulas

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 15:46                                                                                         ` Mikulas Patocka
  2001-02-08 14:05                                                                                           ` Marcelo Tosatti
@ 2001-02-08 15:55                                                                                           ` Jens Axboe
  1 sibling, 0 replies; 186+ messages in thread
From: Jens Axboe @ 2001-02-08 15:55 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Marcelo Tosatti, Stephen C. Tweedie, Pavel Machek,
	Linus Torvalds, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Thu, Feb 08 2001, Mikulas Patocka wrote:
> > Even async IO (ie aio_read/aio_write) should block on the request queue if
> > it's full, in Linus's mind.
> 
> This is not a problem (you can create a queue big enough to handle the load).

Well in theory, but in practice this isn't a very good idea. At some
point throwing yet more requests in there doesn't make a whole lot
of sense. You are basically _always_ going to be able to empty the free
request list by dirtying lots of data.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 14:05                                                                                           ` Marcelo Tosatti
@ 2001-02-08 16:11                                                                                             ` Mikulas Patocka
  2001-02-08 14:44                                                                                               ` Marcelo Tosatti
  2001-02-08 16:57                                                                                               ` Rik van Riel
  0 siblings, 2 replies; 186+ messages in thread
From: Mikulas Patocka @ 2001-02-08 16:11 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

> > The problem is that aio_read and aio_write are pretty useless for an ftp or
> > http server. You need aio_open.
> 
> Could you explain this? 

If the server is sending many small files, the disk spends a huge amount of
time walking the directory tree and seeking to inodes. Maybe opening the
file is even slower than reading it - reads are usually sequential but open
needs to seek to a few areas of the disk.

And if you have a one-threaded server using open, close, aio_read and
aio_write, you actually block the whole server while it is opening a
single file. This is not how async io is supposed to work.
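
In other words, what such a server wants is something shaped like this
(purely hypothetical -- POSIX defines no aio_open; it just mirrors the
aio_read/aio_write style):

	struct aiocb cb;
	memset(&cb, 0, sizeof(cb));
	cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
	cb.aio_sigevent.sigev_signo = SIGIO;
	aio_open(&cb, "pub/file.tar.gz", O_RDONLY);	/* hypothetical call */
	/* keep serving other connections; the completion delivers the fd */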

Mikulas


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 16:11                                                                                             ` Mikulas Patocka
  2001-02-08 14:44                                                                                               ` Marcelo Tosatti
@ 2001-02-08 16:57                                                                                               ` Rik van Riel
  2001-02-08 17:13                                                                                                 ` James Sutherland
  2001-02-08 18:38                                                                                                 ` Linus Torvalds
  1 sibling, 2 replies; 186+ messages in thread
From: Rik van Riel @ 2001-02-08 16:57 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Marcelo Tosatti, Stephen C. Tweedie, Pavel Machek,
	Linus Torvalds, Jens Axboe, Manfred Spraul, Ben LaHaise,
	Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Thu, 8 Feb 2001, Mikulas Patocka wrote:

> > > You need aio_open.
> > Could you explain this? 
> 
> > If the server is sending many small files, the disk spends a huge
> > amount of time walking the directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it

Not if you have a big enough inode_cache and dentry_cache.

OTOH ... if you have enough memory the whole async IO argument
is moot anyway because all your files will be in memory too.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 16:57                                                                                               ` Rik van Riel
@ 2001-02-08 17:13                                                                                                 ` James Sutherland
  2001-02-08 18:38                                                                                                 ` Linus Torvalds
  1 sibling, 0 replies; 186+ messages in thread
From: James Sutherland @ 2001-02-08 17:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mikulas Patocka, Marcelo Tosatti, Stephen C. Tweedie,
	Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Thu, 8 Feb 2001, Rik van Riel wrote:

> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> 
> > > > You need aio_open.
> > > Could you explain this? 
> > 
> > If the server is sending many small files, the disk spends a huge
> > amount of time walking the directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it
> 
> Not if you have a big enough inode_cache and dentry_cache.

Eh? However big the caches are, you can still get misses which will
require multiple (blocking) disk accesses to handle...

> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Only for cache hits. If you're doing a Mindcraft benchmark or something
with everything in RAM, you're fine - for real world servers, that's not
really an option ;-)

Really, you want/need cache MISSES to be handled without blocking. However
big the caches, short of running EVERYTHING from a ramdisk, these will
still happen!


James.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
  2001-02-07 23:17                                                                                       ` select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] Pavel Machek
  2001-02-08 13:57                                                                                         ` Ben LaHaise
@ 2001-02-08 17:52                                                                                         ` Linus Torvalds
  1 sibling, 0 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-08 17:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Marcelo Tosatti, Jens Axboe, Manfred Spraul, Ben LaHaise,
	Ingo Molnar, Stephen C. Tweedie, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Pavel Machek wrote:
> > 
> > There are currently no other alternatives in user space. You'd have to
> > create whole new interfaces for aio_read/write, and ways for the kernel to
> > inform user space that "now you can re-try submitting your IO".
> 
> Why is the current select() interface not good enough?

Ehh..

One major reason is rather simple: disk request wait times tend to be on
the order of sub-millisecond (remember: if we run out of requests, that
means that we have 256 of them already queued, which means that it's very
likely that several of them will be freed up in the very near future due
to completion).

The fact is, that if you start doing write/select loops, you're going to
waste a _large_ portion of your CPU speed on it.  Especially considering
that the select() call would have to go all the way down to the ll_rw_blk
layer to figure out whether there are more requests etc.

So there are (a) historical reasons that say that regular files can never
wait and EAGAIN is not an acceptable return value, and (b) practical
reasons why such an interface would be a bad one.

There are better ways to do it. Either using threads, or just having a
better aio-like interface.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 14:11                                                   ` Martin Dalecki
@ 2001-02-08 17:59                                                     ` Linus Torvalds
  0 siblings, 0 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-08 17:59 UTC (permalink / raw)
  To: Martin Dalecki
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Martin Dalecki wrote:
> > 
> > But you'll have a bitch of a time trying to merge multiple
> > threads/processes reading from the same area on disk at roughly the same
> > time. Your higher levels won't even _know_ that there is merging to be
> > done until the IO requests hit the wall in waiting for the disk.
> 
> Merging is a hardware-tied optimization, so it should happen where you
> really have full "knowledge" and control of the hardware -> namely the
> device driver.

Or, in many cases, the device itself. There are valid reasons for not
doing merging in the driver, but they all tend to boil down to "even lower
layers can do a better job of it". They basically _never_ boil down to
"upper layers already did it for us".

That said, there tend to be advantages to doing "appropriate" clustering
at each level. Upper layers can (and do) use read-ahead to help the lower
levels. The write-out can (and currently does not) try to sort the
requests for better elevator behaviour.

The driver level can (and does) further cluster the requests - even if the
low-level device does a perfect job of ordering and merging on its own,
it's usually advantageous to have fewer (and bigger) commands in-flight in
order to have fewer completion interrupts and less command traffic on the
bus.

So it's obviously not entirely black-and-white. Upper layers can help, but
it's a mistake to think that they should "do the work".
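
To illustrate the clustering point: the heart of a back-merge test is just
a contiguity check. The names below are invented for the sketch - this is
not the actual ll_rw_blk code:

	struct io_request {
		unsigned long	sector;		/* start sector on disk */
		unsigned long	nr_sectors;	/* length in sectors */
	};

	/* The new request starts exactly where the queued one ends, so
	 * the two are contiguous on disk and can go out as one bigger
	 * command: fewer completion interrupts, less bus traffic. */
	static int can_back_merge(struct io_request *req,
				  struct io_request *new)
	{
		return req->sector + req->nr_sectors == new->sector;
	}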

(Note: a lot of people seem to think that "layering" means that the
complexity is in upper layers, and that lower layers should be simple and
"stupid". This is not true. A well-balanced layering would have all layers
doing potentially equally complex things - but the complexity should be
_independent_. Complex interactions are bad. But it's also bad to think
that lower levels shouldn't be allowed to optimize because they should be
"simple".)

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 12:03                                                                                       ` Marcelo Tosatti
  2001-02-08 15:46                                                                                         ` Mikulas Patocka
@ 2001-02-08 18:09                                                                                         ` Linus Torvalds
  1 sibling, 0 replies; 186+ messages in thread
From: Linus Torvalds @ 2001-02-08 18:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, Pavel Machek, Jens Axboe, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
> 
> On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
> 
> <snip>
> 
> > > How do you write high-performance ftp server without threads if select
> > > on regular file always returns "ready"?
> > 
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
> 
> Even async IO (i.e. aio_read/aio_write) should block on the request queue
> if it's full, in Linus's mind.

Not necessarily. I said that "READA/WRITEA" are only worth exporting
inside the kernel - because the latencies and complexities are low-level
enough that it should not be exported to user space as such.

But I could imagine a kernel aio package that does the equivalent of

	/* First request goes out with WRITE, and may block: this
	 * guarantees at least one request is always in flight. */
	bh->b_end_io = completion_handler;
	generic_make_request(WRITE, bh);	/* this may block */
	bh = bh->b_next;

	/* Now, fill the queue up as much as we can.. */
	current->state = TASK_INTERRUPTIBLE;
	while (more data to be written) {
		/* WRITEA: fail instead of blocking if the queue is full */
		if (generic_make_request(WRITEA, bh) < 0)
			break;
		bh = bh->b_next;
	}

	return;

and then you make the _completion handler_ thing continue to feed more
requests. Yes, you may block at some points (because you need to always
have at least _one_ request in-flight in order to have the state machine
active), but you can basically try to avoid blocking more than necessary.
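
A hedged continuation of that sketch, staying within the same hypothetical
interface (the aio_ctx structure and the use of b_private for it are
invented here, as is WRITEA failing with an error on a full queue):

	struct aio_ctx {
		struct buffer_head *next;	/* next bh to submit */
	};

	static void completion_handler(struct buffer_head *bh, int uptodate)
	{
		struct aio_ctx *ctx = bh->b_private;

		mark_buffer_uptodate(bh, uptodate);
		unlock_buffer(bh);

		/* Feed as many pending buffers as the queue will take;
		 * stop at the first WRITEA refusal and resume on the
		 * next completion, so the state machine never idles. */
		while (ctx->next) {
			struct buffer_head *next = ctx->next;

			if (generic_make_request(WRITEA, next) < 0)
				break;
			ctx->next = next->b_next;
		}
	}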

But do you see why the above can't be done from user space? It requires
that the completion handler (which runs in an interrupt context) be able
to continue to feed requests and keep the queue filled. If you don't do
that, you'll never have good throughput, because it takes too long to send
signals, re-schedule or whatever to user mode.

And do you see how it has to block _sometimes_? If people do hundreds of
AIO requests, we can't let memory just fill up with pending writes..

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 16:57                                                                                               ` Rik van Riel
  2001-02-08 17:13                                                                                                 ` James Sutherland
@ 2001-02-08 18:38                                                                                                 ` Linus Torvalds
  2001-02-09 12:17                                                                                                   ` Martin Dalecki
  1 sibling, 1 reply; 186+ messages in thread
From: Linus Torvalds @ 2001-02-08 18:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mikulas Patocka, Marcelo Tosatti, Stephen C. Tweedie,
	Pavel Machek, Jens Axboe, Manfred Spraul, Ben LaHaise,
	Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Rik van Riel wrote:

> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> 
> > > > You need aio_open.
> > > Could you explain this? 
> > 
> > If the server is sending many small files, the disk spends a huge
> > amount of time walking the directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it.
> 
> Not if you have a big enough inode_cache and dentry_cache.
> 
> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Note that this _is_ an important point.

You should never _ever_ think about pure IO speed as the most important
thing. Even if you get absolutely perfect IO streaming off the fastest
disk you can find, I will beat you every single time with a cached setup
that doesn't need to do IO at all.

90% of the VFS layer is all about caching, and trying to avoid IO. Of the
rest, about 9% is about trying to avoid even calling down to the low-level
filesystem, because it's faster if we can handle it at a high level
without any need to even worry about issues like physical disk addresses.
Even if those addresses are cached.

The remaining 1% is about actually getting the IO done. At that point we
end up throwing our hands in the air and saying "ok, this will be slow".

So if you design your system for disk load, you are missing a big portion
of the picture.

There are cases where IO really matters - the most notable one being
databases, certainly _not_ web or ftp servers. For web- or ftp-servers you
buy more memory if you want high performance, and you tend to be limited
by the network speed anyway (if you have multiple gigabit networks and
network speed isn't an issue, then I can also tell you that buying a few
gigabytes of RAM isn't an issue either, because you are obviously working for
something like the DoD and have very little regard for the cost of the
thing ;)

For databases (and for file servers that you want to be robust over a
crash), IO throughput is an issue mainly because you need to put the damn
requests in stable memory somewhere. Which tends to mean that _write_
speed is what really matters, because the reads you can still try to cache
as efficiently as humanly possible (and the issue of database design then
turns into trying to find every single piece of locality you can, so that
the read caching works as well as possible).

Short and sweet: "aio_open()" is basically never supposed to be an issue.
If it is, you've misdesigned something, or you're trying too damn hard to
single-thread everything (and "hiding" the threading that _does_ happen by
just calling it "AIO" instead - lying to yourself, in short).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 14:52                                                                                     ` Mikulas Patocka
@ 2001-02-08 19:50                                                                                       ` Stephen C. Tweedie
  2001-02-11 21:30                                                                                       ` Pavel Machek
  1 sibling, 0 replies; 186+ messages in thread
From: Stephen C. Tweedie @ 2001-02-08 19:50 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Pavel Machek, Linus Torvalds, Jens Axboe, Marcelo Tosatti,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Hi,

On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote:
> 
> > How do you write high-performance ftp server without threads if select
> > on regular file always returns "ready"?
> 
> No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

Ahh, but even VMS SYS$QIO is synchronous when doing opens, allocating
the IO request packets, and mapping file locations to disk blocks.
Only the data IO is ever async (and Ben's async IO stuff for Linux
provides that too).

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 18:38                                                                                                 ` Linus Torvalds
@ 2001-02-09 12:17                                                                                                   ` Martin Dalecki
  0 siblings, 0 replies; 186+ messages in thread
From: Martin Dalecki @ 2001-02-09 12:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Mikulas Patocka, Marcelo Tosatti,
	Stephen C. Tweedie, Pavel Machek, Jens Axboe, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Linus Torvalds wrote:
> 
> On Thu, 8 Feb 2001, Rik van Riel wrote:
> 
> > On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> >
> > > > > You need aio_open.
> > > > Could you explain this?
> > >
> > > If the server is sending many small files, the disk spends a huge
> > > amount of time walking the directory tree and seeking to inodes. Maybe
> > > opening the file is even slower than reading it.
> >
> > Not if you have a big enough inode_cache and dentry_cache.
> >
> > OTOH ... if you have enough memory the whole async IO argument
> > is moot anyway because all your files will be in memory too.
> 
> Note that this _is_ an important point.
> 
> You should never _ever_ think about pure IO speed as the most important
> thing. Even if you get absolutely perfect IO streaming off the fastest
> disk you can find, I will beat you every single time with a cached setup
> that doesn't need to do IO at all.
> 
> 90% of the VFS layer is all about caching, and trying to avoid IO. Of the
> rest, about 9% is about trying to avoid even calling down to the low-level
> filesystem, because it's faster if we can handle it at a high level
> without any need to even worry about issues like physical disk addresses.
> Even if those addresses are cached.
> 
> The remaining 1% is about actually getting the IO done. At that point we
> end up throwing our hands in the air and saying "ok, this will be slow".
> 
> So if you design your system for disk load, you are missing a big portion
> of the picture.
> 
> There are cases where IO really matters - the most notable one being
> databases, certainly _not_ web or ftp servers. For web- or ftp-servers you
> buy more memory if you want high performance, and you tend to be limited
> by the network speed anyway (if you have multiple gigabit networks and
> network speed isn't an issue, then I can also tell you that buying a few
> gigabytes of RAM isn't an issue either, because you are obviously working for
> something like the DoD and have very little regard for the cost of the
> thing ;)
> 
> For databases (and for file servers that you want to be robust over a
> crash), IO throughput is an issue mainly because you need to put the damn
> requests in stable memory somewhere. Which tends to mean that _write_
> speed is what really matters, because the reads you can still try to cache
> as efficiently as humanly possible (and the issue of database design then
> turns into trying to find every single piece of locality you can, so that
> the read caching works as well as possible).
> 
> Short and sweet: "aio_open()" is basically never supposed to be an issue.
> If it is, you've misdesigned something, or you're trying too damn hard to
> single-thread everything (and "hiding" the threading that _does_ happen by
> just calling it "AIO" instead - lying to yourself, in short).

Right - I agree with you that an AIO design is basically hiding an
inherently multi-threaded program flow. This argument is indeed very
catchy. And looking at it from another angle, one will see that most AIO
designs date from times when multi-threading in applications wasn't as
common as it is now. Most prominently, coprocesses in a shell come to my
mind as a very good example of how to handle AIO (sort of)...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 14:52                                                                                     ` Mikulas Patocka
  2001-02-08 19:50                                                                                       ` Stephen C. Tweedie
@ 2001-02-11 21:30                                                                                       ` Pavel Machek
  1 sibling, 0 replies; 186+ messages in thread
From: Pavel Machek @ 2001-02-11 21:30 UTC (permalink / raw)
  To: Mikulas Patocka, Pavel Machek
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Hi!

> > So you consider the inability to select() on regular files a _feature_?
> 
> select on files is unimplementable. You can't do background file IO the
> same way you do background receiving of packets on a socket. The filesystem
> is synchronous. It can block.

You can use helper threads if the VFS layer is not able to handle
background IO. Then we can do it right in linux-4.4.
								Pavel

-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:40                                                                         ` Linus Torvalds
@ 2001-02-12 10:07                                                                           ` Jamie Lokier
  0 siblings, 0 replies; 186+ messages in thread
From: Jamie Lokier @ 2001-02-12 10:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Linus Torvalds wrote:
> Absolutely. This is exactly what I mean by saying that low-level drivers
> may not actually be able to handle new cases that they've never been asked
> to do before - they just never saw anything like a 64kB request before or
> something that crossed its own alignment.
> 
> But the _higher_ levels are there. And there's absolutely nothing in the
> design that is a real problem. But there's no question that you might need
> to fix up more than one or two low-level drivers.
> 
> (The only drivers I know better are the IDE ones, and as far as I can tell
> they'd have no trouble at all with any of this. Most other normal drivers
> are likely to be in this same situation. But because I've not had a reason
> to test, I certainly won't guarantee even that).

PCI has dma_mask, which distinguishes different device capabilities.
This nice interface handles 64-bit capable devices, 32-bit ones, ISA
limitations (the old 16MB limit) and some other strange devices.

This mask appears in block devices one way or another so that bounce
buffers are used for high addresses.

How about a mask for block devices which indicates the kinds of
alignment and lengths that the driver can handle?  For old drivers that
can't be thoroughly tested, we assume the worst.  Some devices have
hardware limitations.  Newer, tested drivers can relax the limits.

It's probably not difficult to say, "this 64k request can't be handled,
so split it into 1k requests". It integrates naturally with the
decision to use bounce buffers -- alignment restrictions cause copying
just as high addresses cause copying.
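
A rough sketch of what such a mask might look like - every name below is
invented for illustration; nothing like this exists in the block layer
today:

	struct blk_restrictions {
		u64		dma_mask;	/* highest reachable bus address */
		unsigned int	max_seg_len;	/* longest segment accepted */
		unsigned int	align_mask;	/* required alignment - 1 */
	};

	/* Worst-case defaults for old drivers that can't be thoroughly
	 * tested: 32-bit addressing, 1k segments, 512-byte alignment. */
	#define BLK_RESTRICT_LEGACY	{ 0xffffffffULL, 1024, 511 }

	/* The split/bounce decision then mirrors the existing
	 * high-address bounce buffer test. */
	static int needs_split(struct blk_restrictions *r,
			       unsigned long addr, unsigned int len)
	{
		return (addr & r->align_mask) != 0 || len > r->max_seg_len;
	}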

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
@ 2001-02-12 14:56 bsuparna
  0 siblings, 0 replies; 186+ messages in thread
From: bsuparna @ 2001-02-12 14:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Dalecki, Ben LaHaise, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel,
	Ingo Molnar


Going through all the discussions once again, and trying to look at this
from the point of view of just the basic requirements for data structures
and mechanisms that they imply:

1. We should have a data structure that represents a memory chain, which
may not be contiguous in physical memory, and which can be passed down as a
single unit all the way through to the lowest-level drivers
     - e.g. for direct i/o to/from a contiguous virtual address range in
user space (without any intermediate copies)

(Networking and block i/o may require different optimizations in the
design of such a data structure, due to differences in the kinds of
patterns expected, as is apparent from the zero-copy networking fragments
vs. the raw i/o kiobuf/kiovec patches. There are situations where such a
data structure may be passed between subsystems, as in the i2o example.)

This data structure could be part of an I/O container.

2. I/O containers may get split or merged as they pass through various
layers --- so any completion mechanism and i/o container design should be
able to account for both cases. At any point, a request could be
     - a collection of several higher level requests,
          or
     - one among several sub-requests of a single higher level
request.
(Just as appropriate "clustering" could happen at each level, appropriate
"splitting" may also take place depending on the situation. It may make
sense to delay splitting as far down the chain as possible in many
situations where the higher level is only interested in the i/o in its
entirety and not in partial completion.)
When caching/buffers are involved, the sub-requests of a single higher
level request may sometimes have individual completion requirements (even
when no merges were involved), because the sub-request buffers may be used
to service other requests alongside. With raw i/o that might not be the case.

3. It is desirable that layers which process the requests along the way
without splitting/merging be able to pass along the same I/O container
without any duplication or cloning, and intercept async i/o completions
for post-processing.

4. (Optional) It would be nice if different kinds of I/O containers or
buffer structures could be used at different levels, without having
explicit linkage fields (like bh --> page, for example), and in a way that
intermediate drivers or layers can work with transparently.

3 & 4 are more layering-related items, which get a little specific, but do
1 and 2 cover the general things we are looking for? A sketch of what 1
and 2 might imply follows.
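
Rough and purely illustrative - every name below is invented, and no such
structure exists in the kernel today:

	struct mem_frag {
		struct page	*page;
		unsigned int	offset;
		unsigned int	len;
	};

	struct io_container {
		struct mem_frag		*frags;		/* the memory chain (1) */
		int			nr_frags;
		struct io_container	*parent;	/* set on a split-off child (2) */
		atomic_t		pending;	/* outstanding sub-requests */
		void			(*end_io)(struct io_container *, int err);
		void			*private;	/* for completion interception (3) */
	};

	/* Called when a (sub-)request finishes. A parent completes only
	 * when its last outstanding child completes, so both the split
	 * and the merge case are accounted for. */
	static void io_container_complete(struct io_container *ioc, int err)
	{
		if (ioc->end_io)
			ioc->end_io(ioc, err);
		if (ioc->parent && atomic_dec_and_test(&ioc->parent->pending))
			io_container_complete(ioc->parent, err);
	}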

Regards
Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : bsuparna@in.ibm.com
  Phone : 91-80-5267117, Extn : 2525


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://vger.kernel.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
       [not found] <CA2569E9.004A4E23.00@d73mta05.au.ibm.com>
@ 2001-02-04 16:46 ` Alan Cox
  0 siblings, 0 replies; 186+ messages in thread
From: Alan Cox @ 2001-02-04 16:46 UTC (permalink / raw)
  To: bsuparna
  Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox,
	Christoph Hellwig, Andi Kleen

> It appears that we are coming across 2 kinds of requirements for kiobuf
> vectors - and quite a bit of debate centering around that.
> 
> 1. In the block device i/o world, where large i/os may be involved, we'd
> 2. In the networking world, we deal with smaller fragments (for protocol

It's probably worth commenting at this point that the I2O message passing
layers do indeed have both #1 and #2 type descriptor chains, to optimise
performance for different tasks. We aren't the only people to hit this.

I2O supports 
	offset, pagelist, length

where the middle pages in the list are entirely copied

And sets of
	addr, len

tuples.
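
For illustration only, the two shapes might be rendered in C as below; the
struct and field names are mine, not the I2O spec's:

	/* #1: block-style - one (offset, length) pair plus a page list;
	 * the middle pages are transferred whole, so a large i/o needs
	 * only the first-page offset and the total length. */
	struct i2o_pagelist_desc {
		u32	offset;		/* byte offset into the first page */
		u32	length;		/* total transfer length */
		u64	pages[0];	/* physical page addresses, inline */
	};

	/* #2: network-style - an arbitrary chain of (addr, len) tuples,
	 * one per fragment. */
	struct i2o_sge {
		u64	addr;
		u32	len;
	};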



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 186+ messages in thread

end of thread, other threads:[~2001-02-12 16:21 UTC | newest]

Thread overview: 186+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-02-01 14:44 [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains bsuparna
2001-02-01 15:09 ` Christoph Hellwig
2001-02-01 16:08   ` Steve Lord
2001-02-01 16:49     ` Stephen C. Tweedie
2001-02-01 17:02       ` Christoph Hellwig
2001-02-01 17:34         ` Alan Cox
2001-02-01 17:49           ` Stephen C. Tweedie
2001-02-01 17:09             ` Chaitanya Tumuluri
2001-02-01 20:33             ` Christoph Hellwig
2001-02-01 20:56               ` Steve Lord
2001-02-01 20:59                 ` Christoph Hellwig
2001-02-01 21:17                   ` Steve Lord
2001-02-01 21:44               ` Stephen C. Tweedie
2001-02-01 22:07               ` Stephen C. Tweedie
2001-02-02 12:02                 ` Christoph Hellwig
2001-02-05 12:19                   ` Stephen C. Tweedie
2001-02-05 21:28                     ` Ingo Molnar
2001-02-05 22:58                       ` Stephen C. Tweedie
2001-02-05 23:06                         ` Alan Cox
2001-02-05 23:16                           ` Stephen C. Tweedie
2001-02-06  0:19                         ` Manfred Spraul
2001-02-03 20:28                 ` Linus Torvalds
2001-02-05 11:03                   ` Stephen C. Tweedie
2001-02-05 12:00                     ` Manfred Spraul
2001-02-05 15:03                       ` Stephen C. Tweedie
2001-02-05 15:19                         ` Alan Cox
2001-02-05 17:20                           ` Stephen C. Tweedie
2001-02-05 17:29                             ` Alan Cox
2001-02-05 18:49                               ` Stephen C. Tweedie
2001-02-05 19:04                                 ` Alan Cox
2001-02-05 19:09                                 ` Linus Torvalds
2001-02-05 19:16                                   ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox
2001-02-05 19:28                                     ` Linus Torvalds
2001-02-05 20:54                                       ` Stephen C. Tweedie
2001-02-05 21:08                                         ` David Lang
2001-02-05 21:51                                         ` Alan Cox
2001-02-06  0:07                                         ` Stephen C. Tweedie
2001-02-06 17:00                                           ` Christoph Hellwig
2001-02-06 17:05                                             ` Stephen C. Tweedie
2001-02-06 17:14                                               ` Jens Axboe
2001-02-06 17:22                                               ` Christoph Hellwig
2001-02-06 18:26                                                 ` Stephen C. Tweedie
2001-02-06 17:37                                               ` Ben LaHaise
2001-02-06 18:00                                                 ` Jens Axboe
2001-02-06 18:09                                                   ` Ben LaHaise
2001-02-06 19:35                                                     ` Jens Axboe
2001-02-06 18:14                                                 ` Linus Torvalds
2001-02-08 11:21                                                   ` Andi Kleen
2001-02-08 14:11                                                   ` Martin Dalecki
2001-02-08 17:59                                                     ` Linus Torvalds
2001-02-06 18:18                                                 ` Ingo Molnar
2001-02-06 18:25                                                   ` Ben LaHaise
2001-02-06 18:35                                                     ` Ingo Molnar
2001-02-06 18:54                                                       ` Ben LaHaise
2001-02-06 18:58                                                         ` Ingo Molnar
2001-02-06 19:11                                                           ` Ben LaHaise
2001-02-06 19:32                                                             ` Jens Axboe
2001-02-06 19:32                                                             ` Ingo Molnar
2001-02-06 19:32                                                             ` Linus Torvalds
2001-02-06 19:44                                                               ` Ingo Molnar
2001-02-06 19:49                                                               ` Ben LaHaise
2001-02-06 19:57                                                                 ` Ingo Molnar
2001-02-06 20:07                                                                   ` Jens Axboe
2001-02-06 20:25                                                                   ` Ben LaHaise
2001-02-06 20:41                                                                     ` Manfred Spraul
2001-02-06 20:50                                                                       ` Jens Axboe
2001-02-06 21:26                                                                         ` Manfred Spraul
2001-02-06 21:42                                                                           ` Linus Torvalds
2001-02-06 20:16                                                                             ` Marcelo Tosatti
2001-02-06 22:09                                                                               ` Jens Axboe
2001-02-06 22:26                                                                                 ` Linus Torvalds
2001-02-06 21:13                                                                                   ` Marcelo Tosatti
2001-02-06 23:26                                                                                     ` Linus Torvalds
2001-02-07 23:17                                                                                       ` select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait] Pavel Machek
2001-02-08 13:57                                                                                         ` Ben LaHaise
2001-02-08 17:52                                                                                         ` Linus Torvalds
2001-02-08 15:06                                                                                       ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Ben LaHaise
2001-02-08 13:44                                                                                         ` Marcelo Tosatti
2001-02-08 13:45                                                                                           ` Marcelo Tosatti
2001-02-07 23:15                                                                                   ` Pavel Machek
2001-02-08 13:22                                                                                     ` Stephen C. Tweedie
2001-02-08 12:03                                                                                       ` Marcelo Tosatti
2001-02-08 15:46                                                                                         ` Mikulas Patocka
2001-02-08 14:05                                                                                           ` Marcelo Tosatti
2001-02-08 16:11                                                                                             ` Mikulas Patocka
2001-02-08 14:44                                                                                               ` Marcelo Tosatti
2001-02-08 16:57                                                                                               ` Rik van Riel
2001-02-08 17:13                                                                                                 ` James Sutherland
2001-02-08 18:38                                                                                                 ` Linus Torvalds
2001-02-09 12:17                                                                                                   ` Martin Dalecki
2001-02-08 15:55                                                                                           ` Jens Axboe
2001-02-08 18:09                                                                                         ` Linus Torvalds
2001-02-08 14:52                                                                                     ` Mikulas Patocka
2001-02-08 19:50                                                                                       ` Stephen C. Tweedie
2001-02-11 21:30                                                                                       ` Pavel Machek
2001-02-06 21:57                                                                             ` Manfred Spraul
2001-02-06 22:13                                                                               ` Linus Torvalds
2001-02-06 22:26                                                                                 ` Andre Hedrick
2001-02-06 20:49                                                                     ` Jens Axboe
2001-02-07  0:21                                                                   ` Stephen C. Tweedie
2001-02-07  0:25                                                                     ` Ingo Molnar
2001-02-07  0:36                                                                       ` Stephen C. Tweedie
2001-02-07  0:50                                                                         ` Linus Torvalds
2001-02-07  1:49                                                                           ` Stephen C. Tweedie
2001-02-07  2:37                                                                             ` Linus Torvalds
2001-02-07 14:52                                                                               ` Stephen C. Tweedie
2001-02-07 19:12                                                                               ` Richard Gooch
2001-02-07 20:03                                                                                 ` Stephen C. Tweedie
2001-02-07  1:51                                                                           ` Jeff V. Merkey
2001-02-07  1:01                                                                             ` Ingo Molnar
2001-02-07  1:59                                                                               ` Jeff V. Merkey
2001-02-07  1:02                                                                             ` Jens Axboe
2001-02-07  1:19                                                                               ` Linus Torvalds
2001-02-07  1:39                                                                                 ` Jens Axboe
2001-02-07  1:45                                                                                   ` Linus Torvalds
2001-02-07  1:55                                                                                     ` Jens Axboe
2001-02-07  9:10                                                                                     ` David Howells
2001-02-07 12:16                                                                                       ` Stephen C. Tweedie
2001-02-07  2:00                                                                               ` Jeff V. Merkey
2001-02-07  1:06                                                                                 ` Ingo Molnar
2001-02-07  1:09                                                                                   ` Jens Axboe
2001-02-07  1:11                                                                                     ` Ingo Molnar
2001-02-07  1:26                                                                                   ` Linus Torvalds
2001-02-07  2:07                                                                                   ` Jeff V. Merkey
2001-02-07  1:08                                                                                 ` Jens Axboe
2001-02-07  2:08                                                                                   ` Jeff V. Merkey
2001-02-07  1:42                                                                         ` Jeff V. Merkey
2001-02-07  0:42                                                                       ` Linus Torvalds
2001-02-07  0:35                                                                     ` Jens Axboe
2001-02-07  0:41                                                                     ` Linus Torvalds
2001-02-07  1:27                                                                       ` Stephen C. Tweedie
2001-02-07  1:40                                                                         ` Linus Torvalds
2001-02-12 10:07                                                                           ` Jamie Lokier
2001-02-06 20:26                                                                 ` Linus Torvalds
2001-02-06 20:25                                                               ` Christoph Hellwig
2001-02-06 20:35                                                                 ` Ingo Molnar
2001-02-06 19:05                                                                   ` Marcelo Tosatti
2001-02-06 20:59                                                                     ` Ingo Molnar
2001-02-06 21:20                                                                       ` Steve Lord
2001-02-07 18:27                                                                   ` Christoph Hellwig
2001-02-06 20:59                                                                 ` Linus Torvalds
2001-02-07 18:26                                                                   ` Christoph Hellwig
2001-02-07 18:36                                                                     ` Linus Torvalds
2001-02-07 18:44                                                                       ` Christoph Hellwig
2001-02-08  0:34                                                                       ` Neil Brown
2001-02-06 19:46                                                             ` Ingo Molnar
2001-02-06 20:16                                                               ` Ben LaHaise
2001-02-06 20:22                                                                 ` Ingo Molnar
2001-02-06 19:20                                                         ` Linus Torvalds
2001-02-06  0:31                                       ` Roman Zippel
2001-02-06  1:01                                         ` Linus Torvalds
2001-02-06  9:22                                           ` Roman Zippel
2001-02-06  9:30                                           ` Ingo Molnar
2001-02-06  1:08                                         ` David S. Miller
2001-02-05 22:09                         ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Ingo Molnar
2001-02-05 16:56                       ` Linus Torvalds
2001-02-05 17:27                         ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox
2001-02-05 16:36                     ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Linus Torvalds
2001-02-05 19:08                       ` Stephen C. Tweedie
2001-02-01 17:49           ` Christoph Hellwig
2001-02-01 17:58             ` Alan Cox
2001-02-01 18:32               ` Rik van Riel
2001-02-01 18:59                 ` yodaiken
2001-02-01 19:33             ` Stephen C. Tweedie
2001-02-01 18:51           ` bcrl
2001-02-01 16:16   ` Stephen C. Tweedie
2001-02-01 17:05     ` Christoph Hellwig
2001-02-01 17:09       ` Christoph Hellwig
2001-02-01 17:41       ` Stephen C. Tweedie
2001-02-01 18:14         ` Christoph Hellwig
2001-02-01 18:25           ` Alan Cox
2001-02-01 18:39             ` Rik van Riel
2001-02-01 18:46               ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox
2001-02-01 18:48             ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Christoph Hellwig
2001-02-01 18:57               ` Alan Cox
2001-02-01 19:00                 ` Christoph Hellwig
2001-02-01 19:32           ` Stephen C. Tweedie
2001-02-01 20:46             ` Christoph Hellwig
2001-02-01 21:25               ` Stephen C. Tweedie
2001-02-02 11:51                 ` Christoph Hellwig
2001-02-02 14:04                   ` Stephen C. Tweedie
2001-02-02  4:18           ` bcrl
2001-02-02 12:12             ` Christoph Hellwig
2001-02-01 20:04         ` Chaitanya Tumuluri
     [not found] <CA2569E9.004A4E23.00@d73mta05.au.ibm.com>
2001-02-04 16:46 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox
2001-02-12 14:56 bsuparna

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).