* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
@ 2001-02-12 14:56 bsuparna
From: bsuparna @ 2001-02-12 14:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Dalecki, Ben LaHaise, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, linux-kernel, kiobuf-io-devel,
	Ingo Molnar


Going through all the discussions once again, I am trying to look at this
from the point of view of just the basic requirements for data structures
and the mechanisms they imply.

1. Should have a data structure that represents a memory chain, which may
not be contiguous in physical memory, and which can be passed down as a
single unit all the way through to the lowest-level drivers
     - e.g. for direct i/o to/from a contiguous virtual address range in
user space (without any intermediate copies)

(Networking and block i/o may require different optimizations in the
design of such a data structure, due to differences in the kind of access
patterns expected, as is apparent from the zero-copy networking fragments
vs. the raw i/o kiobuf/kiovec patches. There are situations where such a
data structure may be passed between subsystems, as in the i2o example.)

This data structure could be part of an I/O container.
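
As a minimal sketch of what such a memory chain might look like (the names
here are hypothetical, not an existing kernel interface):

	struct mem_fragment {
		struct page *page;	/* physical page backing this piece */
		unsigned int offset;	/* byte offset within that page */
		unsigned int length;	/* bytes used in this fragment */
	};

	struct mem_chain {
		int nr_frags;			/* fragments in the chain */
		struct mem_fragment *frags;	/* need not be physically contiguous */
	};

A lower layer would walk frags[] to build its scatter-gather list, so the
chain can be handed down whole, with no intermediate copies.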

2.  I/O containers may get split or merged as they pass through various
layers --- so any completion mechanism and i/o container design should be
able to account for both cases. At any point, a request could be
     - a collection of several higher level requests,
          or
     - could be one among several sub-requests of a single higher level
request.
(Just as appropriate "clustering" could happen at each level, appropriate
"splitting" may also take place depending on the situation. It may make
sense to delay splitting as far down the chain as possible in many
situations, where the higher level is only interested in the i/o in its
entirety and not in partial completion.)
When caching/buffers are involved, sometimes the sub-requests of a single
higher level request may have individual completion requirements (even when
no merges were involved), because the sub-request buffers may be used to
service other requests alongside. With raw i/o that might not be the case.
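
A sketch of the completion accounting this implies (hypothetical names,
assuming an atomic count of outstanding sub-requests):

	struct io_container {
		atomic_t io_count;	/* outstanding sub-requests */
		int error;		/* first error seen, if any */
		void (*done)(struct io_container *);
	};

	/* called from each sub-request's completion callback */
	static void subreq_end_io(struct io_container *ioc, int err)
	{
		if (err && !ioc->error)
			ioc->error = err;	/* racy; fine for a sketch */
		if (atomic_dec_and_test(&ioc->io_count))
			ioc->done(ioc);		/* last one out: whole i/o done */
	}

For the merge case the same scheme works in the other direction: one
sub-request completion would decrement the count in each of the
higher-level containers that were merged into it.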

3. It is desirable that layers which process the requests along the way
without splitting/merging be able to pass along the same I/O container
without any duplication or cloning, and to intercept async i/o completions
for post-processing.

4. (Optional) It would be nice if different kinds of I/O containers or
buffer structures could be used at different levels, without having
explicit linkage fields (like bh --> page, for example), and in a way that
intermediate drivers or layers can work transparently.

3 & 4 are more layering-related items, which get a little specific, but do
1 and 2 cover the general things we are looking for?

Regards
Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : bsuparna@in.ibm.com
  Phone : 91-80-5267117, Extn : 2525



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:40                                       ` Linus Torvalds
@ 2001-02-12 10:07                                         ` Jamie Lokier
From: Jamie Lokier @ 2001-02-12 10:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Linus Torvalds wrote:
> Absolutely. This is exactly what I mean by saying that low-level drivers
> may not actually be able to handle new cases that they've never been asked
> to do before - they just never saw anything like a 64kB request before or
> something that crossed its own alignment.
> 
> But the _higher_ levels are there. And there's absolutely nothing in the
> design that is a real problem. But there's no question that you might need
> to fix up more than one or two low-level drivers.
> 
> (The only drivers I know better are the IDE ones, and as far as I can tell
> they'd have no trouble at all with any of this. Most other normal drivers
> are likely to be in this same situation. But because I've not had a reason
> to test, I certainly won't guarantee even that).

PCI has dma_mask, which distinguishes different device capabilities.
This nice interface handles 64-bit capable devices, 32-bit ones, ISA
limitations (the old 16MB limit) and some other strange devices.

This mask appears in block devices one way or another so that bounce
buffers are used for high addresses.

How about a mask for block devices which indicates the kinds of
alignment and lengths that the driver can handle?  For old drivers that
can't be thoroughly tested, we assume the worst.  Some devices have
hardware limitations.  Newer, tested drivers can relax the limits.

It's probably not difficult to say, "this 64k request can't be handled
so split it into 1k requests".  It integrates naturally with the
decision to use bounce buffers -- alignment restrictions cause copying
just as high addresses cause copying.
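
Sketching the idea (these fields and the helper are hypothetical, by
analogy with dma_mask):

	struct blk_limits {
		unsigned int max_len;		/* e.g. 1k for an untested driver */
		unsigned int align_mask;	/* required start alignment - 1 */
	};

	/* same decision point where high pages force a bounce buffer */
	static int needs_split(struct blk_limits *lim,
			       unsigned long start, unsigned int len)
	{
		return (start & lim->align_mask) || len > lim->max_len;
	}

Old drivers that can't be tested would advertise the most conservative
limits; a driver known to cope with 64k requests crossing its own
alignment would relax them.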

-- Jamie

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 14:52                                                   ` Mikulas Patocka
  2001-02-08 19:50                                                     ` Stephen C. Tweedie
@ 2001-02-11 21:30                                                     ` Pavel Machek
From: Pavel Machek @ 2001-02-11 21:30 UTC (permalink / raw)
  To: Mikulas Patocka, Pavel Machek
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Hi!

> > So you consider the inability to select() on regular files a _feature_?
> 
> Select on files is unimplementable. You can't do background file IO the
> same way you do background receiving of packets on a socket. The
> filesystem is synchronous. It can block.

You can use helper threads if the VFS layer is not able to handle
background IO. Then we can do it right in linux-4.4.
								Pavel

-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 18:38                                                               ` Linus Torvalds
@ 2001-02-09 12:17                                                                 ` Martin Dalecki
From: Martin Dalecki @ 2001-02-09 12:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Mikulas Patocka, Marcelo Tosatti,
	Stephen C. Tweedie, Pavel Machek, Jens Axboe, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Linus Torvalds wrote:
> 
> On Thu, 8 Feb 2001, Rik van Riel wrote:
> 
> > On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> >
> > > > > You need aio_open.
> > > > Could you explain this?
> > >
> > > If the server is sending many small files, the disk spends a huge
> > > amount of time walking the directory tree and seeking to inodes. Maybe
> > > opening the file is even slower than reading it
> >
> > Not if you have a big enough inode_cache and dentry_cache.
> >
> > OTOH ... if you have enough memory the whole async IO argument
> > is moot anyway because all your files will be in memory too.
> 
> Note that this _is_ an important point.
> 
> You should never _ever_ think about pure IO speed as the most important
> thing. Even if you get absolutely perfect IO streaming off the fastest
> disk you can find, I will beat you every single time with a cached setup
> that doesn't need to do IO at all.
> 
> 90% of the VFS layer is all about caching, and trying to avoid IO. Of the
> rest, about 9% is about trying to avoid even calling down to the low-level
> filesystem, because it's faster if we can handle it at a high level
> without any need to even worry about issues like physical disk addresses.
> Even if those addresses are cached.
> 
> The remaining 1% is about actually getting the IO done. At that point we
> end up throwing our hands in the air and saying "ok, this will be slow".
> 
> So if you design your system for disk load, you are missing a big portion
> of the picture.
> 
> There are cases where IO really matter. The most notable one being
> databases, certainly _not_ web or ftp servers. For web- or ftp-servers you
> buy more memory if you want high performance, and you tend to be limited
> by the network speed anyway (if you have multiple gigabit networks and
> network speed isn't an issue, then I can also tell you that buying a few
> gigabyte of RAM isn't an issue, because you are obviously working for
> something like the DoD and have very little regard for the cost of the
> thing ;)
> 
> For databases (and for file servers that you want to be robust over a
> crash), IO throughput is an issue mainly because you need to put the damn
> requests in stable memory somewhere. Which tends to mean that _write_
> speed is what really matters, because the reads you can still try to cache
> as efficiently as humanly possible (and the issue of database design then
> turns into trying to find every single piece of locality you can, so that
> the read caching works as well as possible).
> 
> Short and sweet: "aio_open()" is basically never supposed to be an issue.
> If it is, you've misdesigned something, or you're trying too damn hard to
> single-thread everything (and "hiding" the threading that _does_ happen by
> just calling it "AIO" instead - lying to yourself, in short).

Right - I agree with you that an AIO design basically hides an inherently
multi-threaded program flow. This argument is indeed very catchy. And
looking at it from another angle, one sees that most AIO designs date from
times when multi-threading in applications wasn't as common as it is now.
Most prominently, coprocesses in a shell come to my mind as a very good
example of how to handle AIO (sort of)...

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 14:52                                                   ` Mikulas Patocka
@ 2001-02-08 19:50                                                     ` Stephen C. Tweedie
  2001-02-11 21:30                                                     ` Pavel Machek
From: Stephen C. Tweedie @ 2001-02-08 19:50 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Pavel Machek, Linus Torvalds, Jens Axboe, Marcelo Tosatti,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Hi,

On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote:
> 
> > How do you write a high-performance ftp server without threads if select
> > on a regular file always returns "ready"?
> 
> No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

Ahh, but even VMS SYS$QIO is synchronous at doing opens, allocation of
the IO request packets, and mapping file location to disk blocks.
Only the data IO is ever async (and Ben's async IO stuff for Linux
provides that too).

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 16:57                                                             ` Rik van Riel
  2001-02-08 17:13                                                               ` James Sutherland
@ 2001-02-08 18:38                                                               ` Linus Torvalds
  2001-02-09 12:17                                                                 ` Martin Dalecki
From: Linus Torvalds @ 2001-02-08 18:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mikulas Patocka, Marcelo Tosatti, Stephen C. Tweedie,
	Pavel Machek, Jens Axboe, Manfred Spraul, Ben LaHaise,
	Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Rik van Riel wrote:

> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> 
> > > > You need aio_open.
> > > Could you explain this? 
> > 
> > If the server is sending many small files, the disk spends a huge
> > amount of time walking the directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it
> 
> Not if you have a big enough inode_cache and dentry_cache.
> 
> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Note that this _is_ an important point.

You should never _ever_ think about pure IO speed as the most important
thing. Even if you get absolutely perfect IO streaming off the fastest
disk you can find, I will beat you every single time with a cached setup
that doesn't need to do IO at all.

90% of the VFS layer is all about caching, and trying to avoid IO. Of the
rest, about 9% is about trying to avoid even calling down to the low-level
filesystem, because it's faster if we can handle it at a high level
without any need to even worry about issues like physical disk addresses.
Even if those addresses are cached.

The remaining 1% is about actually getting the IO done. At that point we
end up throwing our hands in the air and saying "ok, this will be slow".

So if you design your system for disk load, you are missing a big portion
of the picture.

There are cases where IO really matter. The most notable one being
databases, certainly _not_ web or ftp servers. For web- or ftp-servers you
buy more memory if you want high performance, and you tend to be limited
by the network speed anyway (if you have multiple gigabit networks and
network speed isn't an issue, then I can also tell you that buying a few
gigabyte of RAM isn't an issue, because you are obviously working for
something like the DoD and have very little regard for the cost of the
thing ;)

For databases (and for file servers that you want to be robust over a
crash), IO throughput is an issue mainly because you need to put the damn
requests in stable memory somewhere. Which tends to mean that _write_
speed is what really matters, because the reads you can still try to cache
as efficiently as humanly possible (and the issue of database design then
turns into trying to find every single piece of locality you can, so that
the read caching works as well as possible).

Short and sweet: "aio_open()" is basically never supposed to be an issue.
If it is, you've misdesigned something, or you're trying too damn hard to
single-thread everything (and "hiding" the threading that _does_ happen by
just calling it "AIO" instead - lying to yourself, in short).

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 12:03                                                     ` Marcelo Tosatti
  2001-02-08 15:46                                                       ` Mikulas Patocka
@ 2001-02-08 18:09                                                       ` Linus Torvalds
From: Linus Torvalds @ 2001-02-08 18:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, Pavel Machek, Jens Axboe, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
> 
> On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
> 
> <snip>
> 
> > > How do you write a high-performance ftp server without threads if
> > > select on a regular file always returns "ready"?
> > 
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
> 
> Even async IO (i.e. aio_read/aio_write) should block on the request queue
> if it's full, in Linus's mind.

Not necessarily. I said that "READA/WRITEA" are only worth exporting
inside the kernel - because the latencies and complexities are low-level
enough that it should not be exported to user space as such.

But I could imagine a kernel aio package that does the equivalent of

	bh->b_end_io = completion_handler;
	generic_make_request(WRITE, bh);	/* this may block */
	bh = bh->b_next;

	/* Now, fill it up as much as we can.. */
	current->state = TASK_INTERRUPTIBLE;
	while (more data to be written) {
		if (generic_make_request(WRITEA, bh) < 0)
			break;
		bh = bh->b_next;
	}

	return;

and then you make the _completion handler_ thing continue to feed more
requests. Yes, you may block at some points (because you need to always
have at least _one_ request in-flight in order to have the state machine
active), but you can basically try to avoid blocking more than necessary.
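
The completion-handler half of that sketch might look something like this
(pseudocode in the same spirit as the above, not a real interface):

	/* runs in interrupt context when a bh finishes */
	void completion_handler(struct buffer_head *bh, int uptodate)
	{
		/* top up the queue ourselves: bouncing through user
		   space here would leave the device idle */
		while (more data to be written) {
			if (generic_make_request(WRITEA, next_bh) < 0)
				break;	/* queue full again, stop for now */
			next_bh = next_bh->b_next;
		}
	}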

But do you see why the above can't be done from user space? It requires
that the completion handler (which runs in an interrupt context) be able
to continue to feed requests and keep the queue filled. If you don't do
that, you'll never have good throughput, because it takes too long to send
signals, re-schedule or whatever to user mode.

And do you see how it has to block _sometimes_? If people do hundreds of
AIO requests, we can't let memory just fill up with pending writes..

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 14:11                 ` Martin Dalecki
@ 2001-02-08 17:59                   ` Linus Torvalds
From: Linus Torvalds @ 2001-02-08 17:59 UTC (permalink / raw)
  To: Martin Dalecki
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Martin Dalecki wrote:
> > 
> > But you'll have a bitch of a time trying to merge multiple
> > threads/processes reading from the same area on disk at roughly the same
> > time. Your higher levels won't even _know_ that there is merging to be
> > done until the IO requests hit the wall in waiting for the disk.
> 
> Merging is a hardware-tied optimization, so it should happen where you
> really have full "knowledge" and control of the hardware -> namely the
> device driver.

Or, in many cases, the device itself. There are valid reasons for not
doing merging in the driver, but they all tend to boil down to "even lower
layers can do a better job of it". They basically _never_ boil down to
"upper layers already did it for us".

That said, there tend to be advantages to doing "appropriate" clustering
at each level. Upper layers can (and do) use read-ahead to help the lower
levels. The write-out can (and currently does not) try to sort the
requests for better elevator behaviour.

The driver level can (and does) further cluster the requests - even if the
low-level device does a perfect job of ordering and merging on its own,
it's usually advantageous to have fewer (and bigger) commands in-flight in
order to have fewer completion interrupts and less command traffic on the
bus.

So it's obviously not entirely black-and-white. Upper layers can help, but
it's a mistake to think that they should "do the work".

(Note: a lot of people seem to think that "layering" means that the
complexity is in upper layers, and that lower layers should be simple and
"stupid". This is not true. A well-balanced layering would have all layers
doing potentially equally complex things - but the complexity should be
_independent_. Complex interactions are bad. But it's also bad to think
that lower levels shouldn't be allowed to optimize because they should be
"simple".)

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 16:57                                                             ` Rik van Riel
@ 2001-02-08 17:13                                                               ` James Sutherland
  2001-02-08 18:38                                                               ` Linus Torvalds
From: James Sutherland @ 2001-02-08 17:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mikulas Patocka, Marcelo Tosatti, Stephen C. Tweedie,
	Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Thu, 8 Feb 2001, Rik van Riel wrote:

> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> 
> > > > You need aio_open.
> > > Could you explain this? 
> > 
> > If the server is sending many small files, the disk spends a huge
> > amount of time walking the directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it
> 
> Not if you have a big enough inode_cache and dentry_cache.

Eh? However big the caches are, you can still get misses which will
require multiple (blocking) disk accesses to handle...

> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Only for cache hits. If you're doing a Mindcraft benchmark or something
with everything in RAM, you're fine - for real world servers, that's not
really an option ;-)

Really, you want/need cache MISSES to be handled without blocking. However
big the caches, short of running EVERYTHING from a ramdisk, these will
still happen!


James.


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 16:11                                                           ` Mikulas Patocka
  2001-02-08 14:44                                                             ` Marcelo Tosatti
@ 2001-02-08 16:57                                                             ` Rik van Riel
  2001-02-08 17:13                                                               ` James Sutherland
  2001-02-08 18:38                                                               ` Linus Torvalds
From: Rik van Riel @ 2001-02-08 16:57 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Marcelo Tosatti, Stephen C. Tweedie, Pavel Machek,
	Linus Torvalds, Jens Axboe, Manfred Spraul, Ben LaHaise,
	Ingo Molnar, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Thu, 8 Feb 2001, Mikulas Patocka wrote:

> > > You need aio_open.
> > Could you explain this? 
> 
> If the server is sending many small files, the disk spends a huge
> amount of time walking the directory tree and seeking to inodes. Maybe
> opening the file is even slower than reading it

Not if you have a big enough inode_cache and dentry_cache.

OTOH ... if you have enough memory the whole async IO argument
is moot anyway because all your files will be in memory too.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 14:05                                                         ` Marcelo Tosatti
@ 2001-02-08 16:11                                                           ` Mikulas Patocka
  2001-02-08 14:44                                                             ` Marcelo Tosatti
  2001-02-08 16:57                                                             ` Rik van Riel
From: Mikulas Patocka @ 2001-02-08 16:11 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

> > The problem is that aio_read and aio_write are pretty useless for an
> > ftp or http server. You need aio_open.
> 
> Could you explain this? 

If the server is sending many small files, the disk spends a huge amount of
time walking the directory tree and seeking to inodes. Maybe opening the
file is even slower than reading it - a read is usually sequential, but an
open needs to seek to a few areas of the disk.

And if you have a one-threaded server using open, close, aio_read and
aio_write, you actually block the whole server while it is opening a
single file. This is not how async io is supposed to work.
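
The usual workaround is to push the blocking open() into helper threads so
the main event loop never blocks -- a user-space sketch, with the
notification mechanism left hypothetical:

	#include <fcntl.h>
	#include <pthread.h>

	struct open_req {
		const char *path;
		int fd;
	};

	/* helper thread: do the blocking part, hand the fd back */
	static void *open_worker(void *arg)
	{
		struct open_req *req = arg;
		req->fd = open(req->path, O_RDONLY);	/* may block on seeks */
		/* notify the event loop, e.g. by writing req to a pipe */
		return NULL;
	}

This costs a thread per in-flight open rather than per connection, which is
essentially what the FreeBSD-style aio emulation does inside the kernel.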

Mikulas



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 15:46                                                       ` Mikulas Patocka
  2001-02-08 14:05                                                         ` Marcelo Tosatti
@ 2001-02-08 15:55                                                         ` Jens Axboe
From: Jens Axboe @ 2001-02-08 15:55 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Marcelo Tosatti, Stephen C. Tweedie, Pavel Machek,
	Linus Torvalds, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Thu, Feb 08 2001, Mikulas Patocka wrote:
> > Even async IO (i.e. aio_read/aio_write) should block on the request
> > queue if it's full, in Linus's mind.
> 
> This is not a problem (you can create a queue big enough to handle the load).

Well in theory, but in practice this isn't a very good idea. At some
point throwing yet more requests in there doesn't make a whole lot
of sense. You are basically _always_ going to be able to empty the free
request list by dirtying lots of data.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 12:03                                                     ` Marcelo Tosatti
@ 2001-02-08 15:46                                                       ` Mikulas Patocka
  2001-02-08 14:05                                                         ` Marcelo Tosatti
  2001-02-08 15:55                                                         ` Jens Axboe
  2001-02-08 18:09                                                       ` Linus Torvalds
From: Mikulas Patocka @ 2001-02-08 15:46 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

> > > How do you write a high-performance ftp server without threads if
> > > select on a regular file always returns "ready"?
> > 
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
> 
> Even async IO (i.e. aio_read/aio_write) should block on the request queue
> if it's full, in Linus's mind.

This is not a problem (you can create a queue big enough to handle the load).

The problem is that aio_read and aio_write are pretty useless for an ftp
or http server. You need aio_open.

Mikulas


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 23:26                                                   ` Linus Torvalds
@ 2001-02-08 15:06                                                     ` Ben LaHaise
  2001-02-08 13:44                                                       ` Marcelo Tosatti
From: Ben LaHaise @ 2001-02-08 15:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marcelo Tosatti, Jens Axboe, Manfred Spraul, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Linus Torvalds wrote:

> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel to
> inform user space that "now you can re-try submitting your IO".
>
> Could be done. But that's a big thing.

Has been done.  Still needs some work, but it works pretty well.  As for
throttling io, having ios submitted does not have to correspond to them
being queued in the lower layers.  The main issue with async io is
limiting the amount of pinned memory for ios; if that's taken care of, I
don't think it matters how many ios are in flight.

> > An application which sets non blocking behavior and busy waits for a
> > request (which seems to be your argument) is just stupid, of course.
>
> Tell me what else it could do at some point? You need something like
> select() to wait on it. There are no such interfaces right now...
>
> (besides, latency would suck. I bet you're better off waiting for the
> requests if they are all used up. It takes too long to get deep into the
> kernel from user space, and you cannot use the exclusive waiters with its
> anti-herd behaviour etc).

Ah, but no.  In fact for some things, the wait queue extensions I'm using
will be more efficient, as things like test_and_set_bit for obtaining a
lock get executed without waking up a task.

> Simple rule: if you want to optimize concurrency and avoid waiting - use
> several processes or threads instead. At which point you can get real work
> done on multiple CPU's, instead of worrying about what happens when you
> have to wait on the disk.

There do exist plenty of cases where threads are not efficient enough.
Just the stack overhead alone with 8000 threads makes things really suck.
Event based io completion means that server processes don't need to have
the overhead of select/poll.  Add in NT style completion ports for waking
up the right number of worker threads off of the completion queue, and

That said, I don't expect all devices to support async io.  But given
support for files, raw and sockets, all the important cases are covered.
The remainder can be supported via userspace helpers.

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 23:15                                                 ` Pavel Machek
  2001-02-08 13:22                                                   ` Stephen C. Tweedie
@ 2001-02-08 14:52                                                   ` Mikulas Patocka
  2001-02-08 19:50                                                     ` Stephen C. Tweedie
  2001-02-11 21:30                                                     ` Pavel Machek
From: Mikulas Patocka @ 2001-02-08 14:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Hi!

> So you consider the inability to select() on regular files a _feature_?

Select on files is unimplementable. You can't do background file IO the
same way you do background receiving of packets on a socket. The
filesystem is synchronous. It can block.

> It can be a pretty serious problem with slow block devices
> (floppy). It also hurts when you are trying to do high-performance
> reads/writes. [I know it hurt in userspace sherlock search engine --
> kind of small altavista.]
> 
> How do you write a high-performance ftp server without threads if select
> on a regular file always returns "ready"?

No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

You can emulate asynchronous IO with kernel threads, as FreeBSD and some
commercial Unices do, but you still need as many (possibly kernel) threads
as requests you are servicing.

> > Remember: in the end you HAVE to wait somewhere. You're always going to be
> > able to generate data faster than the disk can take it. SOMETHING
> 
> Userspace wants to _know_ when to stop. It asks politely using
> "select()".

And how do you want to wait for other select()ed events if you are blocked
in wait_for_buffer in get_block (former bmap)?

Making real async IO would require rewriting all the filesystems and the
whole VFS _from_scratch_. It won't happen.

Mikulas


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 16:11                                                           ` Mikulas Patocka
@ 2001-02-08 14:44                                                             ` Marcelo Tosatti
  2001-02-08 16:57                                                             ` Rik van Riel
From: Marcelo Tosatti @ 2001-02-08 14:44 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Mikulas Patocka wrote:

> > > The problem is that aio_read and aio_write are pretty useless for an
> > > ftp or http server. You need aio_open.
> > 
> > Could you explain this?
> 
> If the server is sending many small files, the disk spends a huge amount
> of time walking the directory tree and seeking to inodes. Maybe opening
> the file is even slower than reading it - a read is usually sequential,
> but an open needs to seek to a few areas of the disk.
> 
> And if you have a one-threaded server using open, close, aio_read and
> aio_write, you actually block the whole server while it is opening a
> single file. This is not how async io is supposed to work.

Ok, but this is not the point of the discussion.


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:14               ` Linus Torvalds
  2001-02-08 11:21                 ` Andi Kleen
@ 2001-02-08 14:11                 ` Martin Dalecki
  2001-02-08 17:59                   ` Linus Torvalds
From: Martin Dalecki @ 2001-02-08 14:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

Linus Torvalds wrote:
> 
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> >
> > > The whole point of the post was that it is merging, not splitting,
> > > which is troublesome.  How are you going to merge requests without
> > > having chains of scatter-gather entities each with their own
> > > completion callbacks?
> >
> > Let me just emphasize what Stephen is pointing out: if requests are
> > properly merged at higher layers, then merging is neither required nor
> > desired.
> 
> I will claim that you CANNOT merge at higher levels and get good
> performance.
> 
> Sure, you can do read-ahead, and try to get big merges that way at a high
> level. Good for you.
> 
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Merging is a hardware-tied optimization, so it should happen where you
really have full "knowledge" and control of the hardware -> namely the
device driver.

> Quite frankly, this whole discussion sounds worthless. We have solved this
> problem already: it's called a "buffer head". Deceptively simple at higher
> levels, and lower levels can easily merge them together into chains and do
> fancy scatter-gather structures of them that can be dynamically extended
> at any time.
> 
> The buffer heads together with "struct request" do a hell of a lot more
> than just a simple scatter-gather: it's able to create ordered lists of
> independent sg-events, together with full call-backs etc. They are
> low-cost, fairly efficient, and they have worked beautifully for years.
> 
> The fact that kiobufs can't be made to do the same thing is somebody else's
> problem. I _know_ that merging has to happen late, and if others are
> hitting their heads against this issue until they turn silly, then that's
> their problem. You'll eventually learn, or you'll hit your heads into a
> pulp.

Amen.

-- 
- phone: +49 214 8656 283
- job:   STOCK-WORLD Media AG, LEV .de (MY OPPINNIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 15:46                                                       ` Mikulas Patocka
@ 2001-02-08 14:05                                                         ` Marcelo Tosatti
  2001-02-08 16:11                                                           ` Mikulas Patocka
  2001-02-08 15:55                                                         ` Jens Axboe
From: Marcelo Tosatti @ 2001-02-08 14:05 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Stephen C. Tweedie, Pavel Machek, Linus Torvalds, Jens Axboe,
	Manfred Spraul, Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Mikulas Patocka wrote:

> > > > How do you write a high-performance ftp server without threads if
> > > > select on a regular file always returns "ready"?
> > > 
> > > Select can work if the access is sequential, but async IO is a more
> > > general solution.
> > 
> > Even async IO (i.e. aio_read/aio_write) should block on the request
> > queue if it's full, in Linus's mind.
> 
> This is not a problem (you can create a queue big enough to handle the load).

The point is that you want to be able to not block if the queue is full
(and the queue size has nothing to do with that).

> The problem is that aio_read and aio_write are pretty useless for an ftp
> or http server. You need aio_open.

Could you explain this? 


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 13:44                                                       ` Marcelo Tosatti
@ 2001-02-08 13:45                                                         ` Marcelo Tosatti
From: Marcelo Tosatti @ 2001-02-08 13:45 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Linus Torvalds, Jens Axboe, Manfred Spraul, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar


On Thu, 8 Feb 2001, Marcelo Tosatti wrote:

> 
> On Thu, 8 Feb 2001, Ben LaHaise wrote:
> 
> <snip>
> 
> > > (besides, latency would suck. I bet you're better off waiting for the
> > > requests if they are all used up. It takes too long to get deep into the
> > > kernel from user space, and you cannot use the exclusive waiters with its
> > > anti-herd behaviour etc).
> > 
> > Ah, but no.  In fact for some things, the wait queue extensions I'm using
> > will be more efficient, as things like test_and_set_bit for obtaining a
> > lock get executed without waking up a task.
> 
> The latency argument is somewhat bogus because there is no problem with
> checking the request queue, in the aio syscalls, and simply failing if
> it's full.

Ugh, I forgot to say: check the request queue before doing any filesystem
work.


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 15:06                                                     ` Ben LaHaise
@ 2001-02-08 13:44                                                       ` Marcelo Tosatti
  2001-02-08 13:45                                                         ` Marcelo Tosatti
From: Marcelo Tosatti @ 2001-02-08 13:44 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Linus Torvalds, Jens Axboe, Manfred Spraul, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar


On Thu, 8 Feb 2001, Ben LaHaise wrote:

<snip>

> > (besides, latency would suck. I bet you're better off waiting for the
> > requests if they are all used up. It takes too long to get deep into the
> > kernel from user space, and you cannot use the exclusive waiters with its
> > anti-herd behaviour etc).
> 
> Ah, but no.  In fact for some things, the wait queue extensions I'm using
> will be more efficient, as things like test_and_set_bit for obtaining a
> lock get executed without waking up a task.

The latency argument is somewhat bogus because there is no problem with
checking the request queue, in the aio syscalls, and simply failing if it's
full.
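
I.e., something along these lines at syscall entry (the helper name is
made up):

	/* in the aio syscall, before touching the filesystem */
	if (request_queue_full(q))
		return -EAGAIN;	/* caller retries when there is room */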


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 23:15                                                 ` Pavel Machek
@ 2001-02-08 13:22                                                   ` Stephen C. Tweedie
  2001-02-08 12:03                                                     ` Marcelo Tosatti
  2001-02-08 14:52                                                   ` Mikulas Patocka
From: Stephen C. Tweedie @ 2001-02-08 13:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Hi,

On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote:
> 
> > EAGAIN is _not_ a valid return value for block devices or for regular
> > files. And in fact it _cannot_ be, because select() is defined to always
> > return 1 on them - so if a write() were to return EAGAIN, user space would
> > have nothing to wait on. Busy waiting is evil.
> 
> So you consider the inability to select() on regular files a _feature_?

Select might make some sort of sense for sequential access to files,
and for random access via lseek/read, but it makes no sense at all for
pread and pwrite, where select() has no idea _which_ part of the file
the user is going to want to access next.

> How do you write a high-performance ftp server without threads if select
> on a regular file always returns "ready"?

Select can work if the access is sequential, but async IO is a more
general solution.

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-08 13:22                                                   ` Stephen C. Tweedie
@ 2001-02-08 12:03                                                     ` Marcelo Tosatti
  2001-02-08 15:46                                                       ` Mikulas Patocka
  2001-02-08 18:09                                                       ` Linus Torvalds
From: Marcelo Tosatti @ 2001-02-08 12:03 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Pavel Machek, Linus Torvalds, Jens Axboe, Manfred Spraul,
	Ben LaHaise, Ingo Molnar, Alan Cox, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:

<snip>

> > How do you write a high-performance ftp server without threads if select
> > on a regular file always returns "ready"?
> 
> Select can work if the access is sequential, but async IO is a more
> general solution.

Even async IO (i.e. aio_read/aio_write) should block on the request queue
if it's full, in Linus's mind.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:14               ` Linus Torvalds
@ 2001-02-08 11:21                 ` Andi Kleen
  2001-02-08 14:11                 ` Martin Dalecki
From: Andi Kleen @ 2001-02-08 11:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

On Tue, Feb 06, 2001 at 10:14:21AM -0800, Linus Torvalds wrote:
> I will claim that you CANNOT merge at higher levels and get good
> performance.
> 
> Sure, you can do read-ahead, and try to get big merges that way at a high
> level. Good for you.
> 
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Hi,

I've tried to experimentally check this statement.

I instrumented a kernel with the following patch. It keeps a counter
for every merge between unrelated requests. An unrelated merge is defined
as one between requests allocated by different `current' tasks.
I did various tests and surprisingly I was not able to trigger a
single unrelated merge on my IDE system with various IO loads (dbench,
news expire, news sort, kernel compile, swapping ...)

So either my patch is wrong (if yes, what is wrong?), or they simply do not
happen in usual IO loads. I know that it has a few holes (it doesn't
count unrelated merges that happen from the same process, and if a process
quits and another one gets its kernel stack and the IO of both is merged,
it'll be counted as a related merge), but if unrelated merges were relevant,
more should still show up, no?

My pet theory is that the page and buffer caches filter most unrelated
merges out. I haven't tried to use raw IO to avoid this problem, but I
expect that anything that does raw IO will do some intelligent IO
scheduling on its own anyway.

If anyone is interested, it would be good to know whether other people are
able to trigger unrelated merges in real loads.
Here is the patch. Display the statistics using:

(echo print unrelated_merge ; echo print related_merge) | gdb vmlinux /proc/kcore


--- linux/drivers/block/ll_rw_blk.c-REQSTAT	Tue Jan 30 13:33:25 2001
+++ linux/drivers/block/ll_rw_blk.c	Thu Feb  8 01:13:57 2001
@@ -31,6 +31,9 @@
 
 #include <linux/module.h>
 
+int unrelated_merge; 
+int related_merge;
+
 /*
  * MAC Floppy IWM hooks
  */
@@ -478,6 +481,7 @@
 		rq->rq_status = RQ_ACTIVE;
 		rq->special = NULL;
 		rq->q = q;
+		rq->originator = current;
 	}
 
 	return rq;
@@ -668,6 +672,11 @@
 	if (!q->merge_requests_fn(q, req, next, max_segments))
 		return;
 
+	if (next->originator != req->originator)
+		unrelated_merge++; 
+	else
+		related_merge++; 
+
 	q->elevator.elevator_merge_req_fn(req, next);
 	req->bhtail->b_reqnext = next->bh;
 	req->bhtail = next->bhtail;
--- linux/include/linux/blkdev.h-REQSTAT	Tue Jan 30 17:17:01 2001
+++ linux/include/linux/blkdev.h	Wed Feb  7 23:33:35 2001
@@ -45,6 +45,8 @@
 	struct buffer_head * bh;
 	struct buffer_head * bhtail;
 	request_queue_t *q;
+
+	struct task_struct *originator;
 };
 
 #include <linux/elevator.h>




-Andi


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 18:36                                   ` Linus Torvalds
  2001-02-07 18:44                                     ` Christoph Hellwig
@ 2001-02-08  0:34                                     ` Neil Brown
From: Neil Brown @ 2001-02-08  0:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Wednesday February 7, torvalds@transmeta.com wrote:
> 
> 
> On Wed, 7 Feb 2001, Christoph Hellwig wrote:
> 
> > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> > > 
> > > Actually, they really aren't.
> > > 
> > > They kind of _used_ to be, but more and more they've moved away from that
> > > historical use. Check in particular the page cache, and as a really
> > > extreme case the swap cache version of the page cache.
> > 
> > Yes.  And that exactly why I think it's ugly to have the left-over
> > caching stuff in the same data sctruture as the IO buffer.
> 
> I do agree.
> 
> I would not be opposed to factoring out the "pure block IO" part from the
> bh struct. It should not even be very hard. You'd do something like
> 
> 	struct block_io {
> 		.. here is the stuff needed for block IO ..
> 	};
> 
> 	struct buffer_head {
> 		struct block_io io;
> 		.. here is the stuff needed for hashing etc ..
> 	}
> 
> and then you make "generic_make_request()" and everything lower down take
> just the "struct block_io".
> 

I was just thinking the same, or a similar thing.
I wanted to do

    struct io_head {
         stuff
    };
    struct buffer_head {
         struct io_head;
         more stuff;
    };

so that, as an unnamed substructure, the content of the struct io_head
would automagically be promoted to appear to be content of
buffer_head.
However I then remembered (when it didn't work) that unnamed
substructures are a feature of the Plan-9 C compiler, not the GNU
Compiler Collection. (Any gcc coders out there think this would be a
good thing to add?
  http://plan9.bell-labs.com/sys/doc/compiler.html
)

Anyway, I produced the same result in a rather ugly way with #defines
and modified raid5 to use 32-byte block_io structures instead of the
80+ byte buffer_heads, and it ... doesn't quite work :-( it boots
fine, but raid5 dies and the Oops message is a few kilometers away.
Anyway, I think the concept is fine.

Patch is below for your inspection.

It occurs to me that Stephen's desire to pass lots of requests through
make_request all at once isn't a bad idea and could be done by simply
linking the io_heads together with b_reqnext.
This would require:
  1/ all callers of generic_make_request (there are 3) to initialise
     b_reqnext
  2/ all registered make_request_fn functions (there are again 3 I
     think)  to cope with following b_reqnext

It shouldn't be too hard to make the elevator code take advantage of
any ordering that it finds in the list.

I don't have a patch which does this.
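
A make_request_fn coping with such a chain might just loop (a sketch
against the 2.4 interfaces; single_make_request stands for the existing
one-buffer path and is a made-up name):

	static int chain_make_request(request_queue_t *q, int rw,
				      struct buffer_head *bh)
	{
		struct buffer_head *next;

		while (bh) {
			next = bh->b_reqnext;	/* detach before queueing bh */
			bh->b_reqnext = NULL;
			single_make_request(q, rw, bh);
			bh = next;
		}
		return 0;
	}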

NeilBrown


--- ./include/linux/fs.h	2001/02/07 22:45:37	1.1
+++ ./include/linux/fs.h	2001/02/07 23:09:05
@@ -207,6 +207,7 @@
 #define BH_Protected	6	/* 1 if the buffer is protected */
 
 /*
+ * THIS COMMENT NO-LONGER CORRECT.
  * Try to keep the most commonly used fields in single cache lines (16
  * bytes) to improve performance.  This ordering should be
  * particularly beneficial on 32-bit processors.
@@ -217,31 +218,43 @@
  * The second 16 bytes we use for lru buffer scans, as used by
  * sync_buffers() and refill_freelist().  -- sct
  */
+
+/* 
+ * io_head is all that is needed by device drivers.
+ */
+#define io_head_fields \
+	unsigned long b_state;		/* buffer state bitmap (see above) */	\
+	struct buffer_head *b_reqnext;	/* request queue */			\
+	unsigned short b_size;		/* block size */			\
+	kdev_t b_rdev;			/* Real device */			\
+	unsigned long b_rsector;	/* Real buffer location on disk */	\
+	char * b_data;			/* pointer to data block (512 byte) */	\
+	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */ \
+ 	void *b_private;		/* reserved for b_end_io */		\
+	struct page *b_page;		/* the page this bh is mapped to */	\
+     /* this line intensionally left blank */
+struct io_head {
+	io_head_fields
+};
+
+/* buffer_head adds all the stuff needed by the buffer cache */
 struct buffer_head {
-	/* First cache line: */
+	io_head_fields
+
 	struct buffer_head *b_next;	/* Hash queue list */
 	unsigned long b_blocknr;	/* block number */
-	unsigned short b_size;		/* block size */
 	unsigned short b_list;		/* List that this buffer appears */
 	kdev_t b_dev;			/* device (B_FREE = free) */
 
 	atomic_t b_count;		/* users using this block */
-	kdev_t b_rdev;			/* Real device */
-	unsigned long b_state;		/* buffer state bitmap (see above) */
 	unsigned long b_flushtime;	/* Time when (dirty) buffer should be written */
 
 	struct buffer_head *b_next_free;/* lru/free list linkage */
 	struct buffer_head *b_prev_free;/* doubly linked list of buffers */
 	struct buffer_head *b_this_page;/* circular list of buffers in one page */
-	struct buffer_head *b_reqnext;	/* request queue */
 
 	struct buffer_head **b_pprev;	/* doubly linked list of hash-queue */
-	char * b_data;			/* pointer to data block (512 byte) */
-	struct page *b_page;		/* the page this bh is mapped to */
-	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
- 	void *b_private;		/* reserved for b_end_io */
 
-	unsigned long b_rsector;	/* Real buffer location on disk */
 	wait_queue_head_t b_wait;
 
 	struct inode *	     b_inode;
--- ./drivers/md/raid5.c	2001/02/06 05:43:31	1.2
+++ ./drivers/md/raid5.c	2001/02/07 23:15:36
@@ -151,18 +151,16 @@
 
 	for (i=0; i<num; i++) {
 		struct page *page;
-		bh = kmalloc(sizeof(struct buffer_head), priority);
+		bh = kmalloc(sizeof(struct io_head), priority);
 		if (!bh)
 			return 1;
-		memset(bh, 0, sizeof (struct buffer_head));
-		init_waitqueue_head(&bh->b_wait);
+		memset(bh, 0, sizeof (struct io_head));
 		page = alloc_page(priority);
 		bh->b_data = page_address(page);
 		if (!bh->b_data) {
 			kfree(bh);
 			return 1;
 		}
-		atomic_set(&bh->b_count, 0);
 		bh->b_page = page;
 		sh->bh_cache[i] = bh;
 
@@ -412,7 +410,7 @@
 			spin_lock_irqsave(&conf->device_lock, flags);
 		}
 	} else {
-		md_error(mddev_to_kdev(conf->mddev), bh->b_dev);
+		md_error(mddev_to_kdev(conf->mddev), conf->disks[i].dev);
 		clear_bit(BH_Uptodate, &bh->b_state);
 	}
 	clear_bit(BH_Lock, &bh->b_state);
@@ -440,7 +438,7 @@
 
 	md_spin_lock_irqsave(&conf->device_lock, flags);
 	if (!uptodate)
-		md_error(mddev_to_kdev(conf->mddev), bh->b_dev);
+		md_error(mddev_to_kdev(conf->mddev), conf->disks[i].dev);
 	clear_bit(BH_Lock, &bh->b_state);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	__release_stripe(conf, sh);
@@ -456,12 +454,10 @@
 	unsigned long block = sh->sector / (sh->size >> 9);
 
 	init_buffer(bh, raid5_end_read_request, sh);
-	bh->b_dev       = conf->disks[i].dev;
 	bh->b_blocknr   = block;
 
 	bh->b_state	= (1 << BH_Req) | (1 << BH_Mapped);
 	bh->b_size	= sh->size;
-	bh->b_list	= BUF_LOCKED;
 	return bh;
 }
 
@@ -1085,15 +1081,14 @@
 			else
 				bh->b_end_io = raid5_end_write_request;
 			if (conf->disks[i].operational)
-				bh->b_dev = conf->disks[i].dev;
+				bh->b_rdev = conf->disks[i].dev;
 			else if (conf->spare && action[i] == WRITE+1)
-				bh->b_dev = conf->spare->dev;
+				bh->b_rdev = conf->spare->dev;
 			else skip=1;
 			if (!skip) {
 				PRINTK("for %ld schedule op %d on disc %d\n", sh->sector, action[i]-1, i);
 				atomic_inc(&sh->count);
-				bh->b_rdev = bh->b_dev;
-				bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
+				bh->b_rsector = sh->sector;
 				generic_make_request(action[i]-1, bh);
 			} else {
 				PRINTK("skip op %d on disc %d for sector %ld\n", action[i]-1, i, sh->sector);
@@ -1502,7 +1497,7 @@
 	}
 
 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
-		 conf->raid_disks * ((sizeof(struct buffer_head) + PAGE_SIZE))) / 1024;
+		 conf->raid_disks * ((sizeof(struct io_head) + PAGE_SIZE))) / 1024;
 	if (grow_stripes(conf, conf->max_nr_stripes, GFP_KERNEL)) {
 		printk(KERN_ERR "raid5: couldn't allocate %dkB for buffers\n", memory);
 		shrink_stripes(conf, conf->max_nr_stripes);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 22:26                                               ` Linus Torvalds
  2001-02-06 21:13                                                 ` Marcelo Tosatti
@ 2001-02-07 23:15                                                 ` Pavel Machek
  2001-02-08 13:22                                                   ` Stephen C. Tweedie
  2001-02-08 14:52                                                   ` Mikulas Patocka
  1 sibling, 2 replies; 124+ messages in thread
From: Pavel Machek @ 2001-02-07 23:15 UTC (permalink / raw)
  To: Linus Torvalds, Jens Axboe
  Cc: Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

Hi!

> > > Reading write(2): 
> > > 
> > >        EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and there was
> > >               no room in the pipe or socket connected to fd to  write  the data
> > >               immediately.
> > > 
> > > I see no reason why "aio function have to block waiting for requests". 
> > 
> > That was my reasoning too with READA etc, but Linus seems to want that we
> > can block while submitting the I/O (as throttling, Linus?) just not
> > until completion.
> 
> Note the "in the pipe or socket" part.
>                  ^^^^    ^^^^^^
> 
> EAGAIN is _not_ a valid return value for block devices or for regular
> files. And in fact it _cannot_ be, because select() is defined to always
> return 1 on them - so if a write() were to return EAGAIN, user space would
> have nothing to wait on. Busy waiting is evil.

So you consider the inability to select() on regular files a _feature_?

It can be a pretty serious problem with slow block devices
(floppy). It also hurts when you are trying to do high-performance
reads/writes. [I know it hurt in the userspace sherlock search engine --
a kind of small altavista.]

How do you write a high-performance ftp server without threads if select
on a regular file always returns "ready"?
 

> Remember: in the end you HAVE to wait somewhere. You're always going to be
> able to generate data faster than the disk can take it. SOMETHING

Userspace wants to _know_ when to stop. It asks politely using
"select()".
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 19:12                                             ` Richard Gooch
@ 2001-02-07 20:03                                               ` Stephen C. Tweedie
  0 siblings, 0 replies; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07 20:03 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Stephen C. Tweedie, Linus Torvalds, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

Hi,

On Wed, Feb 07, 2001 at 12:12:44PM -0700, Richard Gooch wrote:
> Stephen C. Tweedie writes:
> > 
> > Sorry?  I'm not sure where communication is breaking down here, but
> > we really don't seem to be talking about the same things.  SGI's
> > kiobuf request patches already let us pass a large IO through the
> > request layer in a single unit without having to split it up to
> > squeeze it through the API.
> 
> Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you
> don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB
> buffer_head is effectively the same thing.

kiobufs let you encode _any_ contiguous region of user VA or of an
inode's page cache contents in one kiobuf, no matter how many pages
there are in it.  A write of a megabyte to a raw device can be encoded
as a single kiobuf if we want to pass the entire 1MB IO down to the
block layers untouched.  That's what the page vector in the kiobuf is
for.
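
(From memory, and trimming fields, the structure is roughly:

    struct kiobuf {
         int            nr_pages;   /* pages actually in use */
         int            array_len;  /* room in the page vector */
         int            offset;     /* byte offset into the first page */
         int            length;     /* total length of the IO */
         struct page  **maplist;    /* the page vector itself */
         /* ... plus locking, error and completion state ... */
    };

so a 1MB IO is just 256 page pointers plus one offset/length pair.)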

Doing the same thing with buffer_heads would still require a couple of
hundred of them, and you'd have to submit each such buffer_head to the
IO subsystem independently.  And then the IO layer will just have to
reassemble them on the other side (and it may have to scan the
device's entire request queue once for every single buffer_head to do
so).

> But an API extension to allow passing a pre-built chain would be even
> better.

Yep.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  2:37                                           ` Linus Torvalds
  2001-02-07 14:52                                             ` Stephen C. Tweedie
@ 2001-02-07 19:12                                             ` Richard Gooch
  2001-02-07 20:03                                               ` Stephen C. Tweedie
  1 sibling, 1 reply; 124+ messages in thread
From: Richard Gooch @ 2001-02-07 19:12 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

Stephen C. Tweedie writes:
> Hi,
> 
> On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote:
> > Absolutely. And this is independent of what kind of interface we end up
> > using, whether it be kiobuf of just plain "struct buffer_head". In that
> > respect they are equivalent.
> 
> Sorry?  I'm not sure where communication is breaking down here, but
> we really don't seem to be talking about the same things.  SGI's
> kiobuf request patches already let us pass a large IO through the
> request layer in a single unit without having to split it up to
> squeeze it through the API.

Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you
don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB
buffer_head is effectively the same thing.

> If you really don't mind the size of the buffer_head as an sg fragment
> header, then at least I'd like us to be able to submit a pre-built
> chain of bh's all at once without having to go through the remap/merge
> cost for each single bh.

Even if you are limited to feeding one buffer_head at a time, the
merge costs should be somewhat mitigated, since you'll decrease your
calls into the API by a factor of 8 or 16.
But an API extension to allow passing a pre-built chain would be even
better.

Hopefully I haven't missed the point. I've got the flu so I'm not
running on all 4 cylinders :-(

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 18:36                                   ` Linus Torvalds
@ 2001-02-07 18:44                                     ` Christoph Hellwig
  2001-02-08  0:34                                     ` Neil Brown
  1 sibling, 0 replies; 124+ messages in thread
From: Christoph Hellwig @ 2001-02-07 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Wed, Feb 07, 2001 at 10:36:47AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 7 Feb 2001, Christoph Hellwig wrote:
> 
> > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> > > 
> > > Actually, they really aren't.
> > > 
> > > They kind of _used_ to be, but more and more they've moved away from that
> > > historical use. Check in particular the page cache, and as a really
> > > extreme case the swap cache version of the page cache.
> > 
> > Yes.  And that's exactly why I think it's ugly to have the left-over
> > caching stuff in the same data structure as the IO buffer.
> 
> I do agree.
> 
> I would not be opposed to factoring out the "pure block IO" part from the
> bh struct. It should not even be very hard. You'd do something like
> 
> 	struct block_io {
> 		.. here is the stuff needed for block IO ..
> 	};
> 
> 	struct buffer_head {
> 		struct block_io io;
> 		.. here is the stuff needed for hashing etc ..
> 	}
> 
> and then you make "generic_make_request()" and everything lower down take
> just the "struct block_io".

Yep. (besides, the name block_io sucks :))

> You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's,
> because they know about bh semantics (ie things like scaling the sector
> number to the bh size etc). Which means that pretty much all the code
> outside the block layer wouldn't even _notice_. Which is a sign of good
> layering.

Yep.

> If you want to do this, please do go ahead.

I'll take a look at it.

> But do realize that this is not exactly a 2.4.x thing ;)

Sure.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07 18:26                                 ` Christoph Hellwig
@ 2001-02-07 18:36                                   ` Linus Torvalds
  2001-02-07 18:44                                     ` Christoph Hellwig
  2001-02-08  0:34                                     ` Neil Brown
  0 siblings, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-07 18:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar



On Wed, 7 Feb 2001, Christoph Hellwig wrote:

> On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> > 
> > Actually, they really aren't.
> > 
> > They kind of _used_ to be, but more and more they've moved away from that
> > historical use. Check in particular the page cache, and as a really
> > extreme case the swap cache version of the page cache.
> 
> Yes.  And that's exactly why I think it's ugly to have the left-over
> caching stuff in the same data structure as the IO buffer.

I do agree.

I would not be opposed to factoring out the "pure block IO" part from the
bh struct. It should not even be very hard. You'd do something like

	struct block_io {
		.. here is the stuff needed for block IO ..
	};

	struct buffer_head {
		struct block_io io;
		.. here is the stuff needed for hashing etc ..
	}

and then you make "generic_make_request()" and everything lower down take
just the "struct block_io".

You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's,
because they know about bh semantics (ie things like scaling the sector
number to the bh size etc). Which means that pretty much all the code
outside the block layer wouldn't even _notice_. Which is a sign of good
layering.
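
Ie roughly -- a sketch, not tested, and assuming "struct block_io"
keeps the b_rsector/b_size fields:

	void submit_bh(int rw, struct buffer_head *bh)
	{
		/* bh semantics stay up here: scale the virtual
		 * block number into the real sector number */
		bh->io.b_rsector = bh->b_blocknr * (bh->io.b_size >> 9);
		generic_make_request(rw, &bh->io);
	}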

If you want to do this, please do go ahead.

But do realize that this is not exactly a 2.4.x thing ;)

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:35                               ` Ingo Molnar
  2001-02-06 19:05                                 ` Marcelo Tosatti
@ 2001-02-07 18:27                                 ` Christoph Hellwig
  1 sibling, 0 replies; 124+ messages in thread
From: Christoph Hellwig @ 2001-02-07 18:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Ben LaHaise, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06, 2001 at 09:35:58PM +0100, Ingo Molnar wrote:
> caching bmap() blocks was a recent addition around 2.3.20, and i suggested
> some time ago to cache pagecache blocks via explicit entries in struct
> page. That would be one solution - but it creates overhead.
> 
but there isn't anything wrong with having the bhs around to cache blocks -
> think of it as a 'cached and recycled IO buffer entry, with the block
> information cached'.

I was not talking about caching physical blocks but the remaining
buffer-cache support stuff.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:59                               ` Linus Torvalds
@ 2001-02-07 18:26                                 ` Christoph Hellwig
  2001-02-07 18:36                                   ` Linus Torvalds
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2001-02-07 18:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> > 
> > The second is that bh's are two things:
> > 
> >  - a cacheing object
> >  - an io buffer
> 
> Actually, they really aren't.
> 
> They kind of _used_ to be, but more and more they've moved away from that
> historical use. Check in particular the page cache, and as a really
> extreme case the swap cache version of the page cache.

Yes.  And that's exactly why I think it's ugly to have the left-over
caching stuff in the same data structure as the IO buffer.

> It certainly _used_ to be true that "bh"s were actually first-class memory
> management citizens, and actually had a data buffer and a cache associated
> with them. And because of that historical baggage, that's how many people
> still think of them.

I do know that the pagecache is our primary cache now :)
Anyway, having that caching cruft still in is ugly.

> > This is not really a clean approach, and I would really like to
> > get away from it.
> 
> Trust me, you really _can_ get away from it. It's not designed into the
> bh's at all. You can already just allocate a single (or multiple) "struct
> buffer_head" and just use them as IO objects, and give them your _own_
> pointers to the IO buffer etc.

So true.  Exactly because of that, the data structures should become
separated as well.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  2:37                                           ` Linus Torvalds
@ 2001-02-07 14:52                                             ` Stephen C. Tweedie
  2001-02-07 19:12                                             ` Richard Gooch
  1 sibling, 0 replies; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07 14:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

Hi,

On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote:
> >
> However, I really _do_ want to have the page cache have a bigger
> granularity than the smallest memory mapping size, and there are always
> special cases that might be able to generate IO in bigger chunks (ie
> in-kernel services etc)

No argument there.

> > Yes.  We still have this fundamental property: if a user sends in a
> > 128kB IO, we end up having to split it up into buffer_heads and doing
> > a separate submit_bh() on each single one.  Given our VM, PAGE_SIZE
> > (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
> > this case.
> 
> Absolutely. And this is independent of what kind of interface we end up
> using, whether it be kiobuf of just plain "struct buffer_head". In that
> respect they are equivalent.

Sorry?  I'm not sure where communication is breaking down here, but
we really don't seem to be talking about the same things.  SGI's
kiobuf request patches already let us pass a large IO through the
request layer in a single unit without having to split it up to
squeeze it through the API.

> > THAT is the overhead that I'm talking about: having to split a large
> > IO into small chunks, each of which just ends up having to be merged
> > back again into a single struct request by the *make_request code.
> 
> You could easily just generate the bh then and there, if you wanted to.

In the current 2.4 tree, we already do: brw_kiovec creates the
temporary buffer_heads on demand to feed them to the IO layers.

> Your overhead comes from the fact that you want to gather the IO together. 

> And I'm saying that you _shouldn't_ gather the IO. There's no point.

I don't --- the underlying layer does.  And that is where the overhead
is: for every single large IO being created by the higher layers,
make_request is doing a dozen or more merges because I can only feed
the IO through make_request in tiny pieces.

> The
> gathering is sufficiently done by the low-level code anyway, and I've
> tried to explain why the low-level code _has_ to do that work regardless
> of what upper layers do.

I know.  The problem is the low-level code doing it a hundred times
for a single injected IO.

> You need to generate a separate sg entry for each page anyway. So why not
> just use the existing one? The "struct buffer_head". Which already
> _handles_ all the issues that you have complained are hard to handle.

Two issues here.  First is that the buffer_head is an enormously
heavyweight object for an sg-list fragment.  It contains a ton of
fields of interest only to the buffer cache.  We could mitigate this
to some extent by ensuring that the relevant fields for IO (rsector,
size, req_next, state, data, page etc) were in a single cache line.

Secondly, the cost of adding each single buffer_head to the request
list is O(n) in the number of requests already on the list.  We end up
walking potentially the entire request queue before finding the
request to merge against, and we do that again and again, once for
every single buffer_head in the list.  We do this even if the caller
went in via a multi-bh ll_rw_block() call in which case we know in
advance that all of the buffer_heads are contiguous on disk.
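
Schematically, this is what every large IO goes through today:

    /* schematic: a 128kB IO becomes 32 4kB buffer_heads, and every
     * one of these calls walks the request queue looking for a merge */
    for (i = 0; i < nr; i++)
        submit_bh(rw, bhs[i]);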


There is a side problem: right now, things like raid remapping occur
during generic_make_request, before we have a request built.  That
means that all of the raid0 remapping or raid1/5 request expanding is
being done on a per-buffer_head, not per-request, basis, so again
we're doing a whole lot of unnecessary duplicate work when an IO
larger than a buffer_head is submitted.


If you really don't mind the size of the buffer_head as an sg fragment
header, then at least I'd like us to be able to submit a pre-built
chain of bh's all at once without having to go through the remap/merge
cost for each single bh.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  9:10                                                   ` David Howells
@ 2001-02-07 12:16                                                     ` Stephen C. Tweedie
  0 siblings, 0 replies; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07 12:16 UTC (permalink / raw)
  To: David Howells; +Cc: Linus Torvalds, Jens Axboe, linux-kernel, kiobuf-io-devel

Hi,

On Wed, Feb 07, 2001 at 09:10:32AM +0000, David Howells wrote:
> 
> I presume that correct_size will always be a power of 2...

Yes.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:45                                                 ` Linus Torvalds
  2001-02-07  1:55                                                   ` Jens Axboe
@ 2001-02-07  9:10                                                   ` David Howells
  2001-02-07 12:16                                                     ` Stephen C. Tweedie
  1 sibling, 1 reply; 124+ messages in thread
From: David Howells @ 2001-02-07  9:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jens Axboe, linux-kernel, kiobuf-io-devel


Linus Torvalds <torvalds@transmeta.com> wrote:
> Actually, I'd rather leave it in, but speed it up with the saner and
> faster
>
>	if (bh->b_size & (correct_size-1)) {

I presume that correct_size will always be a power of 2...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:49                                         ` Stephen C. Tweedie
@ 2001-02-07  2:37                                           ` Linus Torvalds
  2001-02-07 14:52                                             ` Stephen C. Tweedie
  2001-02-07 19:12                                             ` Richard Gooch
  0 siblings, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-07  2:37 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > "struct buffer_head" can deal with pretty much any size: the only thing it
> > cares about is bh->b_size.
> 
> Right now, anything larger than a page is physically non-contiguous,
> and sorry if I didn't make that explicit, but I thought that was
> obvious enough that I didn't need to.  We were talking about raw IO,
> and as long as we're doing IO out of user anonymous data allocated
> from individual pages, buffer_heads are limited to that page size in
> this context.

Sure. That's obviously also one of the reasons why the IO layer has never
seen bigger requests anyway - the data _does_ tend to be fundamentally
broken up into page-size entities, if for no other reason than that is how
user-space sees memory.

However, I really _do_ want to have the page cache have a bigger
granularity than the smallest memory mapping size, and there are always
special cases that might be able to generate IO in bigger chunks (ie
in-kernel services etc)

> Yes.  We still have this fundamental property: if a user sends in a
> 128kB IO, we end up having to split it up into buffer_heads and doing
> a separate submit_bh() on each single one.  Given our VM, PAGE_SIZE
> (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
> this case.

Absolutely. And this is independent of what kind of interface we end up
using, whether it be kiobuf of just plain "struct buffer_head". In that
respect they are equivalent.

> THAT is the overhead that I'm talking about: having to split a large
> IO into small chunks, each of which just ends up having to be merged
> back again into a single struct request by the *make_request code.

You could easily just generate the bh then and there, if you wanted to.

Your overhead comes from the fact that you want to gather the IO together. 

And I'm saying that you _shouldn't_ gather the IO. There's no point. The
gathering is sufficiently done by the low-level code anyway, and I've
tried to explain why the low-level code _has_ to do that work regardless
of what upper layers do.

You need to generate a separate sg entry for each page anyway. So why not
just use the existing one? The "struct buffer_head". Which already
_handles_ all the issues that you have complained are hard to handle.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:08                                               ` Jens Axboe
@ 2001-02-07  2:08                                                 ` Jeff V. Merkey
  0 siblings, 0 replies; 124+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  2:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Wed, Feb 07, 2001 at 02:08:53AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > Adaptec drivers had an oops, and AIC7XXX also had some oopses with it.
> 
> Do you still have this oops?
> 

I can recreate it.  Will work on it tomorrow.  SCI testing today.

Jeff

> -- 
> Jens Axboe
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:06                                               ` Ingo Molnar
  2001-02-07  1:09                                                 ` Jens Axboe
  2001-02-07  1:26                                                 ` Linus Torvalds
@ 2001-02-07  2:07                                                 ` Jeff V. Merkey
  2 siblings, 0 replies; 124+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  2:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Wed, Feb 07, 2001 at 02:06:27AM +0100, Ingo Molnar wrote:
> 
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
> 
> > > I don't see anything that would break doing this, in fact you can
> > > do this as long as the buffers are all at least a multiple of the
> > > block size. All the drivers I've inspected handle this fine, noone
> > > assumes that rq->bh->b_size is the same in all the buffers attached
> > > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > > handle it as well.
> > >
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops, and AIC7XXX also had some oopses with it.
> 
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.
> 
> 	Ingo

The oops was in my code, but was caused by these drivers.  The Adaptec
driver did have an oops at its own code address; AIC7XXX
crashed in my code.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:02                                           ` Jens Axboe
  2001-02-07  1:19                                             ` Linus Torvalds
@ 2001-02-07  2:00                                             ` Jeff V. Merkey
  2001-02-07  1:06                                               ` Ingo Molnar
  2001-02-07  1:08                                               ` Jens Axboe
  1 sibling, 2 replies; 124+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  2:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Wed, Feb 07, 2001 at 02:02:21AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > I remember Linus asking to try this variable buffer head chaining 
> > thing 512-1024-512 kind of stuff several months back, and mixing them to 
> > see what would happen -- result: about half the drivers break with it.
> > The interface allows you to do it, and I've tried it (it works on Andre's
> > drivers, but a lot of SCSI drivers break), but a lot of drivers seem to
> > have assumptions about these things all being the same size in a 
> > buffer head chain. 
> 
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine, noone
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request. This includes SCSI (scsi_lib.c builds sg tables),
> IDE, and the Compaq array + Mylex driver. This mostly leaves the
> "old-style" drivers using CURRENT etc, the kernel helpers for these
> handle it as well.
> 
> So I would appreciate pointers to these devices that break so we
> can inspect them.
> 
> -- 
> Jens Axboe

Adaptec drivers had an oops, and AIC7XXX also had some oopses with it.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:01                                           ` Ingo Molnar
@ 2001-02-07  1:59                                             ` Jeff V. Merkey
  0 siblings, 0 replies; 124+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  1:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

On Wed, Feb 07, 2001 at 02:01:54AM +0100, Ingo Molnar wrote:
> 
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
> 
> > I remember Linus asking to try this variable buffer head chaining
> > thing 512-1024-512 kind of stuff several months back, and mixing them
> > to see what would happen -- result: about half the drivers break with
> > it. [...]
> 
> time to fix them then - instead of rewriting the rest of the kernel ;-)
> 
> 	Ingo

I agree.  

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:45                                                 ` Linus Torvalds
@ 2001-02-07  1:55                                                   ` Jens Axboe
  2001-02-07  9:10                                                   ` David Howells
  1 sibling, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2001-02-07  1:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Tue, Feb 06 2001, Linus Torvalds wrote:
> > > [...] so I would be _really_ nervous about just turning it on
> > > silently. This is all very much a 2.5.x-kind of thing ;)
> > 
> > Then you might want to apply this :-)
> > 
> > --- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
> > +++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
> > @@ -1048,7 +1048,7 @@
> >  	/* Verify requested block sizes. */
> >  	for (i = 0; i < nr; i++) {
> >  		struct buffer_head *bh = bhs[i];
> > -		if (bh->b_size % correct_size) {
> > +		if (bh->b_size != correct_size) {
> >  			printk(KERN_NOTICE "ll_rw_block: device %s: "
> >  			       "only %d-char blocks implemented (%u)\n",
> >  			       kdevname(bhs[0]->b_dev),
> 
> Actually, I'd rather leave it in, but speed it up with the saner and
> faster
> 
> 	if (bh->b_size & (correct_size-1)) {
> 		...
> 
> That way people who _want_ to test the odd-size thing can do so. And
> normal code (that never generates requests on any other size than the
> "native" size) won't ever notice either way.

Fine, as I said I didn't spot anything bad so that's why it was changed.

> (Oh, we'll eventually need to move to "correct_size == hardware
> blocksize", not the "virtual blocksize" that it is now. As it it a tester
> needs to set the soft-blk size by hand now).

Exactly, wrt earlier mail about submitting < hw block size requests to
the lower levels.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:50                                       ` Linus Torvalds
  2001-02-07  1:49                                         ` Stephen C. Tweedie
@ 2001-02-07  1:51                                         ` Jeff V. Merkey
  2001-02-07  1:01                                           ` Ingo Molnar
  2001-02-07  1:02                                           ` Jens Axboe
  1 sibling, 2 replies; 124+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  1:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > 
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests but that relaxing that
> > requirement is no big deal).  Buffer_heads can't deal with data which
> > spans more than a page right now.
> 
> Stephen, you're so full of shit lately that it's unbelievable. You're
> batting a clear 0.000 so far.
> 
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.
> 
> It so happens that if you have highmem support, then "create_bounce()"
> will work on a per-page thing, but that just means that you'd better have
> done your bouncing into low memory before you call generic_make_request().
> 
> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.
> 
> Sure, I would not be surprised if some device driver ends up being
> surpised if you start passing it different request sizes than it is used
> to. But that's a driver and testing issue, nothing more.
> 
> (Which is not to say that "driver and testing" issues aren't important as
> hell: it's one of the more scary things in fact, and it can take a long
> time to get right if you start doing somehting that historically has never
> been done and thus has historically never gotten any testing. So I'm not
> saying that it should work out-of-the-box. But I _am_ saying that there's
> no point in trying to re-design upper layers that already do ALL of this
> with no problems at all).
> 
> 		Linus
> 

I remember Linus asking to try this variable buffer head chaining 
thing 512-1024-512 kind of stuff several months back, and mixing them to 
see what would happen -- result: about half the drivers break with it.
The interface allows you to do it, and I've tried it (it works on Andre's
drivers, but a lot of SCSI drivers break), but a lot of drivers seem to
have assumptions about these things all being the same size in a 
buffer head chain. 

:-)

Jeff


> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:50                                       ` Linus Torvalds
@ 2001-02-07  1:49                                         ` Stephen C. Tweedie
  2001-02-07  2:37                                           ` Linus Torvalds
  2001-02-07  1:51                                         ` Jeff V. Merkey
  1 sibling, 1 reply; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07  1:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

Hi,

On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > 
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests, but relaxing that
> > requirement is no big deal).  Buffer_heads can't deal with data which
> > spans more than a page right now.
> 
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.

Right now, anything larger than a page is physically non-contiguous,
and sorry if I didn't make that explicit, but I thought that was
obvious enough that I didn't need to.  We were talking about raw IO,
and as long as we're doing IO out of user anonymous data allocated
from individual pages, buffer_heads are limited to that page size in
this context.

> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.

Yes.  We still have this fundamental property: if a user sends in a
128kB IO, we end up having to split it up into buffer_heads and doing
a separate submit_bh() on each single one.  Given our VM, PAGE_SIZE
(*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
this case.

THAT is the overhead that I'm talking about: having to split a large
IO into small chunks, each of which just ends up having to be merged
back again into a single struct request by the *make_request code.

A constructed IO request basically doesn't care about anything in the
buffer_head except for the data pointer and size, and the completion
status info and callback.  All of the physical IO description is in
the struct request by this point.  The chain of buffer_heads is
carrying around a huge amount of information which isn't used by the
IO, and if the caller is something like the raw IO driver which isn't
using the buffer cache, that extra buffer_head data is just overhead. 
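
In other words, the part of the buffer_head that a built request
actually consumes boils down to something like this (the name is
invented):

    struct io_frag {
         char           *b_data;     /* data pointer */
         unsigned short  b_size;     /* fragment size */
         unsigned long   b_state;    /* completion status bits */
         void          (*b_end_io)(struct buffer_head *bh, int uptodate);
         void           *b_private;  /* for b_end_io's use */
    };

Everything else is along for the ride.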

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:39                                               ` Jens Axboe
@ 2001-02-07  1:45                                                 ` Linus Torvalds
  2001-02-07  1:55                                                   ` Jens Axboe
  2001-02-07  9:10                                                   ` David Howells
  0 siblings, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-07  1:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel



On Wed, 7 Feb 2001, Jens Axboe wrote:
> 
> > [...] so I would be _really_ nervous about just turning it on
> > silently. This is all very much a 2.5.x-kind of thing ;)
> 
> Then you might want to apply this :-)
> 
> --- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
> +++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
> @@ -1048,7 +1048,7 @@
>  	/* Verify requested block sizes. */
>  	for (i = 0; i < nr; i++) {
>  		struct buffer_head *bh = bhs[i];
> -		if (bh->b_size % correct_size) {
> +		if (bh->b_size != correct_size) {
>  			printk(KERN_NOTICE "ll_rw_block: device %s: "
>  			       "only %d-char blocks implemented (%u)\n",
>  			       kdevname(bhs[0]->b_dev),

Actually, I'd rather leave it in, but speed it up with the saner and
faster

	if (bh->b_size & (correct_size-1)) {
		...

That way people who _want_ to test the odd-size thing can do so. And
normal code (that never generates requests on any other size than the
"native" size) won't ever notice either way.

(Oh, we'll eventually need to move to "correct_size == hardware
blocksize", not the "virtual blocksize" that it is now. As it it a tester
needs to set the soft-blk size by hand now).
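
(The mask test and the modulo test agree exactly when correct_size is
a power of two, which it always is here:

	512 == 1 << 9, so correct_size - 1 == 0x1ff
	(b_size % 512) == 0  <=>  (b_size & 0x1ff) == 0

and the "and" is a lot cheaper than a divide.)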

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:36                                     ` Stephen C. Tweedie
  2001-02-07  0:50                                       ` Linus Torvalds
@ 2001-02-07  1:42                                       ` Jeff V. Merkey
  1 sibling, 0 replies; 124+ messages in thread
From: Jeff V. Merkey @ 2001-02-07  1:42 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel

On Wed, Feb 07, 2001 at 12:36:29AM +0000, Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
> > 
> > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > 
> > > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > > be aligned on disk at a multiple of their buffer size.  Under the Unix
> > > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > > 512 bytes into a device.
> > 
> > then we should either fix this limitation, or the raw IO code should split
> > the request up into several, variable-size bhs, so that the range is
> > filled out optimally with aligned bhs.
> 
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests, but relaxing that
> requirement is no big deal).  Buffer_heads can't deal with data which
> spans more than a page right now.


I can handle requests larger than a page (64K), but I am not using
the buffer cache in Linux.  We really need an NT/NetWare-like model
to support the non-Unix FS's properly.

i.e.   

a disk request should be 

<disk> <lba> <length> <buffer>, and we should get rid of this fixed-block
stuff with buffer heads. :-)
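
Something like this, say (sketch, names invented):

    struct disk_request {
         kdev_t          dev;       /* which disk */
         unsigned long   lba;       /* starting sector */
         unsigned long   length;    /* length in sectors */
         char           *buffer;
    };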

I understand that the way the elevator is implemented in Linux makes
this very hard to support at this point, since it's very troublesome
to handle requests that overlap sector boundaries.

Jeff


> 
> --Stephen
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:27                                     ` Stephen C. Tweedie
@ 2001-02-07  1:40                                       ` Linus Torvalds
  2001-02-12 10:07                                         ` Jamie Lokier
  0 siblings, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2001-02-07  1:40 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > 
> > The fact is, if you have problems like the above, then you don't
> > understand the interfaces. And it sounds like you designed kiobuf support
> > around the wrong set of interfaces.
> 
> They used the only interfaces available at the time...

Ehh.. "generic_make_request()" goes back a _loong_ time. It used to be
called just "make_request()", but all my points still stand.

It's even exported to modules. As far as I know, the raid code has always
used this interface exactly because raid needed to feed back the remapped
stuff and get around the blocksizing in ll_rw_block().

This really isn't anything new. I _know_ it's there in 2.2.x, and I
would not be surprised if it was there even in 2.0.x.

> > If you want to get at the _sector_ level, then you do
> ...
> > which doesn't look all that complicated to me. What's the problem?
> 
> Doesn't this break nastily as soon as the IO hits an LVM or soft raid
> device?  I don't think we are safe if we create a larger-sized
> buffer_head which spans a raid stripe: the raid mapping is only
> applied once per buffer_head.

Absolutely. This is exactly what I mean by saying that low-level drivers
may not actually be able to handle new cases that they've never been asked
to do before - they just never saw anything like a 64kB request before or
something that crossed its own alignment.

But the _higher_ levels are there. And there's absolutely nothing in the
design that is a real problem. But there's no question that you might need
to fix up more than one or two low-level drivers.

(The only drivers I know better are the IDE ones, and as far as I can tell
they'd have no trouble at all with any of this. Most other normal drivers
are likely to be in this same situation. But because I've not had a reason
to test, I certainly won't guarantee even that).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:19                                             ` Linus Torvalds
@ 2001-02-07  1:39                                               ` Jens Axboe
  2001-02-07  1:45                                                 ` Linus Torvalds
  0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2001-02-07  1:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Tue, Feb 06 2001, Linus Torvalds wrote:
> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine, noone
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request.
> 
> It's really easy to get this wrong when going forward in the request list:
> you need to make sure that you update "request->current_nr_sectors" each
> time you move on to the next bh.
> 
> I would not be surprised if some of them have been seriously buggered. 

Maybe have been, but it looks good at least with the general drivers
that I mentioned.

> [...] so I would be _really_ nervous about just turning it on
> silently. This is all very much a 2.5.x-kind of thing ;)

Then you might want to apply this :-)

--- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
+++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
@@ -1048,7 +1048,7 @@
 	/* Verify requested block sizes. */
 	for (i = 0; i < nr; i++) {
 		struct buffer_head *bh = bhs[i];
-		if (bh->b_size % correct_size) {
+		if (bh->b_size != correct_size) {
 			printk(KERN_NOTICE "ll_rw_block: device %s: "
 			       "only %d-char blocks implemented (%u)\n",
 			       kdevname(bhs[0]->b_dev),

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:41                                   ` Linus Torvalds
@ 2001-02-07  1:27                                     ` Stephen C. Tweedie
  2001-02-07  1:40                                       ` Linus Torvalds
  0 siblings, 1 reply; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07  1:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Hi,

On Tue, Feb 06, 2001 at 04:41:21PM -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.
> 
> Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
> traditional block device setup, where "b_blocknr" is the "virtual
> blocknumber" and that indeed is tied in to the block size.
> 
> The fact is, if you have problems like the above, then you don't
> understand the interfaces. And it sounds like you designed kiobuf support
> around the wrong set of interfaces.

They used the only interfaces available at the time...

> If you want to get at the _sector_ level, then you do
...
> which doesn't look all that complicated to me. What's the problem?

Doesn't this break nastily as soon as the IO hits an LVM or soft raid
device?  I don't think we are safe if we create a larger-sized
buffer_head which spans a raid stripe: the raid mapping is only
applied once per buffer_head.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:06                                               ` Ingo Molnar
  2001-02-07  1:09                                                 ` Jens Axboe
@ 2001-02-07  1:26                                                 ` Linus Torvalds
  2001-02-07  2:07                                                 ` Jeff V. Merkey
  2 siblings, 0 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-07  1:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeff V. Merkey, Jens Axboe, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel



On Wed, 7 Feb 2001, Ingo Molnar wrote:
> 
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I'm not sure. If I was a driver writer (and I'm happy those days are
mostly behind me ;), I would not be totally dis-inclined to check for
various limits and things.

There can be hardware out there that simply has trouble with non-native
alignment, ie be unhappy about getting a 1kB request that is aligned in
memory at a 512-byte boundary. So there are real reasons why drivers might
need updating. Don't dismiss the concerns out-of-hand.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:02                                           ` Jens Axboe
@ 2001-02-07  1:19                                             ` Linus Torvalds
  2001-02-07  1:39                                               ` Jens Axboe
  2001-02-07  2:00                                             ` Jeff V. Merkey
  1 sibling, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2001-02-07  1:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff V. Merkey, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel



On Wed, 7 Feb 2001, Jens Axboe wrote:
> 
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine; no one
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request.

It's really easy to get this wrong when going forward in the request list:
you need to make sure that you update "request->current_nr_sectors" each
time you move on to the next bh.
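
With the stock 2.4 helpers doing that bookkeeping, a sketch of the safe
pattern looks like this (do_hw_transfer() is a stand-in for the driver's
real data-moving code, and a real driver would drive this from its
interrupt handler rather than a straight loop):

	/* stand-in for the actual hardware IO */
	extern void do_hw_transfer(unsigned long sector, char *buf,
				   unsigned long bytes);

	static void example_do_request(struct request *rq)
	{
		for (;;) {
			/* current_nr_sectors is valid only for the current bh */
			do_hw_transfer(rq->sector, rq->buffer,
				       rq->current_nr_sectors << 9);

			/*
			 * end_that_request_first() completes the current bh
			 * and recomputes rq->current_nr_sectors, rq->buffer
			 * and rq->sector from the next one - exactly the
			 * update that is easy to forget when done by hand.
			 */
			if (!end_that_request_first(rq, 1, "example"))
				break;
		}
		end_that_request_last(rq);
	}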

I would not be surprised if some of them have been seriously buggered. 

On the other hand, I would _also_ not be surprised if we've actually fixed
a lot of them: one of the things that the RAID code and loopback testing
exercise is exactly getting these kinds of issues right (not this exact
one, but similar ones).

And let's remember things like the old ultrastor driver that was totally
unable to handle anything but 1kB devices etc. I would not be _totally_
surprised if it turns out that there are still drivers out there that
remember the time when Linux only ever had 1kB buffers. Even if it is 7
years ago or so ;)

(Also, there might be drivers that are "optimized" - they set the IO
length once per request, and just never set it again as they do partial
end_io() calls. None of those kinds of issues would ever be found under
normal load, so I would be _really_ nervous about just turning it on
silently. This is all very much a 2.5.x kind of thing ;))

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:09                                                 ` Jens Axboe
@ 2001-02-07  1:11                                                   ` Ingo Molnar
  0 siblings, 0 replies; 124+ messages in thread
From: Ingo Molnar @ 2001-02-07  1:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff V. Merkey, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel


On Wed, 7 Feb 2001, Jens Axboe wrote:

> > > Adaptec drivers had an oops.  Also, AIC7XXX also had some oops with it.
> >
> > most likely some coding error on your side. buffer-size mismatches should
> > show up as filesystem corruption or random DMA scribble, not in-driver
> > oopses.
>
> I would suspect so, aic7xxx shouldn't care about anything except the
> sg entries and I would seriously doubt that it makes any such
> assumptions on them :-)

yep - and not a single reference to b_size in aic7xxx.c.

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:06                                               ` Ingo Molnar
@ 2001-02-07  1:09                                                 ` Jens Axboe
  2001-02-07  1:11                                                   ` Ingo Molnar
  2001-02-07  1:26                                                 ` Linus Torvalds
  2001-02-07  2:07                                                 ` Jeff V. Merkey
  2 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2001-02-07  1:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeff V. Merkey, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Wed, Feb 07 2001, Ingo Molnar wrote:
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops.  Also, AIC7XXX also had some oops with it.
> 
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I would suspect so, aic7xxx shouldn't care about anything except the
sg entries and I would seriously doubt that it makes any such
assumptions on them :-)

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  2:00                                             ` Jeff V. Merkey
  2001-02-07  1:06                                               ` Ingo Molnar
@ 2001-02-07  1:08                                               ` Jens Axboe
  2001-02-07  2:08                                                 ` Jeff V. Merkey
  1 sibling, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2001-02-07  1:08 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> Adaptec drivers had an oops.  Also, AIC7XXX also had some oops with it.

Do you still have this oops?

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  2:00                                             ` Jeff V. Merkey
@ 2001-02-07  1:06                                               ` Ingo Molnar
  2001-02-07  1:09                                                 ` Jens Axboe
                                                                   ` (2 more replies)
  2001-02-07  1:08                                               ` Jens Axboe
  1 sibling, 3 replies; 124+ messages in thread
From: Ingo Molnar @ 2001-02-07  1:06 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Jens Axboe, Linus Torvalds, Stephen C. Tweedie, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel


On Tue, 6 Feb 2001, Jeff V. Merkey wrote:

> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine; no one
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > handle it as well.
> >
> > So I would appreciate pointers to these devices that break so we
> > can inspect them.
> >
> > --
> > Jens Axboe
>
> Adaptec drivers had an oops.  Also, AIC7XXX also had some oops with it.

most likely some coding error on your side. buffer-size mismatches should
show up as filesystem corruption or random DMA scribble, not in-driver
oopses.

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:51                                         ` Jeff V. Merkey
  2001-02-07  1:01                                           ` Ingo Molnar
@ 2001-02-07  1:02                                           ` Jens Axboe
  2001-02-07  1:19                                             ` Linus Torvalds
  2001-02-07  2:00                                             ` Jeff V. Merkey
  1 sibling, 2 replies; 124+ messages in thread
From: Jens Axboe @ 2001-02-07  1:02 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Linus Torvalds, Stephen C. Tweedie, Ingo Molnar, Ben LaHaise,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> I remember Linus asking to try this variable buffer head chaining 
> thing 512-1024-512 kind of stuff several months back, and mixing them to 
> see what would happen -- result.  About half the drivers break with it.  
> The interface allows you to do it, I've tried it, (works on Andre's 
> drivers, but a lot of SCSI drivers break) but a lot of drivers seem to 
> have assumptions about these things all being the same size in a 
> buffer head chain. 

I don't see anything that would break doing this, in fact you can
do this as long as the buffers are all at least a multiple of the
block size. All the drivers I've inspected handle this fine; no one
assumes that rq->bh->b_size is the same in all the buffers attached
to the request. This includes SCSI (scsi_lib.c builds sg tables),
IDE, and the Compaq array + Mylex driver. This mostly leaves the
"old-style" drivers using CURRENT etc, the kernel helpers for these
handle it as well.

So I would appreciate pointers to these devices that break so we
can inspect them.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  1:51                                         ` Jeff V. Merkey
@ 2001-02-07  1:01                                           ` Ingo Molnar
  2001-02-07  1:59                                             ` Jeff V. Merkey
  2001-02-07  1:02                                           ` Jens Axboe
  1 sibling, 1 reply; 124+ messages in thread
From: Ingo Molnar @ 2001-02-07  1:01 UTC (permalink / raw)
  To: Jeff V. Merkey
  Cc: Linus Torvalds, Stephen C. Tweedie, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel


On Tue, 6 Feb 2001, Jeff V. Merkey wrote:

> I remember Linus asking to try this variable buffer head chaining
> thing 512-1024-512 kind of stuff several months back, and mixing them
> to see what would happen -- result: about half the drivers break with
> it. [...]

time to fix them then - instead of rewriting the rest of the kernel ;-)

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:36                                     ` Stephen C. Tweedie
@ 2001-02-07  0:50                                       ` Linus Torvalds
  2001-02-07  1:49                                         ` Stephen C. Tweedie
  2001-02-07  1:51                                         ` Jeff V. Merkey
  2001-02-07  1:42                                       ` Jeff V. Merkey
  1 sibling, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-07  0:50 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> 
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests, but relaxing that
> requirement is no big deal).  Buffer_heads can't deal with data which
> spans more than a page right now.

Stephen, you're so full of shit lately that it's unbelievable. You're
batting a clear 0.000 so far.

"struct buffer_head" can deal with pretty much any size: the only thing it
cares about is bh->b_size.

It so happens that if you have highmem support, then "create_bounce()"
will work on a per-page thing, but that just means that you'd better have
done your bouncing into low memory before you call generic_make_request().

Have you ever spent even just 5 minutes actually _looking_ at the block
device layer, before you decided that you think it needs to be completely
re-done some other way? It appears that you never bothered to.

Sure, I would not be surprised if some device driver ends up being
surprised if you start passing it different request sizes than it is used
to. But that's a driver and testing issue, nothing more.

(Which is not to say that "driver and testing" issues aren't important as
hell: it's one of the more scary things in fact, and it can take a long
time to get right if you start doing something that historically has never
been done and thus has historically never gotten any testing. So I'm not
saying that it should work out-of-the-box. But I _am_ saying that there's
no point in trying to re-design upper layers that already do ALL of this
with no problems at all).

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:25                                   ` Ingo Molnar
  2001-02-07  0:36                                     ` Stephen C. Tweedie
@ 2001-02-07  0:42                                     ` Linus Torvalds
  1 sibling, 0 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-07  0:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel



On Tue, 6 Feb 2001, Ingo Molnar wrote:
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> 
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.  Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
> 
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

As mentioned, no such limitation exists if you just use the right
interfaces.

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:21                                 ` Stephen C. Tweedie
  2001-02-07  0:25                                   ` Ingo Molnar
  2001-02-07  0:35                                   ` Jens Axboe
@ 2001-02-07  0:41                                   ` Linus Torvalds
  2001-02-07  1:27                                     ` Stephen C. Tweedie
  2 siblings, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2001-02-07  0:41 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> 
> On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
> > 
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
> 
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size.

Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
traditional block device setup, where "b_blocknr" is the "virtual
blocknumber" and that indeed is tied in to the block size.

That's the whole _point_ of ll_rw_block() and friends - they show the
device at a different "virtual blocking" level than the low-level physical
accesses necessarily are. Which very much means that if you have a 4kB
"view", of the device, you get a stream of 4kB blocks. Not 4kB sized
blocks at 512-byte offsets (or whatebver the hardware blocking size is).

This way the interfaces are independent of the hardware blocksize. Which
is logical and what you'd expect. You need to go to a lower level to see
those kinds of blocking issues.

But it is _not_ true of "generic_make_request()" and the block IO layer in
general. It obviously _cannot_ be true, because the block I/O layer has
always had the notion of merging consecutive blocks together - regardless
of whether the end result is even a power of two or anything like that in
size. You can make an IO request for pretty much any size, as long as it's
a multiple of the hardware blocksize (normally 512 bytes, but there are
certainly devices out there with other blocksizes).

The fact is, if you have problems like the above, then you don't
understand the interfaces. And it sounds like you designed kiobuf support
around the wrong set of interfaces.

If you want to get at the _sector_ level, then you do

	lock_buffer(bh);	/* the bh must be locked for IO */
	bh->b_rdev = device;
	bh->b_rsector = sector;	/* Linux defines a "sector" as 512 bytes */
	bh->b_size = size;	/* in bytes; must be a multiple of 512 */
	bh->b_data = pointer;
	bh->b_end_io = callback;
	generic_make_request(rw, bh);

which doesn't look all that complicated to me. What's the problem?
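
Fleshed out into a self-contained sketch against the 2.4 interfaces (the
names my_end_io() and my_read_at_sector() are invented for illustration,
and a real caller would still have to handle highmem bouncing and errors):

	/* needs <linux/fs.h> and <linux/locks.h> */
	static void my_end_io(struct buffer_head *bh, int uptodate)
	{
		mark_buffer_uptodate(bh, uptodate);
		unlock_buffer(bh);	/* wakes the wait_on_buffer() below */
	}

	/* read "size" bytes starting at an arbitrary 512-byte sector */
	static int my_read_at_sector(kdev_t dev, unsigned long sector,
				     char *buf, int size)
	{
		struct buffer_head bh;

		memset(&bh, 0, sizeof(bh));
		init_waitqueue_head(&bh.b_wait);
		bh.b_rdev = dev;
		bh.b_rsector = sector;	/* no alignment to b_size required */
		bh.b_size = size;	/* any multiple of 512 */
		bh.b_data = buf;	/* our buffer, not one the bh owns */
		bh.b_end_io = my_end_io;
		bh.b_state = (1 << BH_Mapped) | (1 << BH_Lock);

		generic_make_request(READ, &bh);
		wait_on_buffer(&bh);
		return buffer_uptodate(&bh) ? 0 : -EIO;
	}

(submit_bh(), by contrast, computes b_rsector as b_blocknr * (b_size >> 9),
which is exactly the aligned "virtual blocking" described above.)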

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:25                                   ` Ingo Molnar
@ 2001-02-07  0:36                                     ` Stephen C. Tweedie
  2001-02-07  0:50                                       ` Linus Torvalds
  2001-02-07  1:42                                       ` Jeff V. Merkey
  2001-02-07  0:42                                     ` Linus Torvalds
  1 sibling, 2 replies; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07  0:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Ingo Molnar, Ben LaHaise, Linus Torvalds,
	Alan Cox, Manfred Spraul, Steve Lord, Linux Kernel List,
	kiobuf-io-devel

Hi,

On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> 
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.  Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
> 
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
enforces a single blocksize on all requests, but relaxing that
requirement is no big deal).  Buffer_heads can't deal with data which
spans more than a page right now.

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:21                                 ` Stephen C. Tweedie
  2001-02-07  0:25                                   ` Ingo Molnar
@ 2001-02-07  0:35                                   ` Jens Axboe
  2001-02-07  0:41                                   ` Linus Torvalds
  2 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2001-02-07  0:35 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Wed, Feb 07 2001, Stephen C. Tweedie wrote:
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
> 
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size.  Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

Submitting buffers to lower layers that are not hw sector aligned
can't be supported below ll_rw_blk anyway (they can, but look at the
problems this has always created), and I would much rather see stuff
like this handled outside of there.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-07  0:21                                 ` Stephen C. Tweedie
@ 2001-02-07  0:25                                   ` Ingo Molnar
  2001-02-07  0:36                                     ` Stephen C. Tweedie
  2001-02-07  0:42                                     ` Linus Torvalds
  2001-02-07  0:35                                   ` Jens Axboe
  2001-02-07  0:41                                   ` Linus Torvalds
  2 siblings, 2 replies; 124+ messages in thread
From: Ingo Molnar @ 2001-02-07  0:25 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Ben LaHaise, Linus Torvalds, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel


On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:

> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size.  Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

then we should either fix this limitation, or the raw IO code should split
the request up into several, variable-size bhs, so that the range is
filled out optimally with aligned bhs.

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:57                               ` Ingo Molnar
  2001-02-06 20:07                                 ` Jens Axboe
  2001-02-06 20:25                                 ` Ben LaHaise
@ 2001-02-07  0:21                                 ` Stephen C. Tweedie
  2001-02-07  0:25                                   ` Ingo Molnar
                                                     ` (2 more replies)
  2 siblings, 3 replies; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-07  0:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ben LaHaise, Linus Torvalds, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Hi,

On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
> 
> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

No, it is a problem of the ll_rw_block interface: buffer_heads need to
be aligned on disk at a multiple of their buffer size.  Under the Unix
raw IO interface it is perfectly legal to begin a 128kB IO at offset
512 bytes into a device.

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:13                                                 ` Marcelo Tosatti
@ 2001-02-06 23:26                                                   ` Linus Torvalds
  2001-02-08 15:06                                                     ` Ben LaHaise
  0 siblings, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06 23:26 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Marcelo Tosatti wrote:
> 
> It's arguing against making a smart application block on the disk while
> it's able to use the CPU for other work.

There are currently no other alternatives in user space. You'd have to
create whole new interfaces for aio_read/write, and ways for the kernel to
inform user space that "now you can re-try submitting your IO".

Could be done. But that's a big thing.

> An application which sets non-blocking behavior and busy-waits for a
> request (which seems to be your argument) is just stupid, of course.

Tell me what else it could do at some point? You need something like
select() to wait on it. There are no such interfaces right now...

(besides, latency would suck. I bet you're better off waiting for the
requests if they are all used up. It takes too long to get deep into the
kernel from user space, and you cannot use the exclusive waiters with
their anti-herd behaviour, etc.).

Simple rule: if you want to optimize concurrency and avoid waiting - use
several processes or threads instead. At which point you can get real work
done on multiple CPU's, instead of worrying about what happens when you
have to wait on the disk.

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 22:09                                             ` Jens Axboe
@ 2001-02-06 22:26                                               ` Linus Torvalds
  2001-02-06 21:13                                                 ` Marcelo Tosatti
  2001-02-07 23:15                                                 ` Pavel Machek
  0 siblings, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06 22:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marcelo Tosatti, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Jens Axboe wrote:

> On Tue, Feb 06 2001, Marcelo Tosatti wrote:
> > 
> > Reading write(2): 
> > 
> >        EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and there was
> >               no room in the pipe or socket connected to fd to  write  the data
> >               immediately.
> > 
> > I see no reason why "aio functions have to block waiting for requests".
> 
> That was my reasoning too with READA etc, but Linus seems to want us to
> be able to block while submitting the I/O (as throttling, Linus?), just
> not until completion.

Note the "in the pipe or socket" part.
                 ^^^^    ^^^^^^

EAGAIN is _not_ a valid return value for block devices or for regular
files. And in fact it _cannot_ be, because select() is defined to always
return 1 on them - so if a write() were to return EAGAIN, user space would
have nothing to wait on. Busy waiting is evil.

So READA/WRITEA are only useful inside the kernel, and when the caller has
some data structures of its own that it can use to gracefully handle the
case of a failure - it will try to do the IO later for some reasons, maybe
deciding to do it with blocking because it has nothing better to do at the
later date, or because it decides that it can have only so many
outstanding requests.

Remember: in the end you HAVE to wait somewhere. You're always going to be
able to generate data faster than the disk can take it. SOMETHING has to
throttle - if you don't allow generic_make_request() to throttle, you have
to do it on your own at some point. It is stupid and counter-productive to
argue against throttling. The only argument can be _where_ that throttling
is done, and READA/WRITEA leaves the possibility open of doing it
somewhere else (or just delaying it and letting a future call with
READ/WRITE do the throttling).

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 22:13                                             ` Linus Torvalds
@ 2001-02-06 22:26                                               ` Andre Hedrick
  0 siblings, 0 replies; 124+ messages in thread
From: Andre Hedrick @ 2001-02-06 22:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Manfred Spraul, Jens Axboe, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Linus Torvalds wrote:

> 
> 
> On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > > 
> > > The aio functions should NOT use READA/WRITEA. They should just use the
> > > normal operations, waiting for requests.
> > 
> > But then you end up with lots of threads blocking in get_request()
> 
> So?
> 
> What the HELL do you expect to happen if somebody writes faster than the
> disk can take?
> 
> You don't like busy-waiting. Fair enough.
> 
> So maybe blocking on a wait-queue is the right thing? Just MAYBE?

Did I miss a portion of the thread?
Is the block layer ignoring the status of a device?

--Andre


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:57                                           ` Manfred Spraul
@ 2001-02-06 22:13                                             ` Linus Torvalds
  2001-02-06 22:26                                               ` Andre Hedrick
  0 siblings, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06 22:13 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar



On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > 
> > The aio functions should NOT use READA/WRITEA. They should just use the
> > normal operations, waiting for requests.
> 
> But then you end up with lots of threads blocking in get_request()

So?

What the HELL do you expect to happen if somebody writes faster than the
disk can take?

You don't like busy-waiting. Fair enough.

So maybe blocking on a wait-queue is the right thing? Just MAYBE?

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:16                                           ` Marcelo Tosatti
@ 2001-02-06 22:09                                             ` Jens Axboe
  2001-02-06 22:26                                               ` Linus Torvalds
  0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2001-02-06 22:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linus Torvalds, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar

On Tue, Feb 06 2001, Marcelo Tosatti wrote:
> > > > We don't even need that, non-blocking is implicitly applied with READA.
> > > >
> > > READA just returns - I doubt that the aio functions should poll until
> > > there are free entries in the request queue.
> > 
> > The aio functions should NOT use READA/WRITEA. They should just use the
> > normal operations, waiting for requests. The things that makes them
> > asycnhronous is not waiting for the requests to _complete_. Which you can
> > already do, trivially enough.
> 
> Reading write(2): 
> 
>        EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and there was
>               no room in the pipe or socket connected to fd to  write  the data
>               immediately.
> 
> I see no reason why "aio functions have to block waiting for requests".

That was my reasoning too with READA etc, but Linus seems to want us to
be able to block while submitting the I/O (as throttling, Linus?), just
not until completion.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:42                                         ` Linus Torvalds
  2001-02-06 20:16                                           ` Marcelo Tosatti
@ 2001-02-06 21:57                                           ` Manfred Spraul
  2001-02-06 22:13                                             ` Linus Torvalds
  1 sibling, 1 reply; 124+ messages in thread
From: Manfred Spraul @ 2001-02-06 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Linus Torvalds wrote:
> 
> On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > Jens Axboe wrote:
> > >
> > > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > > a waitqueue address, or a tq_struct pointer).
> > >
> > > We don't even need that, non-blocking is implicitly applied with READA.
> > >
> > READA just returns - I doubt that the aio functions should poll until
> > there are free entries in the request queue.
> 
> The aio functions should NOT use READA/WRITEA. They should just use the
> normal operations, waiting for requests.

But then you end up with lots of threads blocking in get_request()

Quoting Ben's mail:
<<<<<<<<<
> 
> =)  This is what I'm seeing: lots of processes waiting with wchan ==
> __get_request_wait.  With async io and a database flushing lots of io
> asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> very quickly.
> 
>>>>>>>>>

On an io-bound server the request queue is always full - waiting for the
next request might take longer than the actual io.

--
	Manfred

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:26                                       ` Manfred Spraul
@ 2001-02-06 21:42                                         ` Linus Torvalds
  2001-02-06 20:16                                           ` Marcelo Tosatti
  2001-02-06 21:57                                           ` Manfred Spraul
  0 siblings, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06 21:42 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Jens Axboe, Ben LaHaise, Ingo Molnar, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar



On Tue, 6 Feb 2001, Manfred Spraul wrote:
> Jens Axboe wrote:
> > 
> > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > a waitqueue address, or a tq_struct pointer).
> > 
> > We don't even need that, non-blocking is implicitly applied with READA.
> >
> READA just returns - I doubt that the aio functions should poll until
> there are free entries in the request queue.

The aio functions should NOT use READA/WRITEA. They should just use the
normal operations, waiting for requests. The thing that makes them
asynchronous is not waiting for the requests to _complete_. Which you can
already do, trivially enough.

The case for using READA/WRITEA is not that you want to do asynchronous
IO (all Linux IO is asynchronous unless you do extra work), but because
you have a case where you _might_ want to start IO, but if you don't have
a free request slot (ie there's already tons of pending IO happening), you
want the option of doing something else. This is not about aio - with aio
you _need_ to start the IO, you're just not willing to wait for it. 

An example of READA/WRITEA is if you want to do opportunistic dirty page
cleaning - you might not _have_ to clean it up, but you say

 "Hmm.. if you can do this simply without having to wait for other
  requests, start doing the writeout in the background. If not, I'll come
  back to you later after I've done more real work.."

And the Linux block device layer supports both of these kinds of "delayed
IO" already. It's all there. Today.

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:50                                     ` Jens Axboe
@ 2001-02-06 21:26                                       ` Manfred Spraul
  2001-02-06 21:42                                         ` Linus Torvalds
  0 siblings, 1 reply; 124+ messages in thread
From: Manfred Spraul @ 2001-02-06 21:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ben LaHaise, Ingo Molnar, Linus Torvalds, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

Jens Axboe wrote:
> 
> > Several kernel functions need a "dontblock" parameter (or a callback, or
> > a waitqueue address, or a tq_struct pointer).
> 
> We don't even need that, non-blocking is implicitly applied with READA.
>
READA just returns - I doubt that the aio functions should poll until
there are free entries in the request queue.

The pending aio requests should be "included" in the wait_for_requests
waitqueue (OK, they don't have a process context, so a wait queue
entry doesn't help, but these requests belong in that wait queue).

--
	Manfred

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:59                                   ` Ingo Molnar
@ 2001-02-06 21:20                                     ` Steve Lord
  0 siblings, 0 replies; 124+ messages in thread
From: Steve Lord @ 2001-02-06 21:20 UTC (permalink / raw)
  To: mingo
  Cc: Marcelo Tosatti, Christoph Hellwig, Linus Torvalds, Ben LaHaise,
	Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar

> 
> On Tue, 6 Feb 2001, Marcelo Tosatti wrote:
> 
> > Think about a given number of pages which are physically contiguous on
> > disk -- you dont need to cache the block number for each page, you
> > just need to cache the physical block number of the first page of the
> > "cluster".
> 
> ranges are a hell of a lot more trouble to get right than page or
> block-sized objects - and typical access patterns are rarely 'ranged'. As
> long as the basic unit is not 'too small' (ie. not 512 byte, but something
> more sane, like 4096 bytes), i dont think ranging done in higher levels
> buys us anything valuable. And we do ranging at the request layer already
> ... Guess why most CPUs ended up having pages, and not "memory ranges"?
> It's simpler, thus faster in the common case and easier to debug.
> 
> > Usually we need to cache only block information (for clustering), and
> > not all the other stuff which buffer_head holds.
> 
> well, the other issue is that buffer_heads hold buffer-cache details as
> well. But i think it's too small right now to justify any splitup - and
> those issues are related enough to have significant allocation-merging
> effects.
> 
> 	Ingo

Think about it from the point of view of being able to reduce the number of
times you need to talk to the allocator in a filesystem. You can talk to
the allocator about all of your readahead pages in one go, or you can do
things like allocate on flush rather than allocating a page at a time (that is
a bit more complex, but not too much).

Having to talk to the allocator on a page-by-page basis is my pet peeve about
the current mechanisms.

Steve




* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 22:26                                               ` Linus Torvalds
@ 2001-02-06 21:13                                                 ` Marcelo Tosatti
  2001-02-06 23:26                                                   ` Linus Torvalds
  2001-02-07 23:15                                                 ` Pavel Machek
  1 sibling, 1 reply; 124+ messages in thread
From: Marcelo Tosatti @ 2001-02-06 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Manfred Spraul, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Linus Torvalds wrote:

> Remember: in the end you HAVE to wait somewhere. You're always going to be
> able to generate data faster than the disk can take it. SOMETHING has to
> throttle - if you don't allow generic_make_request() to throttle, you have
> to do it on your own at some point. It is stupid and counter-productive to
> argue against throttling. The only argument can be _where_ that throttling
> is done, and READA/WRITEA leaves the possibility open of doing it
> somewhere else (or just delaying it and letting a future call with
> READ/WRITE do the throttling).

It's not "arguing against throttling".

It's arguing against making a smart application block on the disk while
it's able to use the CPU for other work.

An application which sets non-blocking behavior and busy-waits for a
request (which seems to be your argument) is just stupid, of course.




* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:25                             ` Christoph Hellwig
  2001-02-06 20:35                               ` Ingo Molnar
@ 2001-02-06 20:59                               ` Linus Torvalds
  2001-02-07 18:26                                 ` Christoph Hellwig
  1 sibling, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06 20:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar



On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> 
> The second is that bh's are two things:
> 
>  - a cacheing object
>  - an io buffer

Actually, they really aren't.

They kind of _used_ to be, but more and more they've moved away from that
historical use. Check in particular the page cache, and as a really
extreme case the swap cache version of the page cache.

It certainly _used_ to be true that "bh"s were actually first-class memory
management citizens, and actually had a data buffer and a cache associated
with them. And because of that historical baggage, that's how many people
still think of them.

These days, it's really not true any more. A "bh" doesn't really have an
IO buffer intrisically associated with it any more - all memory management
is done on a _page_ level, and it really works the other way around, ie a
page can have one or more bh's associated with it as the IO entity.

This _does_ show up in the bh itself: you find that bh's end up having the
bh->b_page pointer in it, which is really a layering violation these days,
but you'll notice that it's actually not used very much, and it could
probably be largely removed.

The most fundamental use of it (from an IO standpoint) is actually to
handle high memory issues, because high-memory handling is very
fundamentally based on "struct page", and in order to be able to have
high-memory IO buffers you absolutely have to have the "struct page" the
way things are done now.

(all the other uses tend to not be IO-related at all: they are stuff like
the callbacks that want to find the page that should be free'd up)

The other part of "struct bh" is that it _does_ have support for fast
lookups, and the bh hashing. Again, from a pure IO standpoint you can
easily choose to just ignore this. It's often not used at all (in fact,
_most_ bh's aren't hashed, because the only way to find them are through
the page cache).

> This is not really a clean approach, and I would really like to
> get away from it.

Trust me, you really _can_ get away from it. It's not designed into the
bh's at all. You can already just allocate a single (or multiple) "struct
buffer_head" and just use them as IO objects, and give them your _own_
pointers to the IO buffer etc.

In fact, if you look at how the page cache is organized, this is what the
page cache already does. The page cache has its own IO buffer (the page
itself), and it just uses "struct buffer_head" to allocate temporary IO
entities. It _also_ uses the "struct buffer_head" to cache the meta-data
in the sense of having the buffer head also contain the physical address
on disk so that the page cache doesn't have to ask the low-level
filesystem all the time, so in that sense it actually has a double use for
it.

But you can (and _should_) think of that as a "we got the meta-data
address caching for free, and it fit with our historical use, so why not
use it?".

So you can easily do the equivalent of

 - maintain your own buffers (possibly by looking up pages directly from
   user space, if you want to do zero-copy kind of things)

 - allocate a private buffer head ("get_unused_buffer_head()")

 - make that buffer head point into your buffer

 - submit the IO by just calling "submit_bh()", using the b_end_io()
   callback as your way to maintain _your_ IO buffer ownership.

In particular, think of the things that you do NOT have to do:

 - you do NOT have to allocate a bh-private buffer. Just point the bh at
   your own buffer.
 - you do NOT have to "give" your buffer to the bh. You do, of course,

   want to know when the bh is done with _your_ buffer, but that's what
   the b_end_io callback is all about.

 - you do NOT have to hash the bh you allocated and thus expose it to
   anybody else. It is YOUR private bh, and it does not show up on ANY
   other lists. There are various helper functions to insert the bh on
   various global lists ("mark_buffer_dirty()" to put it on the dirty list,
   "buffer_insert_inode_queue()" to put it on the inode lists, etc), but
   there is nothing in the thing that _forces_ you to expose your bh.

So don't think of "bh->b_data" as being something that the bh owns. It's
just a pointer. Think of "bh->b_data" and "bh->b_size" as _nothing_ more
than a data range in memory. 

In short, you can, and often should, think of "struct buffer_head" as
nothing but an IO entity. It has some support for being more than that,
but that's secondary. That can validly be seen as another layer, that is
just so common that there is little point in splitting it up (and a lot of
purely historical reasons for not splitting it).
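
Put together, that recipe comes out to something like the following
sketch (struct my_io, my_end_io() and my_submit() are invented names;
error paths and highmem bouncing are omitted, and kmalloc() stands in
for get_unused_buffer_head(), which is private to fs/buffer.c):

	/* needs <linux/fs.h>, <linux/locks.h> and <linux/slab.h> */
	struct my_io {
		atomic_t pending;		/* bhs still in flight */
		void (*done)(struct my_io *);	/* our OWN completion hook */
	};

	static void my_end_io(struct buffer_head *bh, int uptodate)
	{
		struct my_io *io = bh->b_private;

		unlock_buffer(bh);
		if (atomic_dec_and_test(&io->pending))
			io->done(io);	/* buffer ownership is back with us */
		kfree(bh);		/* never hashed, never exposed */
	}

	static int my_submit(struct my_io *io, kdev_t dev,
			     unsigned long block, char *data, int size)
	{
		struct buffer_head *bh = kmalloc(sizeof(*bh), GFP_KERNEL);

		if (!bh)
			return -ENOMEM;
		memset(bh, 0, sizeof(*bh));
		init_buffer(bh, my_end_io, io);	/* sets b_end_io and b_private */
		bh->b_dev = dev;
		bh->b_blocknr = block;		/* in units of b_size */
		bh->b_size = size;
		bh->b_data = data;		/* OUR buffer - the bh owns nothing */
		bh->b_state = (1 << BH_Mapped) | (1 << BH_Lock);

		atomic_inc(&io->pending);
		submit_bh(WRITE, bh);
		return 0;
	}

(A real user would bias io->pending across a batch of submissions so the
completion hook cannot fire before the last bh has been submitted.)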

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:05                                 ` Marcelo Tosatti
@ 2001-02-06 20:59                                   ` Ingo Molnar
  2001-02-06 21:20                                     ` Steve Lord
  0 siblings, 1 reply; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 20:59 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Hellwig, Linus Torvalds, Ben LaHaise,
	Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Marcelo Tosatti wrote:

> Think about a given number of pages which are physically contiguous on
> disk -- you dont need to cache the block number for each page, you
> just need to cache the physical block number of the first page of the
> "cluster".

ranges are a hell of a lot more trouble to get right than page or
block-sized objects - and typical access patterns are rarely 'ranged'. As
long as the basic unit is not 'too small' (ie. not 512 byte, but something
more sane, like 4096 bytes), i dont think ranging done in higher levels
buys us anything valuable. And we do ranging at the request layer already
... Guess why most CPUs ended up having pages, and not "memory ranges"?
It's simpler, thus faster in the common case and easier to debug.

> Usually we need to cache only block information (for clustering), and
> not all the other stuff which buffer_head holds.

well, the other issue is that buffer_heads hold buffer-cache details as
well. But i think it's too small right now to justify any splitup - and
those issues are related enough to have significant allocation-merging
effects.

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:41                                   ` Manfred Spraul
@ 2001-02-06 20:50                                     ` Jens Axboe
  2001-02-06 21:26                                       ` Manfred Spraul
  0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2001-02-06 20:50 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Ben LaHaise, Ingo Molnar, Linus Torvalds, Stephen C. Tweedie,
	Alan Cox, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06 2001, Manfred Spraul wrote:
> > =)  This is what I'm seeing: lots of processes waiting with wchan ==
> > __get_request_wait.  With async io and a database flushing lots of io
> > asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> > very quickly.
> >
> Does that have anything to do with kiobufs or buffer heads?

Nothing

> Several kernel functions need a "dontblock" parameter (or a callback, or
> a waitqueue address, or a tq_struct pointer). 

We don't even need that, non-blocking is implicitly applied with READA.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:25                                 ` Ben LaHaise
  2001-02-06 20:41                                   ` Manfred Spraul
@ 2001-02-06 20:49                                   ` Jens Axboe
  1 sibling, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2001-02-06 20:49 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Linus Torvalds, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06 2001, Ben LaHaise wrote:
> =)  This is what I'm seeing: lots of processes waiting with wchan ==
> __get_request_wait.  With async io and a database flushing lots of io
> asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> very quickly.

You can't do async I/O this way! Going by what Linus said, make submit_bh
return an int telling you if it failed to queue the buffer and use
READA/WRITEA to submit it.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:25                                 ` Ben LaHaise
@ 2001-02-06 20:41                                   ` Manfred Spraul
  2001-02-06 20:50                                     ` Jens Axboe
  2001-02-06 20:49                                   ` Jens Axboe
  1 sibling, 1 reply; 124+ messages in thread
From: Manfred Spraul @ 2001-02-06 20:41 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Linus Torvalds, Stephen C. Tweedie, Alan Cox,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

Ben LaHaise wrote:
> 
> On Tue, 6 Feb 2001, Ingo Molnar wrote:
> 
> >
> > On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > > This small correction is the crux of the problem: if it blocks, it
> > > takes away from the ability of the process to continue doing useful
> > > work.  If it returns -EAGAIN, then that's okay, the io will be
> > > resubmitted later when other disk io has completed.  But, it should be
> > > possible to continue servicing network requests or user io while disk
> > > io is underway.
> >
> > typical blocking point is waiting for page completion, not
> > __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> > increased anytime. If NR_REQUESTS is large enough then think of it as the
> > 'absolute upper limit of doing IO', and think of the blocking as 'the
> > kernel pulling the brakes'.
> 
> =)  This is what I'm seeing: lots of processes waiting with wchan ==
> __get_request_wait.  With async io and a database flushing lots of io
> asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> very quickly.
>
Does that have anything to do with kiobufs or buffer heads?

Several kernel functions need a "dontblock" parameter (or a callback, or
a waitqueue address, or a tq_struct pointer). 

--
	Manfred

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:25                             ` Christoph Hellwig
@ 2001-02-06 20:35                               ` Ingo Molnar
  2001-02-06 19:05                                 ` Marcelo Tosatti
  2001-02-07 18:27                                 ` Christoph Hellwig
  2001-02-06 20:59                               ` Linus Torvalds
  1 sibling, 2 replies; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 20:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Ben LaHaise, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar


On Tue, 6 Feb 2001, Christoph Hellwig wrote:

> The second is that bh's are two things:
>
>  - a cacheing object
>  - an io buffer
>
> This is not really a clean approach, and I would really like to get
> away from it.

caching bmap() blocks was a recent addition around 2.3.20, and i suggested
some time ago to cache pagecache blocks via explicit entries in struct
page. That would be one solution - but it creates overhead.

but there isnt anything wrong with having the bhs around to cache blocks -
think of it as a 'cached and recycled IO buffer entry, with the block
information cached'.

frankly, my quick (and limited) hack to abuse bhs to cache blocks just
cannot be a reason to replace bhs ...

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:49                             ` Ben LaHaise
  2001-02-06 19:57                               ` Ingo Molnar
@ 2001-02-06 20:26                               ` Linus Torvalds
  1 sibling, 0 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06 20:26 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ben LaHaise wrote:
> 
> This small correction is the crux of the problem: if it blocks, it takes
> away from the ability of the process to continue doing useful work.  If it
> returns -EAGAIN, then that's okay, the io will be resubmitted later when
> other disk io has completed.  But, it should be possible to continue
> servicing network requests or user io while disk io is underway.

Ehh..  The support for this is actually all there already. It's just not
used, because nobody asked for it.

Check the "rw_ahead" variable in __make_request(). Notice how it does
everything you ask for.

So remind me again why we should need a whole new interface for something
that already exists but isn't exported because nobody needed it? It got
created for READA, but that isn't used any more.

You could absolutely _trivially_ re-introduce it (along with WRITEA), but
you should probably change the semantics of what happens when it doesn't
get a request. Something like making "submit_bh()" return an error value
for the case, instead of doing "bh->b_end_io(0..)" which is what I think
it does right now. That would make it easier for the submitter to say "oh,
the queue is full".

This is probably all of 5 lines of code.
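
From the submitter's side, the changed semantics would read roughly like
this (a sketch of the *proposed* behaviour, not current code - today
submit_bh() does not return a value, and the helper below is invented):

	bh->b_end_io = my_end_io;	/* async completion callback */
	err = submit_bh(READA, bh);	/* proposed: error out instead of
					 * calling bh->b_end_io(bh, 0) */
	if (err == -EAGAIN)
		defer_for_resubmit(bh);	/* hypothetical helper: retry the
					 * bh when another request completes */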

I really think that people don't give the block device layer enough
credit. Some of it is quite ugly due to 10 years of history, and there is
certainly a lack of some interesting capabilities (there is no "barrier"
operation right now to enforce ordering, for example, and it really would
be sensible to support a wider range of ops than just read/write and
let the ioctl's use it to pass commands too).

These issues are things that I've been discussing with Jens for the last
few months, and are things that he already to some degree has been toying
with, and we already decided to try to do this during 2.5.x.

It's already been a _lot_ of clean-up with the per-queue request lists
etc, and there's more to be done in the cleanup section too. But the fact
is that too many people seem to have ignored the support that IS there,
and that actually works very well indeed - and is very generic.

> > What more do you think your kiobuf's should be able to do?
> 
> That's what my code is doing today.  There are a ton of bh's setup for a
> single kiobuf request that is issued.  For something like a single 256kb
> io, this is the difference between the batched io requests being passed
> into submit_bh fitting in L1 cache and overflowing it.  Resizable bh's
> would certainly improve this.

bh's _are_ resizable. You just change bh->b_size, and you're done.

Of course, you'll need to do your own memory management for the backing
store. The generic bread() etc layer makes memory management simpler by
having just one size per page and making "struct page" their native mm
entity, but that's not really a bh issue - it's a MM issue and stems from
the fact that this is how all traditional block filesystems tend to want
to work.

NOTE! If you do start to resize the buffer heads, please give me a ping.
The code has never actually been _tested_ with anything but 512, 1024,
2048, 4096 and 8192-byte blocks. I would not be surprised at all if some
low-level drivers actually have asserts that the sizes are ones they
"recognize". The generic layer should be happy with anything that is a
multiple of 512, but as with all things, you'll probably find some gotchas
when you actually try something new.
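
(For reference, the resize itself really is that small - a sketch, with
the above caveat that sizes beyond 8192 bytes are untested territory:)

	/* caller-managed backing store, at least bh->b_size bytes long;
	 * the size just has to be a multiple of 512 */
	bh->b_data = my_64k_buffer;
	bh->b_size = 64 * 1024;
	submit_bh(rw, bh);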

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:32                           ` Linus Torvalds
  2001-02-06 19:44                             ` Ingo Molnar
  2001-02-06 19:49                             ` Ben LaHaise
@ 2001-02-06 20:25                             ` Christoph Hellwig
  2001-02-06 20:35                               ` Ingo Molnar
  2001-02-06 20:59                               ` Linus Torvalds
  2 siblings, 2 replies; 124+ messages in thread
From: Christoph Hellwig @ 2001-02-06 20:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Ingo Molnar, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06, 2001 at 11:32:43AM -0800, Linus Torvalds wrote:
> Traditionally, a "bh" is only _used_ for small areas, but that's not a
> "bh" issue, that's a memory management issue. The code should pretty much
> handle the issue of a single 64kB bh pretty much as-is, but nothing
> creates them: the VM layer only creates bh's in sizes ranging from 512
> bytes to a single page.
> 
> The IO layer could do more, but there has yet to be anybody who needed
> more (because once you hit a page-size, you tend to get into
> scatter-gather, so you want to have one bh per area - and let the
> low-level IO level handle the actual merging etc).

Yes.  That's one disadvantage blown away.

The second is that bh's are two things:

 - a caching object
 - an io buffer

This is not really a clean approach, and I would really like to
get away from it.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:57                               ` Ingo Molnar
  2001-02-06 20:07                                 ` Jens Axboe
@ 2001-02-06 20:25                                 ` Ben LaHaise
  2001-02-06 20:41                                   ` Manfred Spraul
  2001-02-06 20:49                                   ` Jens Axboe
  2001-02-07  0:21                                 ` Stephen C. Tweedie
  2 siblings, 2 replies; 124+ messages in thread
From: Ben LaHaise @ 2001-02-06 20:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > This small correction is the crux of the problem: if it blocks, it
> > takes away from the ability of the process to continue doing useful
> > work.  If it returns -EAGAIN, then that's okay, the io will be
> > resubmitted later when other disk io has completed.  But, it should be
> > possible to continue servicing network requests or user io while disk
> > io is underway.
>
> typical blocking point is waiting for page completion, not
> __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> increased anytime. If NR_REQUESTS is large enough then think of it as the
> 'absolute upper limit of doing IO', and think of the blocking as 'the
> kernel pulling the brakes'.

=)  This is what I'm seeing: lots of processes waiting with wchan ==
__get_request_wait.  With async io and a database flushing lots of io
asynchronously spread out across the disk, the NR_REQUESTS limit is hit
very quickly.

> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

True, and in the tests I've run, raw io is using 2KB blocks (same as the
database).

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:16                             ` Ben LaHaise
@ 2001-02-06 20:22                               ` Ingo Molnar
  0 siblings, 0 replies; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 20:22 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> Sure.  General parameters will be as follows (since I think we both have
> access to these machines):
>
> 	- 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a
> 	  base install plus data files).
> 	- data to/from the ram block device must be copied within the ram
> 	  block driver.
> 	- the filesystem used must be ext2.  optimisations to ext2 for
> 	  tweaks to the interface are permitted & encouraged.
>
> The main item I'm interested in is read (page cache cold)/synchronous
> write performance for blocks from 256 bytes to 16MB in powers of two,
> much like what I've done in testing the aio patches that shows where
> improvement in latency is needed. Including a few other items on disk
> like the timings of find/make -s dep/bonnie/dbench is probably worthwhile to show
> changes in throughput. Sound fair?

yep, sounds fair.

	Ingo


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:46                           ` Ingo Molnar
@ 2001-02-06 20:16                             ` Ben LaHaise
  2001-02-06 20:22                               ` Ingo Molnar
  0 siblings, 1 reply; 124+ messages in thread
From: Ben LaHaise @ 2001-02-06 20:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > > > You mentioned non-spindle base io devices in your last message.  Take
> > > > something like a big RAM disk. Now compare kiobuf base io to buffer
> > > > head based io. Tell me which one is going to perform better.
> > >
> > > roughly equal performance when using 4K bhs. And a hell of a lot more
> > > complex and volatile code in the kiobuf case.
> >
> > I'm willing to benchmark you on this.
>
> sure. Could you specify the actual workload, and desired test-setups?

Sure.  General parameters will be as follows (since I think we both have
access to these machines):

	- 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a
	  base install plus data files).
	- data to/from the ram block device must be copied within the ram
	  block driver.
	- the filesystem used must be ext2.  optimisations to ext2 for
	  tweaks to the interface are permitted & encouraged.

The main item I'm interested in is read (page cache cold)/synchronous
write performance for blocks from 256 bytes to 16MB in powers of two, much
like what I've done in testing the aio patches that shows where
improvement in latency is needed.  Including a few other items on disk
like the timings of find/make -s dep/bonnie/dbench is probably worthwhile to show
changes in throughput.  Sound fair?

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 21:42                                         ` Linus Torvalds
@ 2001-02-06 20:16                                           ` Marcelo Tosatti
  2001-02-06 22:09                                             ` Jens Axboe
  2001-02-06 21:57                                           ` Manfred Spraul
  1 sibling, 1 reply; 124+ messages in thread
From: Marcelo Tosatti @ 2001-02-06 20:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Manfred Spraul, Jens Axboe, Ben LaHaise, Ingo Molnar,
	Stephen C. Tweedie, Alan Cox, Steve Lord, Linux Kernel List,
	kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Linus Torvalds wrote:

> 
> 
> On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > Jens Axboe wrote:
> > > 
> > > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > > a waitqueue address, or a tq_struct pointer).
> > > 
> > > We don't even need that, non-blocking is implicitly applied with READA.
> > >
> > READA just returns - I doubt that the aio functions should poll until
> > there are free entries in the request queue.
> 
> The aio functions should NOT use READA/WRITEA. They should just use the
> normal operations, waiting for requests. The thing that makes them
> asynchronous is not waiting for the requests to _complete_. Which you can
> already do, trivially enough.

Reading write(2): 

       EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and there was
              no room in the pipe or socket connected to fd to  write  the data
              immediately.

I see no reason why "aio functions have to block waiting for requests".

_Why_ do they?
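
(The semantics being asked for, in userspace terms - a minimal sketch:)

	#include <errno.h>
	#include <unistd.h>

	/* with O_NONBLOCK set on fd, a full queue means -EAGAIN rather
	 * than a sleeping process; the caller retries from its event loop */
	static ssize_t try_write(int fd, const void *buf, size_t len)
	{
		ssize_t n = write(fd, buf, len);

		if (n < 0 && errno == EAGAIN)
			return 0;	/* no room right now, retry later */
		return n;
	}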

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:57                               ` Ingo Molnar
@ 2001-02-06 20:07                                 ` Jens Axboe
  2001-02-06 20:25                                 ` Ben LaHaise
  2001-02-07  0:21                                 ` Stephen C. Tweedie
  2 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2001-02-06 20:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ben LaHaise, Linus Torvalds, Stephen C. Tweedie, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06 2001, Ingo Molnar wrote:
> > This small correction is the crux of the problem: if it blocks, it
> > takes away from the ability of the process to continue doing useful
> > work.  If it returns -EAGAIN, then that's okay, the io will be
> > resubmitted later when other disk io has completed.  But, it should be
> > possible to continue servicing network requests or user io while disk
> > io is underway.
> 
> typical blocking point is waiting for page completion, not
> __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> increased anytime. If NR_REQUESTS is large enough then think of it as the
> 'absolute upper limit of doing IO', and think of the blocking as 'the
> kernel pulling the brakes'.

Not just __get_request_wait, but also the limit on max locked buffers
in ll_rw_block. Serves the same purpose though, brake effect.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:49                             ` Ben LaHaise
@ 2001-02-06 19:57                               ` Ingo Molnar
  2001-02-06 20:07                                 ` Jens Axboe
                                                   ` (2 more replies)
  2001-02-06 20:26                               ` Linus Torvalds
  1 sibling, 3 replies; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:57 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Linus Torvalds, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> This small correction is the crux of the problem: if it blocks, it
> takes away from the ability of the process to continue doing useful
> work.  If it returns -EAGAIN, then that's okay, the io will be
> resubmitted later when other disk io has completed.  But, it should be
> possible to continue servicing network requests or user io while disk
> io is underway.

typical blocking point is waiting for page completion, not
__wait_request(). But, this is really not an issue, NR_REQUESTS can be
increased anytime. If NR_REQUESTS is large enough then think of it as the
'absolute upper limit of doing IO', and think of the blocking as 'the
kernel pulling the brakes'.

[overhead of 512-byte bhs in the raw IO code is an artificial problem of
the raw IO code.]

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:32                           ` Linus Torvalds
  2001-02-06 19:44                             ` Ingo Molnar
@ 2001-02-06 19:49                             ` Ben LaHaise
  2001-02-06 19:57                               ` Ingo Molnar
  2001-02-06 20:26                               ` Linus Torvalds
  2001-02-06 20:25                             ` Christoph Hellwig
  2 siblings, 2 replies; 124+ messages in thread
From: Ben LaHaise @ 2001-02-06 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Linus Torvalds wrote:

>
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
> > a non-blocking variant that does all of the setup in the caller's context.
> > Yes, I know that we can do it with a kernel thread, but that isn't as
> > clean and it significantly penalises small ios (hint: databases issue
> > *lots* of small random ios and a good chunk of large ios).
>
> Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block() does
> NOT block. Never has. Never will.
>
> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than it
> can take it. Think of it as a "there can only be this many IO's in
> flight")

This small correction is the crux of the problem: if it blocks, it takes
away from the ability of the process to continue doing useful work.  If it
returns -EAGAIN, then that's okay, the io will be resubmitted later when
other disk io has completed.  But, it should be possible to continue
servicing network requests or user io while disk io is underway.

> If you want to use kiobuf's because you think they are asynchronous and
> bh's aren't, then somebody has been feeding you a lot of crap. The kiobuf
> PR department seems to have been working overtime on some FUD strategy.

I'm using bh's to refer to what is currently being done, and kiobuf when
talking about what could be done.  It's probably the wrong thing to do,
and if bh's are extended to operate on arbitrary sized blocks then there
is no difference between the two.

> If you want to make a "raw disk device", you can do so TODAY with bh's.
> How? Don't use "bread()" (which allocates the backing store and creates
> the cache). Allocate a separate anonymous bh (or multiple), and set them
> up to point to whatever data source/sink you have, and let it rip. All
> asynchronous. All with nice completion callbacks. All with existing code,
> no kiobuf's in sight.

> What more do you think your kiobuf's should be able to do?

That's what my code is doing today.  There are a ton of bh's setup for a
single kiobuf request that is issued.  For something like a single 256kb
io, this is the difference between the batched io requests being passed
into submit_bh fitting in L1 cache and overflowing it.  Resizable bh's
would certainly improve this.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11                         ` Ben LaHaise
                                             ` (2 preceding siblings ...)
  2001-02-06 19:32                           ` Linus Torvalds
@ 2001-02-06 19:46                           ` Ingo Molnar
  2001-02-06 20:16                             ` Ben LaHaise
  3 siblings, 1 reply; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:46 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > > You mentioned non-spindle base io devices in your last message.  Take
> > > something like a big RAM disk. Now compare kiobuf base io to buffer
> > > head based io. Tell me which one is going to perform better.
> >
> > roughly equal performance when using 4K bhs. And a hell of a lot more
> > complex and volatile code in the kiobuf case.
>
> I'm willing to benchmark you on this.

sure. Could you specify the actual workload, and desired test-setups?

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:32                           ` Linus Torvalds
@ 2001-02-06 19:44                             ` Ingo Molnar
  2001-02-06 19:49                             ` Ben LaHaise
  2001-02-06 20:25                             ` Christoph Hellwig
  2 siblings, 0 replies; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Linus Torvalds wrote:

> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than
> it can take it. Think of it as a "there can only be this many IO's in
> flight")

yep. The/my goal would be to get some sort of async IO capability that is
able to read the pagecache without holding up the process. And just
because i've already implemented the helper-kernel-thread async IO variant
[in fact what TUX does is that there are per-CPU async IO helper threads,
and we always pick the 'localized' thread, to avoid unnecessary cross-CPU
traffic], i'd like to explore the possibility of getting this done via a
pure, IRQ-driven state-machine - which arguably has the lowest overhead.

but i just cannot find any robust way to do this with ext2fs (or any other
disk-based FS for that matter). The horror scenario: the inode block is
not cached yet, and the block resides behind a triple-indirect block which
triggers 3 other block reads, before the actual data block can be read.
And i definitely do not see why kiobufs would help make this any easier.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:09                 ` Ben LaHaise
@ 2001-02-06 19:35                   ` Jens Axboe
  0 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2001-02-06 19:35 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

On Tue, Feb 06 2001, Ben LaHaise wrote:
> > > As for io completion, can't we just issue separate requests for the
> > > critical data and the readahead?  That way for SCSI disks, the important
> > > io should be finished while the readahead can continue.  Thoughts?
> >
> > Priorities?
> 
> Definitely.  I'd like to be able to issue readaheads with a "don't bother
> executing this request unless the cost is low" bit set.  It might also
> be helpful for heavy multiuser loads (or even a single user with multiple
> processes) to ensure progress is made for others.

And in other contexts too it might be handy to assign priorities to
requests as well. I don't know how sgi plan on handling grio (or already
handle it in irix), maybe Steve can fill us in on that :)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11                         ` Ben LaHaise
  2001-02-06 19:32                           ` Jens Axboe
  2001-02-06 19:32                           ` Ingo Molnar
@ 2001-02-06 19:32                           ` Linus Torvalds
  2001-02-06 19:44                             ` Ingo Molnar
                                               ` (2 more replies)
  2001-02-06 19:46                           ` Ingo Molnar
  3 siblings, 3 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06 19:32 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ben LaHaise wrote:
> 
> s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
> a non-blocking variant that does all of the setup in the caller's context.
> Yes, I know that we can do it with a kernel thread, but that isn't as
> clean and it significantly penalises small ios (hint: databases issue
> *lots* of small random ios and a good chunk of large ios).

Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block() does
NOT block. Never has. Never will.

(Small correction: it doesn't block on anything else than allocating a
request structure if needed, and quite frankly, you have to block
SOMETIME. You can't just try to throw stuff at the device faster than it
can take it. Think of it as a "there can only be this many IO's in
flight")

If you want to use kiobuf's because you think they are asynchronous and
bh's aren't, then somebody has been feeding you a lot of crap. The kiobuf
PR department seems to have been working overtime on some FUD strategy.

The fact is that bh's can do MORE than kiobuf's. They have all the
callbacks in place etc. They merge and sort correctly. Oh, they have
limitations: one "bh" always describes just one memory area with a
"start,len" kind of thing. That's fine - scatter-gather is pushed
downwards, and the upper layers do not even need to know about it. Which
is what layering is all about, after all.

Traditionally, a "bh" is only _used_ for small areas, but that's not a
"bh" issue, that's a memory management issue. The code should pretty much
handle the issue of a single 64kB bh pretty much as-is, but nothing
creates them: the VM layer only creates bh's in sizes ranging from 512
bytes to a single page.

The IO layer could do more, but there has yet to be anybody who needed
more (because once you hit a page-size, you tend to get into
scatter-gather, so you want to have one bh per area - and let the
low-level IO level handle the actual merging etc).

Right now, on many normal setups, the thing that limits our ability to do
big IO requests is actually the fact that IDE cannot do more than 128kB
per request, for example (256 sectors). It's not the bh's or the VM layer.

If you want to make a "raw disk device", you can do so TODAY with bh's.
How? Don't use "bread()" (which allocates the backing store and creates
the cache). Allocate a separate anonymous bh (or multiple), and set them
up to point to whatever data source/sink you have, and let it rip. All
asynchronous. All with nice completion callbacks. All with existing code,
no kiobuf's in sight.
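
(A condensed sketch of that recipe with 2.4-era names - locking details
and error handling trimmed, so treat it as illustrative, not reference
code:)

	static void my_end_io(struct buffer_head *bh, int uptodate)
	{
		/* runs at interrupt time when the IO finishes */
		mark_buffer_uptodate(bh, uptodate);
		unlock_buffer(bh);
	}

	/* one anonymous bh over caller-owned memory: no bread(), no cache */
	static void raw_io(kdev_t dev, unsigned long sector, char *data,
			   int size, int rw)
	{
		struct buffer_head *bh = kmalloc(sizeof(*bh), GFP_KERNEL);

		if (!bh)
			return;
		memset(bh, 0, sizeof(*bh));
		bh->b_rdev    = dev;
		bh->b_rsector = sector;
		bh->b_size    = size;		/* multiple of 512 */
		bh->b_data    = data;		/* our data source/sink */
		bh->b_end_io  = my_end_io;	/* completion callback */
		bh->b_state   = (1 << BH_Mapped) | (1 << BH_Lock);
		submit_bh(rw, bh);		/* returns immediately */
	}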

What more do you think your kiobuf's should be able to do?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11                         ` Ben LaHaise
  2001-02-06 19:32                           ` Jens Axboe
@ 2001-02-06 19:32                           ` Ingo Molnar
  2001-02-06 19:32                           ` Linus Torvalds
  2001-02-06 19:46                           ` Ingo Molnar
  3 siblings, 0 replies; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 19:32 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > > 	- make asynchronous io possible in the block layer.  This is
> > > 	  impossible with the current ll_rw_block scheme and io request
> > > 	  plugging.
> >
> > why is it impossible?
>
> s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to
> have a non-blocking variant that does all of the setup in the caller's
> context. [...]

sorry, but exactly what code are you comparing this to? The aio code you
sent a few days ago does not do this either. (And you did not answer my
questions regarding this issue.) What i saw is some scheme that at a point
relies on keventd (a kernel thread) to do the blocking stuff. [or, unless
i have misread the code, does the ->bmap() synchronously.]

indeed an asynchronous ll_rw_block() is possible and desirable (and not hard
at all - all structures are interrupt-safe already, as opposed to the kiovec
code), but this is only half of the story. The big issue for me is
an async ->bmap(). And we won't access ext2fs data structures from IRQ
handlers anytime soon - so true async IO right now is damn near
impossible. No matter what the IO-submission interface is: kiobufs/kiovecs
or bhs/requests.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 19:11                         ` Ben LaHaise
@ 2001-02-06 19:32                           ` Jens Axboe
  2001-02-06 19:32                           ` Ingo Molnar
                                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2001-02-06 19:32 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Linus Torvalds, Alan Cox,
	Manfred Spraul, Steve Lord, Linux Kernel List, kiobuf-io-devel,
	Ingo Molnar

On Tue, Feb 06 2001, Ben LaHaise wrote:
> > > 	- make asynchronous io possible in the block layer.  This is
> > > 	  impossible with the current ll_rw_block scheme and io request
> > > 	  plugging.
> >
> > why is it impossible?
> 
> s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
> a non-blocking variant that does all of the setup in the caller's context.
> Yes, I know that we can do it with a kernel thread, but that isn't as
> clean and it significantly penalises small ios (hint: databases issue
> *lots* of small random ios and a good chunk of large ios).

So make a non-blocking variant, not a big deal. Users of async I/O
know how to deal with resource limits anyway.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:54                     ` Ben LaHaise
  2001-02-06 18:58                       ` Ingo Molnar
@ 2001-02-06 19:20                       ` Linus Torvalds
  1 sibling, 0 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06 19:20 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ben LaHaise wrote:
> On Tue, 6 Feb 2001, Ingo Molnar wrote:
> 
> > If you are merging based on (device, offset) values, then that's lowlevel
> > - and this is what we have been doing for years.
> >
> > If you are merging based on (inode, offset), then it has flaws like not
> > being able to merge through a loopback or stacked filesystem.
> 
> I disagree.  Loopback filesystems typically have their data contiguously
> on disk and won't split up incoming requests any further.

Face it.

You NEED to merge and sort late. You _cannot_ do a good job early. Early
on, you don't have any concept of what the final IO pattern will be: you
will only have that once you've seen which requests are still pending etc,
something that the higher level layers CANNOT do.

Do you really want the higher levels to know about per-controller request
locking etc? I don't think so. 

Trust me. You HAVE to do the final decisions late in the game. You
absolutely _cannot_ get the best performance except for trivial and
uninteresting cases (ie one process that wants to read gigabytes of data
in one single stream) otherwise.

(It should be pointed out, btw, that SGI etc were often interested exactly
in the trivial and uninteresting cases. When you have the DoD asking you
to stream satellite pictures over the net as fast as you can, money being
no object, you get a rather twisted picture of what is important and what
is not)

And I will turn your own argument against you: if you do merging at a low
level anyway, there's little point in trying to do it at a higher level. 

Higher levels should do high-level sequencing. They can (and should) do
some amount of sorting - the lower levels will still do their own sort as
part of the merging anyway, and the lower level sorting may actually end
up being _different_ from a high-level sort because the lower levels know
about the topology of the device, but higher levels giving data with
"patterns" to it only make it easier for the lower levels to do a good
job. So high-level sorting is not _necessary_, but it's probably a good
idea.

High-level merging is almost certainly not even a good idea - higher
levels should try to _batch_ the requests, but that's a different issue,
and is again all about giving lower levels "patterns". It can also be about
simple issues like cache locality - batching things tends to make for
better icache (and possibly dcache) behaviour.

So you should separate out the issue of batching and merging. And you
absolutely should realize that you should NOT ignore Ingo's arguments
about loopback etc just because they don't fit the model you WANT them to
fit. The fact is that higher levels should NOT know about things like RAID
striping etc, yet that has a HUGE impact on the issue of merging (you do
_not_ want to merge requests to separate disks - you'll just have to split
them up again).

> Here are the points I'm trying to address:
> 
> 	- reduce the overhead in submitting block ios, especially for
> 	  large ios. Look at the %CPU usage differences between 512 byte
> 	  blocks and 4KB blocks, this can be better.

This is often a filesystem layer issue. Design your filesystem well, and
you get a lot of batching for free.

You can also batch the requests - this is basically what "readahead" is.
That helps a lot. But that is NOT the same thing as merging. Not at all.
The "batched" read-ahead requests may actually be split up among many
different disks - and they will each then get separately merged with
_other_ requests to those disks. See?

And trust me, THAT is how you get good performance. Not by merging early.
By merging late, and letting the disk layers do their own thing.

> 	- make asynchronous io possible in the block layer.  This is
> 	  impossible with the current ll_rw_block scheme and io request
> 	  plugging.

I'm surprised you say that. It's not only possible, but we do it all the
time. What do you think the swapout and writing is? How do you think that
read-ahead is actually _implemented_? Right. Read-ahead is NOT done as a
"merge" operation. It's done as several asynchronous IO operations that
the low-level stuff can choose (or not) to merge.

What do you think happens if you do a "submit_bh()"? It's a _purely_
asynchronous operation. It turns synchronous when you wait for the bh, not
before.

Your argument is nonsense.

> 	- provide a generic mechanism for reordering io requests for
> 	  devices which will benefit from this.  Make it a library for
> 	  drivers to call into.  IDE for example will probably make use of
> 	  it, but some high end devices do this on the controller.  This
> 	  is the important point: Make it OPTIONAL.

Ehh. You've just described exactly what we have.

This is what the whole elevator thing _is_. It's a library of routines.
You don't have to use them, and in fact many things DO NOT use them. The
loopback driver, for example, doesn't bother with sorting or merging at
all, because it knows that it's only supposed to pass the request on to
somebody else - who will do a hell of a lot better job of it.

Some high-end drivers have their own merging stuff, exactly because they
don't need the overhead - you're better off just feeding the request to
the controller as soon as you can, as the controller itself will do all
the merging and sorting anyway.

> You mentioned non-spindle base io devices in your last message.  Take
> something like a big RAM disk.  Now compare kiobuf base io to buffer head
> based io.  Tell me which one is going to perform better.

Buffer heads? 

Go and read the code.

Sure, it has some historical baggage still, but the fact is that it works
a hell of a lot better than kiobufs and it _does_ know about merging
multiple requests and handling errors in the middle of one request etc.
You can get the full advantage of streaming megabytes of data in one
request, AND still get proper error handling if it turns out that one
sector in the middle was bad.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:58                       ` Ingo Molnar
@ 2001-02-06 19:11                         ` Ben LaHaise
  2001-02-06 19:32                           ` Jens Axboe
                                             ` (3 more replies)
  0 siblings, 4 replies; 124+ messages in thread
From: Ben LaHaise @ 2001-02-06 19:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > 	- reduce the overhead in submitting block ios, especially for
> > 	  large ios. Look at the %CPU usage differences between 512 byte
> > 	  blocks and 4KB blocks, this can be better.
>
> my system is already submitting 4KB bhs. If anyone's raw-IO setup submits
> 512 byte bhs that's a problem of the raw IO code ...
>
> > 	- make asynchronous io possible in the block layer.  This is
> > 	  impossible with the current ll_rw_block scheme and io request
> > 	  plugging.
>
> why is it impossible?

s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
a non-blocking variant that does all of the setup in the caller's context.
Yes, I know that we can do it with a kernel thread, but that isn't as
clean and it significantly penalises small ios (hint: databases issue
*lots* of small random ios and a good chunk of large ios).

> > You mentioned non-spindle base io devices in your last message.  Take
> > something like a big RAM disk. Now compare kiobuf base io to buffer
> > head based io. Tell me which one is going to perform better.
>
> roughly equal performance when using 4K bhs. And a hell of a lot more
> complex and volatile code in the kiobuf case.

I'm willing to benchmark you on this.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 20:35                               ` Ingo Molnar
@ 2001-02-06 19:05                                 ` Marcelo Tosatti
  2001-02-06 20:59                                   ` Ingo Molnar
  2001-02-07 18:27                                 ` Christoph Hellwig
  1 sibling, 1 reply; 124+ messages in thread
From: Marcelo Tosatti @ 2001-02-06 19:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Linus Torvalds, Ben LaHaise,
	Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ingo Molnar wrote:

> 
> On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> 
> > The second is that bh's are two things:
> >
> >  - a caching object
> >  - an io buffer
> >
> > This is not really a clean approach, and I would really like to get
> > away from it.
> 
> caching bmap() blocks was a recent addition around 2.3.20, and i suggested
> some time ago to cache pagecache blocks via explicit entries in struct
> page. That would be one solution - but it creates overhead.

Think about a given number of pages which are physically contiguous on
disk -- you don't need to cache the block number for each page, you just
need to cache the physical block number of the first page of the
"cluster".

SGI's pagebuf does that, and it would be great if we had something similar
in 2.5. 

It allows us to have fast IO clustering. 
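
(The shape of the idea, with invented names - this is not the pagebuf
code itself:)

	/* one cached disk address covers a whole on-disk-contiguous run:
	 * page k of the cluster starts (PAGE_SIZE / blocksize) * k blocks
	 * past b_start */
	struct page_cluster {
		struct page   *pages[8];   /* e.g. a 32KB run of 4KB pages */
		unsigned long  b_start;    /* disk block of pages[0] */
	};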

> but there isnt anything wrong with having the bhs around to cache blocks -
> think of it as a 'cached and recycled IO buffer entry, with the block
> information cached'.

Usually we need to cache only block information (for clustering), and not
all the other stuff which buffer_head holds.

> frankly, my quick (and limited) hack to abuse bhs to cache blocks just
> cannot be a reason to replace bhs ...


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:54                     ` Ben LaHaise
@ 2001-02-06 18:58                       ` Ingo Molnar
  2001-02-06 19:11                         ` Ben LaHaise
  2001-02-06 19:20                       ` Linus Torvalds
  1 sibling, 1 reply; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 18:58 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> 	- reduce the overhead in submitting block ios, especially for
> 	  large ios. Look at the %CPU usage differences between 512 byte
> 	  blocks and 4KB blocks, this can be better.

my system is already submitting 4KB bhs. If anyone's raw-IO setup submits
512 byte bhs that's a problem of the raw IO code ...

> 	- make asynchronous io possible in the block layer.  This is
> 	  impossible with the current ll_rw_block scheme and io request
> 	  plugging.

why is it impossible?

> You mentioned non-spindle base io devices in your last message.  Take
> something like a big RAM disk. Now compare kiobuf base io to buffer
> head based io. Tell me which one is going to perform better.

roughly equal performance when using 4K bhs. And a hell of a lot more
complex and volatile code in the kiobuf case.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:35                   ` Ingo Molnar
@ 2001-02-06 18:54                     ` Ben LaHaise
  2001-02-06 18:58                       ` Ingo Molnar
  2001-02-06 19:20                       ` Linus Torvalds
  0 siblings, 2 replies; 124+ messages in thread
From: Ben LaHaise @ 2001-02-06 18:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> If you are merging based on (device, offset) values, then that's lowlevel
> - and this is what we have been doing for years.
>
> If you are merging based on (inode, offset), then it has flaws like not
> being able to merge through a loopback or stacked filesystem.

I disagree.  Loopback filesystems typically have their data contiguously
on disk and won't split up incoming requests any further.

Here are the points I'm trying to address:

	- reduce the overhead in submitting block ios, especially for
	  large ios. Look at the %CPU usage differences between 512 byte
	  blocks and 4KB blocks, this can be better.
	- make asynchronous io possible in the block layer.  This is
	  impossible with the current ll_rw_block scheme and io request
	  plugging.
	- provide a generic mechanism for reordering io requests for
	  devices which will benefit from this.  Make it a library for
	  drivers to call into.  IDE for example will probably make use of
	  it, but some high end devices do this on the controller.  This
	  is the important point: Make it OPTIONAL.

You mentioned non-spindle base io devices in your last message.  Take
something like a big RAM disk.  Now compare kiobuf base io to buffer head
based io.  Tell me which one is going to perform better.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:25                 ` Ben LaHaise
@ 2001-02-06 18:35                   ` Ingo Molnar
  2001-02-06 18:54                     ` Ben LaHaise
  0 siblings, 1 reply; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 18:35 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > - higher levels do not have the kind of state to eg. merge requests done
> >   by different users. The only chance for merging is often the lowest
> >   level, where we already know what disk, which sector.
>
> That's what a readaround buffer is for, [...]

If you are merging based on (device, offset) values, then that's lowlevel
- and this is what we have been doing for years.

If you are merging based on (inode, offset), then it has flaws like not
being able to merge through a loopback or stacked filesystem.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:22             ` Christoph Hellwig
@ 2001-02-06 18:26               ` Stephen C. Tweedie
  0 siblings, 0 replies; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-06 18:26 UTC (permalink / raw)
  To: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise,
	Ingo Molnar

Hi,

On Tue, Feb 06, 2001 at 06:22:58PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote:
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome.  How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
> 
> The object passed down to the low-level driver just needs to be able
> to contain multiple end-io callbacks.  The decision of what to call when
> some of the scatter-gather entities fail is of course not so easy to
> handle and needs further discussion.

Umm, and if you want the separate higher-level IOs to be told which
IOs succeeded and which ones failed on error, you need to associate
each of the multiple completion callbacks with its particular
scatter-gather fragment or fragments.  So you end up with the same
sort of kiobuf/kiovec concept where you have chains of sg chunks, each
chunk with its own completion information.

This is *precisely* what I've been trying to get people to address.
Forget whether the individual sg fragments are based on pages or not:
if you want to have IO merging and accurate completion callbacks, you
need not just one sg list but multiple lists each with a separate
callback header.
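
(Schematically, with invented names, the structure being described is:)

	struct sg_frag {			/* one memory area */
		struct page	*page;
		unsigned int	offset, length;
	};

	struct sg_chunk {			/* one higher-level IO */
		struct sg_frag	*frags;		/* its sg pieces */
		int		nr_frags;
		void		(*end_io)(struct sg_chunk *, int uptodate);
		void		*private;	/* caller's completion data */
		struct sg_chunk	*next;		/* chain built up by merging */
	};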

Abandon the merging of sg-list requests (by moving that functionality
into the higher-level layers) and that problem disappears: flat
sg-lists will then work quite happily at the request layer.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:18               ` Ingo Molnar
@ 2001-02-06 18:25                 ` Ben LaHaise
  2001-02-06 18:35                   ` Ingo Molnar
  0 siblings, 1 reply; 124+ messages in thread
From: Ben LaHaise @ 2001-02-06 18:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> - higher levels do not have the kind of state to eg. merge requests done
>   by different users. The only chance for merging is often the lowest
>   level, where we already know what disk, which sector.

That's what a readaround buffer is for, and I suspect that readaround will
give use a big performance boost.

> - merging is not even *required* for some devices - and chances are high
>   that we'll get away from this inefficient and unreliable 'rotating array
>   of disks' business of storing bulk data in this century. (solid state
>   disks, holographic storage, whatever.)

Interesting that you've brought up this point, as it's an example

> i'm truly shocked that you and Stephen are both saying this.

Merging != sorting.  Sorting of requests has to be carried out at the
lower layers, and the specific block device should be able to choose the
Right Thing To Do for the next item in a chain of sequential requests.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:37             ` Ben LaHaise
  2001-02-06 18:00               ` Jens Axboe
  2001-02-06 18:14               ` Linus Torvalds
@ 2001-02-06 18:18               ` Ingo Molnar
  2001-02-06 18:25                 ` Ben LaHaise
  2 siblings, 1 reply; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06 18:18 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, Linux Kernel List, kiobuf-io-devel, Ingo Molnar


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired. [...]

this is just so incorrect that it's not funny anymore.

- higher levels just do not have the kind of knowledge lower levels have.

- merging decisions are often not even *deterministic*.

- higher levels do not have the kind of state to eg. merge requests done
  by different users. The only chance for merging is often the lowest
  level, where we already know what disk, which sector.

- merging is not even *required* for some devices - and chances are high
  that we'll get away from this inefficient and unreliable 'rotating array
  of disks' business of storing bulk data in this century. (solid state
  disks, holographic storage, whatever.)

i'm truly shocked that you and Stephen are both saying this.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:37             ` Ben LaHaise
  2001-02-06 18:00               ` Jens Axboe
@ 2001-02-06 18:14               ` Linus Torvalds
  2001-02-08 11:21                 ` Andi Kleen
  2001-02-08 14:11                 ` Martin Dalecki
  2001-02-06 18:18               ` Ingo Molnar
  2 siblings, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06 18:14 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ingo Molnar



On Tue, 6 Feb 2001, Ben LaHaise wrote:
> 
> On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> 
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome.  How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
> 
> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired.

I will claim that you CANNOT merge at higher levels and get good
performance.

Sure, you can do read-ahead, and try to get big merges that way at a high
level. Good for you.

But you'll have a bitch of a time trying to merge multiple
threads/processes reading from the same area on disk at roughly the same
time. Your higher levels won't even _know_ that there is merging to be
done until the IO requests hit the wall in waiting for the disk.

Quite frankly, this whole discussion sounds worthless. We have solved this
problem already: it's called a "buffer head". Deceptively simple at higher
levels, and lower levels can easily merge them together into chains and do
fancy scatter-gather structures of them that can be dynamically extended
at any time.

The buffer heads together with "struct request" do a hell of a lot more
than just a simple scatter-gather: it's able to create ordered lists of
independent sg-events, together with full call-backs etc. They are
low-cost, fairly efficient, and they have worked beautifully for years. 

The fact that kiobufs can't be made to do the same thing is somebody else's
problem. I _know_ that merging has to happen late, and if others are
hitting their heads against this issue until they turn silly, then that's
their problem. You'll eventually learn, or you'll hit your heads into a
pulp. 

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 18:00               ` Jens Axboe
@ 2001-02-06 18:09                 ` Ben LaHaise
  2001-02-06 19:35                   ` Jens Axboe
  0 siblings, 1 reply; 124+ messages in thread
From: Ben LaHaise @ 2001-02-06 18:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

On Tue, 6 Feb 2001, Jens Axboe wrote:

> Stephen already covered this point: the merging is not a problem
> to deal with for read-ahead. The underlying system can easily

I just wanted to make sure that was clear =)

> queue that in nice big chunks. Delayed allocation makes it
> easier to flush big chunks as well. I seem to recall the xfs people
> having problems with the lack of merging causing a performance hit
> on smaller I/O.

That's where readaround buffers come into play.  If we have a fixed number
of readaround buffers that are used when small ios are issued, they should
provide a low overhead means of substantially improving things like find
(which reads many nearby inodes out of order but sequentially).  I need to
implement this and get cache hit rates for various workloads. ;-)
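
A minimal sketch of the readaround idea (the pool size, window size and
disk_read() primitive are all invented for illustration, not from any
patch): keep a handful of fixed-size windows, fill one with a large
aligned read on a miss, and serve nearby small reads from it.

#include <string.h>

#define NR_WINDOWS	8
#define WINDOW		64	/* sectors per readaround window */

struct window {
	long start;		/* first sector covered, -1 if empty */
	char data[WINDOW * 512];
};

static struct window pool[NR_WINDOWS];
static int next_victim;

extern void disk_read(long sector, int count, char *buf);	/* assumed */

static void pool_init(void)
{
	int i;

	for (i = 0; i < NR_WINDOWS; i++)
		pool[i].start = -1;	/* nothing cached yet */
}

/* Read one sector, going to disk only on a pool miss. */
static void read_sector(long sector, char *out)
{
	long base = sector - (sector % WINDOW);	/* align the window */
	int i;

	for (i = 0; i < NR_WINDOWS; i++)
		if (pool[i].start == base)
			goto hit;

	i = next_victim;	/* dumb rotating replacement */
	next_victim = (next_victim + 1) % NR_WINDOWS;
	pool[i].start = base;
	disk_read(base, WINDOW, pool[i].data);	/* one big sequential read */
hit:
	memcpy(out, pool[i].data + (sector - base) * 512, 512);
}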

> Of course merging doesn't have to happen in ll_rw_blk.
>
> > As for io completion, can't we just issue separate requests for the
> > critical data and the readahead?  That way for SCSI disks, the important
> > io should be finished while the readahead can continue.  Thoughts?
>
> Priorities?

Definitely.  I'd like to be able to issue readaheads with a "don't bother
executing this request unless the cost is low" bit set.  It might also
be helpful for heavy multiuser loads (or even a single user with multiple
processes) to ensure progress is made for others.
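
What such a bit might look like (purely illustrative; the flag name and
the "cheap" test are made up, not from any patch): the queueing layer
simply drops an optional request when the device is already busy,
completing it immediately with "not done".

#define REQ_OPTIONAL	0x01	/* readahead the driver may discard */

struct io_req {
	int flags;
	long sector;
	int nr_sectors;
	void (*end_io)(struct io_req *req, int uptodate);
};

static int queue_depth;		/* requests currently outstanding */
#define CHEAP_THRESHOLD	4	/* "low cost" == shallow queue */

static int submit(struct io_req *req)
{
	if ((req->flags & REQ_OPTIONAL) && queue_depth >= CHEAP_THRESHOLD) {
		req->end_io(req, 0);	/* dropped: report "not up to date" */
		return -1;
	}
	queue_depth++;
	/* ... pass the request on to the elevator as usual ... */
	return 0;
}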

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:37             ` Ben LaHaise
@ 2001-02-06 18:00               ` Jens Axboe
  2001-02-06 18:09                 ` Ben LaHaise
  2001-02-06 18:14               ` Linus Torvalds
  2001-02-06 18:18               ` Ingo Molnar
  2 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2001-02-06 18:00 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ingo Molnar

On Tue, Feb 06 2001, Ben LaHaise wrote:
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome.  How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
> 
> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired.  Traditionally, ext2 has not done merging because the underlying
> system doesn't support it.  This leads to rather convoluted code for
> readahead which doesn't result in appropriately merged requests on
> indirect block boundaries, and in fact leads to suboptimal performance.
> The only case I see where merging of requests can improve things is when
> dealing with lots of small files.  But we already know that small files
> need to be treated differently (e.g. tail merging).  Besides, most of the
> benefit of merging can be had by doing readaround for these small files.

Stephen already covered this point: the merging is not a problem
to deal with for read-ahead. The underlying system can easily
queue that in nice big chunks. Delayed allocation makes it
easier to flush big chunks as well. I seem to recall the xfs people
having problems with the lack of merging causing a performance hit
on smaller I/O.

Of course merging doesn't have to happen in ll_rw_blk.

> As for io completion, can't we just issue separate requests for the
> critical data and the readahead?  That way for SCSI disks, the important
> io should be finished while the readahead can continue.  Thoughts?

Priorities?

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:05           ` Stephen C. Tweedie
  2001-02-06 17:14             ` Jens Axboe
  2001-02-06 17:22             ` Christoph Hellwig
@ 2001-02-06 17:37             ` Ben LaHaise
  2001-02-06 18:00               ` Jens Axboe
                                 ` (2 more replies)
  2 siblings, 3 replies; 124+ messages in thread
From: Ben LaHaise @ 2001-02-06 17:37 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ingo Molnar

Hey folks,

On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:

> The whole point of the post was that it is merging, not splitting,
> which is troublesome.  How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

Let me just emphasize what Stephen is pointing out: if requests are
properly merged at higher layers, then merging is neither required nor
desired.  Traditionally, ext2 has not done merging because the underlying
system doesn't support it.  This leads to rather convoluted code for
readahead which doesn't result in appropriately merged requests on
indirect block boundaries, and in fact leads to suboptimal performance.
The only case I see where merging of requests can improve things is when
dealing with lots of small files.  But we already know that small files
need to be treated differently (e.g. tail merging).  Besides, most of the
benefit of merging can be had by doing readaround for these small files.

As for io completion, can't we just issue separate requests for the
critical data and the readahead?  That way for SCSI disks, the important
io should be finished while the readahead can continue.  Thoughts?

		-ben


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:05           ` Stephen C. Tweedie
  2001-02-06 17:14             ` Jens Axboe
@ 2001-02-06 17:22             ` Christoph Hellwig
  2001-02-06 18:26               ` Stephen C. Tweedie
  2001-02-06 17:37             ` Ben LaHaise
  2 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2001-02-06 17:22 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar

On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote:
> The whole point of the post was that it is merging, not splitting,
> which is troublesome.  How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

The object passed down to the low-level driver just needs to be able
to contain multiple end-io callbacks.  The decision of what to call when
some of the scatter-gather entities fail is of course not so easy to
handle and needs further discussion.
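
Structurally, "multiple end-io callbacks" could be as simple as one
callback per scatter-gather entity (a sketch with invented names, not
code from any patch):

struct sg_ent {
	void *addr;
	unsigned int len;
	void (*end_io)(struct sg_ent *ent, int uptodate);	/* per entity */
};

struct io_container {
	int nr_ents;
	struct sg_ent *ents;
};

/*
 * The lower level reports each entity individually, so the caller
 * learns exactly which pieces succeeded; the hard part is deciding
 * what to report once some of them fail.
 */
static void container_done(struct io_container *c, int first_bad)
{
	int i;

	for (i = 0; i < c->nr_ents; i++)
		c->ents[i].end_io(&c->ents[i], first_bad < 0 || i < first_bad);
}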

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:05           ` Stephen C. Tweedie
@ 2001-02-06 17:14             ` Jens Axboe
  2001-02-06 17:22             ` Christoph Hellwig
  2001-02-06 17:37             ` Ben LaHaise
  2 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2001-02-06 17:14 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar

On Tue, Feb 06 2001, Stephen C. Tweedie wrote:
> > I don't think so.  If we minimize the state in the IO container object,
> > the lower levels could split them as they see fit and the IO completion
> > function just has to handle the case that it might be called for a smaller
> > object.
> 
> The whole point of the post was that it is merging, not splitting,
> which is troublesome.  How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

You can't; the stuff I played with turned out to be horrible. At
least with the current kiobuf I/O stuff, merging will have to be
done before it's submitted. And IMO we don't want to lose the
ability to cluster buffers and requests in ll_rw_blk.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06 17:00         ` Christoph Hellwig
@ 2001-02-06 17:05           ` Stephen C. Tweedie
  2001-02-06 17:14             ` Jens Axboe
                               ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-06 17:05 UTC (permalink / raw)
  To: Stephen C. Tweedie, Linus Torvalds, Alan Cox, Manfred Spraul,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise,
	Ingo Molnar

Hi,

On Tue, Feb 06, 2001 at 06:00:58PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote:
> > 
> > Is that a realistic basis for a cleaned-up ll_rw_blk.c?
> 
> I don't think so.  If we minimize the state in the IO container object,
> the lower levels could split them as they see fit and the IO completion
> function just has to handle the case that it might be called for a smaller
> object.

The whole point of the post was that it is merging, not splitting,
which is troublesome.  How are you going to merge requests without
having chains of scatter-gather entities each with their own
completion callbacks?

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  0:07       ` Stephen C. Tweedie
@ 2001-02-06 17:00         ` Christoph Hellwig
  2001-02-06 17:05           ` Stephen C. Tweedie
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2001-02-06 17:00 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Steve Lord,
	linux-kernel, kiobuf-io-devel, Ben LaHaise, Ingo Molnar

On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote:
> This is the current situation.  If the page cache submits a 64K IO to
> the block layer, it does so in pieces, and then expects to be told on
> return exactly which pages succeeded and which failed.
> 
> That's where the mess of having multiple completion objects in a
> single IO request comes from.  Can we just forbid this case?
> 
> That's the short cut that SGI's kiobuf block dev patches do when they
> get kiobufs: they currently deal with either buffer_heads or kiobufs
> in struct requests, but they don't merge kiobuf requests.

IIRC Jens Axboe has done some work on merging kiobuf-based requests.

> (XFS already clusters the IOs for them in that case.)
> 
> Is that a realistic basis for a cleaned-up ll_rw_blk.c?

I don't think so.  If we minimize the state in the IO container object,
the lower levels could split them as they see fit and the IO completion
function just has to handle the case that it might be called for a smaller
object.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
Whip me.  Beat me.  Make me maintain AIX.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  1:01       ` Linus Torvalds
  2001-02-06  9:22         ` Roman Zippel
@ 2001-02-06  9:30         ` Ingo Molnar
  1 sibling, 0 replies; 124+ messages in thread
From: Ingo Molnar @ 2001-02-06  9:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Zippel, Alan Cox, Stephen C. Tweedie, Manfred Spraul,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel


On Mon, 5 Feb 2001, Linus Torvalds wrote:

> [...] But talk to Davem and ank about why they wanted vectors.

one issue is allocation overhead. The fragment array is a natural and
constant-size part of an skb, thus we get all the control structures in
place while allocating a structure that we have to allocate anyway.

another issue is that certain cards have (or can have) SG-limits, so we
have to be prepared to have a 'limited' array of fragments anyway, and
have to be prepared to split/refragment packets. Whether there is a global
MAX_SKB_FRAGS limit or not makes no difference.
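
The shape Ingo is describing, roughly (abbreviated from the zero-copy
patches; the constant's value depends on page size, and unrelated skb
fields are omitted here):

#define MAX_SKB_FRAGS	6	/* example value only */

struct page;			/* opaque here */

typedef struct skb_frag {
	struct page *page;
	unsigned short page_offset;
	unsigned short size;
} skb_frag_t;

struct skb_shared_info {
	unsigned int nr_frags;			/* slots actually in use */
	skb_frag_t frags[MAX_SKB_FRAGS];	/* constant size: comes for
						   free with the skb alloc */
};

A card with a smaller hardware SG limit than MAX_SKB_FRAGS still has to
be handed a refragmented packet, which is Ingo's second point.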

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  1:01       ` Linus Torvalds
@ 2001-02-06  9:22         ` Roman Zippel
  2001-02-06  9:30         ` Ingo Molnar
  1 sibling, 0 replies; 124+ messages in thread
From: Roman Zippel @ 2001-02-06  9:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> > Does it have to be vectors? What about lists?
> 
> I'd prefer to avoid lists unless there is some overriding concern, like a
> real implementation issue. But I don't care much one way or the other -
> what I care about is that the setup and usage time is as low as possible.
> I suspect arrays are better for that.

I was thinking more about the higher layers. Here it's simpler to set up a
list of pages which can be sent to a lower layer. In the page cache we
already have per-address-space lists, so it would be very easy to use
that. A lower layer can of course generate anything it wants out of this,
e.g. it can generate sublists or vectors.

bye, Roman


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  0:31     ` Roman Zippel
  2001-02-06  1:01       ` Linus Torvalds
@ 2001-02-06  1:08       ` David S. Miller
  1 sibling, 0 replies; 124+ messages in thread
From: David S. Miller @ 2001-02-06  1:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Zippel, Alan Cox, Stephen C. Tweedie, Manfred Spraul,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel


Linus Torvalds writes:
 > But talk to Davem and ank about why they wanted vectors.

SKB setup and free needs to be as light as possible.
Using vectors leads to code like:

skb_data_free(...)
{
...
	for (i = 0; i < MAX_SKB_FRAGS; i++)
		put_page(skb_shinfo(skb)->frags[i].page);
}

Currently, the ZC patches have a fixed frag vector size
(MAX_SKB_FRAGS).  But a part of me wants this to be
made dynamic (to handle HIPPI etc. properly) whereas
another part of me doesn't want to do it that way because
it would increase the complexity of paged SKB handling
and add yet another member to the SKB structure.

Later,
David S. Miller
davem@redhat.com

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-06  0:31     ` Roman Zippel
@ 2001-02-06  1:01       ` Linus Torvalds
  2001-02-06  9:22         ` Roman Zippel
  2001-02-06  9:30         ` Ingo Molnar
  2001-02-06  1:08       ` David S. Miller
  1 sibling, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-06  1:01 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel



On Tue, 6 Feb 2001, Roman Zippel wrote:
> > 
> > 	int nr_buffers;
> > 	struct buffer *array;
> > 
> > should be the low-level abstraction. 
> 
> Does it have to be vectors? What about lists?

I'd prefer to avoid lists unless there is some overriding concern, like a
real implementation issue. But I don't care much one way or the other -
what I care about is that the setup and usage time is as low as possible.
I suspect arrays are better for that.

I have this strong suspicion that networking is going to be the most
latency-critical and complex part of this, and the fact that the
networking code wanted arrays is what makes me think that arrays are the
right way to go. But talk to Davem and ank about why they wanted vectors.

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 19:28   ` Linus Torvalds
  2001-02-05 20:54     ` Stephen C. Tweedie
@ 2001-02-06  0:31     ` Roman Zippel
  2001-02-06  1:01       ` Linus Torvalds
  2001-02-06  1:08       ` David S. Miller
  1 sibling, 2 replies; 124+ messages in thread
From: Roman Zippel @ 2001-02-06  0:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> This all proves that the lowest level of layering should be pretty much
> nothing but the vectors. No callbacks, no crap like that. That's already a
> level of abstraction away, and should not get tacked on. Your lowest level
> of abstraction should be just the "area". Something like
> 
> 	struct buffer {
> 		struct page *page;
> 		u16 offset, length;
> 	};
> 
> 	int nr_buffers:
> 	struct buffer *array;
> 
> should be the low-level abstraction. 

Does it have to be vectors? What about lists? I've been thinking about this
for some time now, and I think lists are more flexible. At a higher level we
can easily generate a list of pages, and at a lower level you can still split
them up as needed. It would be basically the same structure, but you
could use it everywhere with the same kind of operations.
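
For instance, a sketch of the list flavour (names invented): sublists can
be detached for a lower layer without copying anything.

struct page_elem {
	struct page_elem *next;
	void *page;
};

/*
 * Split off the first n elements (n >= 1, chain assumed long enough)
 * as an independent chain; the remainder stays with the caller.
 */
static struct page_elem *split_prefix(struct page_elem **head, int n)
{
	struct page_elem *first = *head, *p = first;

	while (--n > 0 && p->next)
		p = p->next;
	*head = p->next;
	p->next = NULL;		/* terminate the detached sublist */
	return first;
}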

bye, Roman


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 20:54     ` Stephen C. Tweedie
  2001-02-05 21:08       ` David Lang
  2001-02-05 21:51       ` Alan Cox
@ 2001-02-06  0:07       ` Stephen C. Tweedie
  2001-02-06 17:00         ` Christoph Hellwig
  2 siblings, 1 reply; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-06  0:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel, Ben LaHaise,
	Ingo Molnar

Hi,

OK, if we take a step back, what does this look like:

On Mon, Feb 05, 2001 at 08:54:29PM +0000, Stephen C. Tweedie wrote:
> 
> If we are doing readahead, we want completion callbacks raised as soon
> as possible on IO completions, no matter how many other IOs have been
> merged with the current one.  More importantly though, when we are
> merging multiple page or buffer_head IOs in a request, we want to know
> exactly which buffer/page contents are valid and which are not once
> the IO completes.

This is the current situation.  If the page cache submits a 64K IO to
the block layer, it does so in pieces, and then expects to be told on
return exactly which pages succeeded and which failed.

That's where the mess of having multiple completion objects in a
single IO request comes from.  Can we just forbid this case?

That's the short cut that SGI's kiobuf block dev patches do when they
get kiobufs: they currently deal with either buffer_heads or kiobufs
in struct requests, but they don't merge kiobuf requests.  (XFS
already clusters the IOs for them in that case.)

Is that a realistic basis for a cleaned-up ll_rw_blk.c?

It implies that the caller has to do IO merging.  For read, that's not
much pain, as the most important case --- readahead --- is already
done in a generic way which could submit larger IOs relatively easily.
It would be harder for writes, but high-level write clustering code
has already been started.

It implies that for any IO, on IO failure you don't get told which
part of the IO failed.  That adds code to the caller: the page cache
would have to retry per-page to work out which pages are readable and
which are not.  It means that for soft raid, you don't get told which
blocks are bad if a stripe has an error anywhere.  Ingo, is that a
potential problem?

But it gives very, very simple semantics to the request layer: single
IOs go in (with a completion callback and a single scatter-gather
list), and results go back with success or failure.

With that change, it becomes _much_ more natural to push a simple sg
list down through the disk layers.
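
Concretely, the request interface described here could shrink to
something like this (an illustrative sketch; the names are invented):

/* One scatter-gather list in, one verdict out: no partial status. */
struct sg_buf {
	void *addr;
	unsigned int len;
};

struct simple_io {
	int nr_bufs;
	struct sg_buf *bufs;	/* pre-merged by the caller */
	void (*complete)(struct simple_io *io, int error);	/* exactly one */
	void *private;		/* caller context */
};

On failure the caller sees only "the whole IO failed"; localising the
error is pushed back up to the caller, as noted above.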

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 20:54     ` Stephen C. Tweedie
  2001-02-05 21:08       ` David Lang
@ 2001-02-05 21:51       ` Alan Cox
  2001-02-06  0:07       ` Stephen C. Tweedie
  2 siblings, 0 replies; 124+ messages in thread
From: Alan Cox @ 2001-02-05 21:51 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Stephen C. Tweedie, Manfred Spraul,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

> OK, this is exactly where we have a problem: I can see too many cases
> where we *do* need to know about completion stuff at a fine
> granularity when it comes to disk IO (unlike network IO, where we can
> usually rely on a caller doing retransmit at some point in the stack).

OK, so what's wrong with embedding kiovecs into something bigger?  One
kmalloc can allocate two arrays: one of buffers (shared with networking etc.)
followed by a second of block io completion data.

Now you can also kind of cast from the bigger to the smaller object and get
the right result if the kiovec array is the start of the combined allocation.
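
In code the trick is just layout (a sketch with invented names; plain
malloc stands in for the single kmalloc):

#include <stdlib.h>

struct buffer {				/* the shared, network-style vector */
	void *page;
	unsigned short offset, length;
};

struct blk_extra {			/* block-only completion data */
	void (*end_io)(struct buffer *b, int uptodate);
};

struct blk_io {
	int nr;
	struct buffer *bufs;		/* == start of the allocation */
	struct blk_extra *extra;	/* follows the buffer array */
};

static int blk_io_init(struct blk_io *io, int nr)
{
	/* one allocation: buffers first, completion data right after */
	io->bufs = malloc(nr * (sizeof(struct buffer) +
				sizeof(struct blk_extra)));
	if (!io->bufs)
		return -1;
	io->extra = (struct blk_extra *)(io->bufs + nr);
	io->nr = nr;
	return 0;
}

Code that understands only the plain buffer vector can be handed
io->bufs unchanged, which is the "cast from the bigger to the smaller
object" above.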



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 20:54     ` Stephen C. Tweedie
@ 2001-02-05 21:08       ` David Lang
  2001-02-05 21:51       ` Alan Cox
  2001-02-06  0:07       ` Stephen C. Tweedie
  2 siblings, 0 replies; 124+ messages in thread
From: David Lang @ 2001-02-05 21:08 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

So you have two concepts in one here:

1. SG items that can be more than a single page

2. a container for #1 that includes details for completion callbacks, etc

It looks like Linus is objecting to having both in the same structure and
then using that structure as your generic low-level bucket.

Define these as two separate structures: the #1 structure may now be
lightweight enough to be used for networking and other functions, and when
you go to use it with disk IO you then wrap it in the #2 structure (see the
sketch below).  This still lets you have the completion callbacks at as low
a level as you want; you just have to explicitly add this layer when it
makes sense.
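
A minimal sketch of that layering (hypothetical names):

/* #1: the lightweight SG item, shareable with networking etc. */
struct sg_item {
	void *page;
	unsigned short offset, length;
};

/* #2: the disk-IO wrapper that adds completion state around #1. */
struct io_wrap {
	int nr;
	struct sg_item *items;	/* the plain vector underneath */
	void (*complete)(struct io_wrap *w, int error);
	void *owner;		/* whoever wants the callback */
};

Networking passes bare sg_item arrays around; disk IO allocates the
wrapper only when it actually needs the callback machinery.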

David Lang



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> Date: Mon, 5 Feb 2001 20:54:29 +0000
> From: Stephen C. Tweedie <sct@redhat.com>
> To: Linus Torvalds <torvalds@transmeta.com>
> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>, Stephen C. Tweedie <sct@redhat.com>,
>      Manfred Spraul <manfred@colorfullife.com>,
>      Christoph Hellwig <hch@caldera.de>, Steve Lord <lord@sgi.com>,
>      linux-kernel@vger.kernel.org, kiobuf-io-devel@lists.sourceforge.net
> Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
>
> Hi,
>
> On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote:
>
> > The _vectors_ are needed at the very lowest levels: the levels that do not
> > necessarily have to worry at all about completion notification etc. You
> > want the arbitrary scatter-gather vectors passed down to the stuff that
> > sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
> > high-level semantics.
>
> OK, this is exactly where we have a problem: I can see too many cases
> where we *do* need to know about completion stuff at a fine
> granularity when it comes to disk IO (unlike network IO, where we can
> usually rely on a caller doing retransmit at some point in the stack).
>
> If we are doing readahead, we want completion callbacks raised as soon
> as possible on IO completions, no matter how many other IOs have been
> merged with the current one.  More importantly though, when we are
> merging multiple page or buffer_head IOs in a request, we want to know
> exactly which buffer/page contents are valid and which are not once
> the IO completes.
>
> The current request struct's buffer_head list provides that quite
> naturally, but is a hugely heavyweight way of performing large IOs.
> What I'm really after is a way of sending IOs to make_request in such
> a way that if the caller provides an array of buffer_heads, it gets
> back completion information on each one, but if the IO is requested in
> large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then
> the request code can deal with it in those large chunks.
>
> What worries me is things like the soft raid1/5 code: pretending that
> we can skimp on the return information about which blocks were
> transferred successfully and which were not sounds like a really bad
> idea when you've got a driver which relies on that completion
> information in order to do intelligent error recovery.
>
> Cheers,
>  Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 19:28   ` Linus Torvalds
@ 2001-02-05 20:54     ` Stephen C. Tweedie
  2001-02-05 21:08       ` David Lang
                         ` (2 more replies)
  2001-02-06  0:31     ` Roman Zippel
  1 sibling, 3 replies; 124+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote:

> The _vectors_ are needed at the very lowest levels: the levels that do not
> necessarily have to worry at all about completion notification etc. You
> want the arbitrary scatter-gather vectors passed down to the stuff that
> sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
> high-level semantics.

OK, this is exactly where we have a problem: I can see too many cases
where we *do* need to know about completion stuff at a fine
granularity when it comes to disk IO (unlike network IO, where we can
usually rely on a caller doing retransmit at some point in the stack).

If we are doing readahead, we want completion callbacks raised as soon
as possible on IO completions, no matter how many other IOs have been
merged with the current one.  More importantly though, when we are
merging multiple page or buffer_head IOs in a request, we want to know
exactly which buffer/page contents are valid and which are not once
the IO completes.

The current request struct's buffer_head list provides that quite
naturally, but is a hugely heavyweight way of performing large IOs.
What I'm really after is a way of sending IOs to make_request in such
a way that if the caller provides an array of buffer_heads, it gets
back completion information on each one, but if the IO is requested in
large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then
the request code can deal with it in those large chunks.

What worries me is things like the soft raid1/5 code: pretending that
we can skimp on the return information about which blocks were
transferred successfully and which were not sounds like a really bad
idea when you've got a driver which relies on that completion
information in order to do intelligent error recovery.

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 19:16 ` [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait Alan Cox
@ 2001-02-05 19:28   ` Linus Torvalds
  2001-02-05 20:54     ` Stephen C. Tweedie
  2001-02-06  0:31     ` Roman Zippel
  0 siblings, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-02-05 19:28 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel



On Mon, 5 Feb 2001, Alan Cox wrote:

> > Stop this idiocy, Stephen. You're _this_ close to be the first person I
> > ever blacklist from my mailbox. 
> 
> I think I've just figured out what the miscommunication is around here
> 
> kiovecs can describe arbitrary scatter-gather

I know. But they are entirely useless for anything that requires low
latency handling. They are big, bloated, and slow. 

It is also an example of layering gone horribly horribly wrong.

The _vectors_ are needed at the very lowest levels: the levels that do not
necessarily have to worry at all about completion notification etc. You
want the arbitrary scatter-gather vectors passed down to the stuff that
sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
high-level semantics.

This all proves that the lowest level of layering should be pretty much
nothing but the vectors. No callbacks, no crap like that. That's already a
level of abstraction away, and should not get tacked on. Your lowest level
of abstraction should be just the "area". Something like

	struct buffer {
		struct page *page;
		u16 offset, length;
	};

	int nr_buffers;
	struct buffer *array;

should be the low-level abstraction. 

And on top of _that_ you build a more complex entity (so a "kiobuf" would
be defined not just by the memory area, but by the operation you want to
do on it, and the callback on completion etc).
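
Continuing the sketch, such a layered entity might look like this (the
higher-level fields here are illustrative guesses, not a design from
this mail):

	struct io_op {
		/* the low-level "area", exactly as above */
		int nr_buffers;
		struct buffer *array;

		/* the higher-level entity layered on top of it */
		int op;				/* e.g. READ or WRITE */
		void (*complete)(struct io_op *io, int error);
		void *private;
	};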

Currently kiobufs do it the other way around: you can build up an array,
but only by having the overhead of passing kiovecs around - i.e. you have
to pass the _highest_ level of abstraction around just to get the lowest
level of details. That's wrong.

And that wrongness comes _exactly_ from Stephen's opinion that the
fundamental IO entity is an array of contiguous pages. 

And, btw, this is why the networking layer will never be able to use
kiobufs.

Which makes kiobufs as they stand now basically useless for anything but
some direct disk stuff. And I'd rather work on making the low-level disk
drivers use something saner.

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 19:09 [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Linus Torvalds
@ 2001-02-05 19:16 ` Alan Cox
  2001-02-05 19:28   ` Linus Torvalds
  0 siblings, 1 reply; 124+ messages in thread
From: Alan Cox @ 2001-02-05 19:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Alan Cox, Manfred Spraul, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel

> Stop this idiocy, Stephen. You're _this_ close to be the first person I
> ever blacklist from my mailbox. 

I think I've just figured out what the miscommunication is around here.

kiovecs can describe arbitrary scatter-gather.

It's just that they can also cleanly describe the common case of contiguous
pages in one entry.

After all, a subpage block is simply a contiguous range within a single page.

Alan



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-05 16:56 [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Linus Torvalds
@ 2001-02-05 17:27 ` Alan Cox
  0 siblings, 0 replies; 124+ messages in thread
From: Alan Cox @ 2001-02-05 17:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Manfred Spraul, Stephen C. Tweedie, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

> In fact, regular IDE DMA allows arbitrary scatter-gather at least in
> theory. Linux has never used it, so I don't know how well it works in

Purely in theory, as Jeff found out. 

> But despite a lot of likely practical reasons why it won't work for
> arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it
> wouldn't. And there are bound to be better controllers that could handle
> it.

I2O controllers are required to handle it (most don't) and some of the
high-end SCSI/FC controllers even get it right.



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
       [not found] <CA2569E9.004A4E23.00@d73mta05.au.ibm.com>
@ 2001-02-04 16:46 ` Alan Cox
  0 siblings, 0 replies; 124+ messages in thread
From: Alan Cox @ 2001-02-04 16:46 UTC (permalink / raw)
  To: bsuparna
  Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox,
	Christoph Hellwig, Andi Kleen

> It appears that we are coming across 2 kinds of requirements for kiobuf
> vectors - and quite a bit of debate centering around that.
> 
> 1. In the block device i/o world, where large i/os may be involved, we'd
> 2. In the networking world, we deal with smaller fragments (for protocol

It's probably worth commenting at this point that the I2O message passing
layers do indeed have both #1 and #2 type descriptor chains to optimise
performance for different tasks.  We aren't the only people to hit this.

I2O supports 
	offset, pagelist, length

where the middle pages in the list are used in their entirety

And sets of
	addr, len

tuples.
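
Paraphrased into C (field names invented, not the I2O spec's), the two
shapes are:

/* #1: one run described by a page list; the offset applies to the
 * first page, middle pages are used in full, and length says where
 * the run stops in the last page. */
struct pagelist_desc {
	unsigned int offset;	/* into the first page */
	unsigned int length;	/* total bytes in the run */
	int nr_pages;
	void **pages;
};

/* #2: a flat chain of (addr, len) tuples. */
struct addr_len_desc {
	void *addr;
	unsigned int len;
};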




* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
  2001-02-01 18:39 [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains Rik van Riel
@ 2001-02-01 18:46 ` Alan Cox
  0 siblings, 0 replies; 124+ messages in thread
From: Alan Cox @ 2001-02-01 18:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, bsuparna,
	linux-kernel, kiobuf-io-devel

> OTOH, somehow a zero-copy system which converts the zero-copy
> metadata every time the buffer is handed to another subsystem
> just doesn't sound right ...
> 
> (well, maybe it _is_, but it looks quite inefficient at first
> glance)

I would certainly be a lot happier if there were a single sensible zero-copy
format doing the lot, but only if it doesn't turn into a cross between a 747
and a bicycle.
