linux-kernel.vger.kernel.org archive mirror
* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait  /notify + callback chains
@ 2001-02-04 13:24 bsuparna
  0 siblings, 0 replies; 76+ messages in thread
From: bsuparna @ 2001-02-04 13:24 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: linux-kernel, kiobuf-io-devel, Alan Cox, Christoph Hellwig, Andi Kleen


>Hi,
>
>On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
>> >
>> > If I have a page vector with a single offset/length pair, I can build
>> > a new header with the same vector and modified offset/length to split
>> > the vector in two without copying it.
>>
>> You just say in the higher-level structure ignore from x to y even if
>> they have an offset in their own vector.
>
>Exactly --- and so you end up with something _much_ uglier, because
>you end up with all sorts of combinations of length/offset fields all
>over the place.
>
>This is _precisely_ the mess I want to avoid.
>
>Cheers,
> Stephen

It appears that we are coming across 2 kinds of requirements for kiobuf
vectors - and quite a bit of debate centering around that.

1. In the block device i/o world, where large i/os may be involved, we'd
like to be able to describe chunks/fragments that contain multiple pages;
which is why it makes sense to have a single <offset, length> pair for the
entire set of pages in a kiobuf, rather than having to deal with per-page
offset/len fields.

2. In the networking world, we deal with smaller fragments (for protocol
headers and small packets), ideally chained together, typically not page
aligned, with the ability to extend the list at least at the head and tail
(and maybe some reshuffling in case of IP fragmentation?); so I guess
that's why it seems good to have an <offset, length> pair per
page/fragment. (If there can be multiple fragments in a page, even this
might not be frugal enough ...)

Looks like there are 2 kinds of entities that we are looking for in the kio
descriptor:
     - A collection of physical memory pages (call it say, a page_list)
     - A collection of fragments of memory described as <offset, len>
tuples w.r.t this collection
     (offset in turn could be <index in page-list, offset-in-page> if it
helps) (call this collection a frag_list)

Can't we define a kiobuf structure as just this? A combination of a
frag_list and a page_list? (Clone kiobufs might share the original
kiobuf's page_list, but just split parts of the frag_list.)
How hard is it to maintain and manipulate such a structure?

BTW, we could have a higher level io container that includes a <status>
field and a <wait_queue_head> to take care of i/o completion. (If we have a
wait queue head, then I don't think we need a separate callback function,
provided we have Ben's wakeup functions in place.)
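
As a rough sketch (the type and field names below are purely illustrative,
not an existing interface), that might look like:

struct kio_frag {
        unsigned int    page_index;     /* index into the page_list */
        unsigned int    offset;         /* offset within that page */
        unsigned int    length;         /* length of this fragment */
};

struct kio_mem {
        int             nr_pages;
        struct page     **page_list;    /* clone kiobufs share this */
        int             nr_frags;
        struct kio_frag *frag_list;     /* clones split just this part */
};

struct kio_container {                  /* the higher level io container */
        int                     status;
        wait_queue_head_t       waitq;  /* wakeup functions queue here */
        struct kio_mem          mem;
};

A block driver would mostly see a single fragment spanning many pages,
while networking code could chain many small fragments against the same
page_list.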

Or, is this going in the direction of a cross between an elephant and a
bicycle :-) ?

Regards
Suparna



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-06 13:50 bsuparna
@ 2001-02-06 14:07 ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2001-02-06 14:07 UTC (permalink / raw)
  To: bsuparna
  Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox,
	Christoph Hellwig, Andi Kleen

On Tue, Feb 06 2001, bsuparna@in.ibm.com wrote:
> >It depends on the device driver.  Different controllers will have
> >different maximum transfer size.  For IDE, for example, we get wakeups
> >all over the place.  For SCSI, it depends on how many scatter-gather
> >entries the driver can push into a single on-the-wire request.  Exceed
> >that limit and the driver is forced to open a new scsi mailbox, and
> >you get independent completion signals for each such chunk.

SCSI does not build a request bigger than the low level driver
can handle. If you exceed the scatter count in a single request,
you just stop and fire off that request, later on restarting I/O
on the remainder.

> I see. I remember Jens Axboe mentioning something like this with IDE.
> So, in this case, you want every such chunk to check if it's completed
> filling up a buffer and then trigger a wakeup on that ?

Yes. Which is why dealing with buffer heads is so nice in this
regard: you never have problems with ending I/O on a single "piece".

> But, does this also mean that in such a case combining requests beyond this
> limit doesn't really help ? (Reordering requests to get contiguity would
> help of course in terms of seek times, I guess, but not merging beyond this
> limit)

There's a slight benefit in building bigger requests than the driver
can handle, in that you can have more I/O pending on the queue. It's
not worth spending too much time on though.

-- 
Jens Axboe


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait  /notify + callback chains
@ 2001-02-06 13:50 bsuparna
  2001-02-06 14:07 ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: bsuparna @ 2001-02-06 13:50 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: linux-kernel, kiobuf-io-devel, Alan Cox, Christoph Hellwig, Andi Kleen


>Hi,
>
>On Mon, Feb 05, 2001 at 08:01:45PM +0530, bsuparna@in.ibm.com wrote:
>>
>> >It's the very essence of readahead that we wake up the earlier buffers
>> >as soon as they become available, without waiting for the later ones
>> >to complete, so we _need_ this multiple completion concept.
>>
>> I can understand this in principle, but when we have a single request
>> going down to the device that actually fills in multiple buffers, do we get
>> notified (interrupted) by the device before all the data in that request
>> got transferred ?
>
>It depends on the device driver.  Different controllers will have
>different maximum transfer size.  For IDE, for example, we get wakeups
>all over the place.  For SCSI, it depends on how many scatter-gather
>entries the driver can push into a single on-the-wire request.  Exceed
>that limit and the driver is forced to open a new scsi mailbox, and
>you get independent completion signals for each such chunk.

I see. I remember Jens Axboe mentioning something like this with IDE.
So, in this case, you want every such chunk to check if it's completed
filling up a buffer and then trigger a wakeup on that ?
But, does this also mean that in such a case combining requests beyond this
limit doesn't really help ? (Reordering requests to get contiguity would
help of course in terms of seek times, I guess, but not merging beyond this
limit)

>> >Which is exactly why we have one kiobuf per higher-level buffer, and
>> >we chain together kiobufs when we need to for a long request, but we
>> >still get the independent completion notifiers.
>>
>> As I mentioned above, the alternative is to have the i/o completion related
>> linkage information within the wakeup structures instead. That way, it
>> doesn't matter to the lower level driver what higher level structure we
>> have above (maybe buffer heads, may be page cache structures, may be
>> kiobufs). We only chain together memory descriptors for the buffers during
>> the io.
>
>You forgot IO failures: it is essential, once the IO completes, to
>know exactly which higher-level structures completed successfully and
>which did not.  The low-level drivers have to have access to the
>independent completion notifications for this to work.
>
No, I didn't forget IO failures; it's just that I expect the wait structure
containing the wakeup function to be embedded in a cev structure that
contains a pointer to the wait_queue_head field in the higher level
structure. The rest is for the wakeup function to interpret (it can always
access the other fields in the higher level structure - just like
list_entry() does).

Later I realized that instead of having multiple wakeup functions queued on
the low level structure's wait queue, it's perhaps better to just sort of
turn the cev_wait structure upside down (entries on the lower level
structure's queue should link to the parent entries instead).
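
To illustrate the inverted arrangement (cev_wait is only a design sketch
from this thread, and the field names here are hypothetical): each entry
queued on the lower level structure's wait queue points back to its parent
entry, so the wakeup function can find and complete the higher level buffer.

struct cev_wait {
        wait_queue_t    wait;           /* queued on the low level waitq, with a
                                           wakeup function a la Ben's extensions */
        struct cev_wait *parent;        /* entry for the higher level structure */
        wait_queue_head_t *parent_waitq;/* waitq embedded in that structure */
        int             status;         /* completion status for this piece */
};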





* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify  + callback chains
  2001-02-05 22:58                       ` Stephen C. Tweedie
  2001-02-05 23:06                         ` Alan Cox
@ 2001-02-06  0:19                         ` Manfred Spraul
  1 sibling, 0 replies; 76+ messages in thread
From: Manfred Spraul @ 2001-02-06  0:19 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox,
	Linus Torvalds

"Stephen C. Tweedie" wrote:
> 
> The original multi-page buffers came from the map_user_kiobuf
> interface: they represented a user data buffer.  I'm not wedded to
> that format --- we can happily replace it with a fine-grained sg list
>
Could you change that interface?

<<< from Linus mail:

        struct buffer {
                struct page *page;
                u16 offset, length;
        };

>>>>>>

/* returns the number of used buffers, or <0 on error */
int map_user_buffer(struct buffer *ba, int max_bcount,
			void* addr, int len);
void unmap_buffer(struct buffer *ba, int bcount);

That's enough for the zero copy pipe code ;-)

Real hw drivers probably need a replacement for pci_map_single()
(pci_map_and_align_and_bounce_buffer_array())

The kiobuf structure could contain these 'struct buffer' instead of the
current 'struct page' pointers.
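
A minimal usage sketch of the proposed calls (they don't exist in the
kernel today; error handling and pinning details are omitted, and the
16-entry array size is arbitrary):

	struct buffer ba[16];
	int bcount, i;

	bcount = map_user_buffer(ba, 16, uaddr, len);	/* uaddr/len from caller */
	if (bcount < 0)
		return bcount;
	for (i = 0; i < bcount; i++)
		; /* hand ba[i].page + ba[i].offset/length to the pipe code */
	unmap_buffer(ba, bcount);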

> 
> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobufs.
>
Probably.


--
	Manfred


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 23:06                         ` Alan Cox
@ 2001-02-05 23:16                           ` Stephen C. Tweedie
  0 siblings, 0 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 23:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, Ingo Molnar, Steve Lord, linux-kernel,
	kiobuf-io-devel, Linus Torvalds

Hi,

On Mon, Feb 05, 2001 at 11:06:48PM +0000, Alan Cox wrote:
> > do you then tell the application _above_ raid0 if one of the
> > underlying IOs succeeds and the other fails halfway through?
> 
> struct 
> {
> 	u32 flags;	/* because everything needs flags */
> 	struct io_completion *completions;
> 	kiovec_t sglist[0];
> } thingy;
> 
> now kmalloc one object containing the header, the sglist of the right size,
> and the completion list. Shove the completion list on the end of it as
> another array of objects - and what is the problem?

XFS uses both small metadata items in the buffer cache and large
pagebufs.  You may have merged a 512-byte read with a large pagebuf
read: one completion callback is associated with a single sg fragment,
the next callback belongs to a dozen different fragments.  Associating
the two lists becomes non-trivial, although it could be done.

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 22:58                       ` Stephen C. Tweedie
@ 2001-02-05 23:06                         ` Alan Cox
  2001-02-05 23:16                           ` Stephen C. Tweedie
  2001-02-06  0:19                         ` Manfred Spraul
  1 sibling, 1 reply; 76+ messages in thread
From: Alan Cox @ 2001-02-05 23:06 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox, Linus Torvalds

> do you then tell the application _above_ raid0 if one of the
> underlying IOs succeeds and the other fails halfway through?

struct 
{
	u32 flags;	/* because everything needs flags */
	struct io_completion *completions;
	kiovec_t sglist[0];
} thingy;

now kmalloc one object containing the header, the sglist of the right size,
and the completion list. Shove the completion list on the end of it as
another array of objects - and what is the problem?
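
Roughly, assuming nfrags sg entries and some io_completion definition
(a sketch only, not code from any tree):

	struct thingy *t;

	t = kmalloc(sizeof(*t) + nfrags * sizeof(kiovec_t)
			+ nfrags * sizeof(struct io_completion), GFP_KERNEL);
	if (!t)
		return -ENOMEM;
	t->flags = 0;
	/* the completion array sits just past the sglist */
	t->completions = (struct io_completion *)&t->sglist[nfrags];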

> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobufs.  

But that can be two arrays


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 21:28                     ` Ingo Molnar
@ 2001-02-05 22:58                       ` Stephen C. Tweedie
  2001-02-05 23:06                         ` Alan Cox
  2001-02-06  0:19                         ` Manfred Spraul
  0 siblings, 2 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 22:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox, Linus Torvalds

Hi,

On Mon, Feb 05, 2001 at 10:28:37PM +0100, Ingo Molnar wrote:
> 
> On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> 
> it's exactly these 'compound' structures i'm vehemently against. I do
> think it's a design nightmare. I can picture these monster kiobufs
> complicating the whole code for no good reason - we couldnt even get the
> bh-list code in block_device.c right - why do you think kiobufs *all
> across the kernel* will be any better?
> 
> RAID0 is not an issue. Split it up, use separate kiobufs for every
> different disk.

Umm, that's not the point --- of course you can use separate kiobufs
for the communication between raid0 and the underlying disks, but what
do you then tell the application _above_ raid0 if one of the
underlying IOs succeeds and the other fails halfway through?

And what about raid1?  Are you really saying that raid1 doesn't need
to know which blocks succeeded and which failed?  That's the level of
completion information I'm worrying about at the moment.

> fragmented skbs are a different matter: they are simply a bit more generic
> abstractions of 'memory buffer'. Clear goal, clear solution. I do not
> think kiobufs have clear goals.

The goal: allow arbitrary IOs to be pushed down through the stack in
such a way that the callers can get meaningful information back about
what worked and what did not.  If the write was a 128kB raw IO, then
you obviously get coarse granularity of completion callback.  If the
write was a series of independent pages which happened to be
contiguous on disk, you actually get told which pages hit disk and
which did not.

> and what is the goal of having multi-page kiobufs. To avoid having to do
> multiple function calls via a simpler interface? Shouldnt we optimize that
> codepath instead?

The original multi-page buffers came from the map_user_kiobuf
interface: they represented a user data buffer.  I'm not wedded to
that format --- we can happily replace it with a fine-grained sg list
--- but the reason they have been pushed so far down the IO stack is
the need for accurate completion information on the originally
requested IOs.

In other words, even if we expand the kiobuf into a sg vector list,
when it comes to merging requests in ll_rw_blk.c we still need to
track the callbacks on each independent source kiobufs.  

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 15:03                       ` Stephen C. Tweedie
  2001-02-05 15:19                         ` Alan Cox
@ 2001-02-05 22:09                         ` Ingo Molnar
  1 sibling, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2001-02-05 22:09 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Manfred Spraul, Linus Torvalds, Christoph Hellwig, Steve Lord,
	Linux Kernel List, kiobuf-io-devel, Alan Cox


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> > Obviously the disk access itself must be sector aligned and the total
> > length must be a multiple of the sector length, but there shouldn't be
> > any restrictions on the data buffers.
>
> But there are. Many controllers just break down and corrupt things
> silently if you don't align the data buffers (Jeff Merkey found this
> by accident when he started generating unaligned IOs within page
> boundaries in his NWFS code). And a lot of controllers simply cannot
> break a sector dma over a page boundary (at least not without some
> form of IOMMU remapping).

so we are putting workarounds for hardware bugs into the design?

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 12:19                   ` Stephen C. Tweedie
@ 2001-02-05 21:28                     ` Ingo Molnar
  2001-02-05 22:58                       ` Stephen C. Tweedie
  0 siblings, 1 reply; 76+ messages in thread
From: Ingo Molnar @ 2001-02-05 21:28 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox, Linus Torvalds


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> And no, the IO success is *not* necessarily sequential from the start
> of the IO: if you are doing IO to raid0, for example, and the IO gets
> striped across two disks, you might find that the first disk gets an
> error so the start of the IO fails but the rest completes.  It's the
> completion code which notifies the caller of what worked and what did
> not.

it's exactly these 'compound' structures i'm vehemently against. I do
think it's a design nightmare. I can picture these monster kiobufs
complicating the whole code for no good reason - we couldn't even get the
bh-list code in block_device.c right - why do you think kiobufs *all
across the kernel* will be any better?

RAID0 is not an issue. Split it up, use separate kiobufs for every
different disk. We need simple constructs - i do not understand why nobody
sees that these big fat monster-trucks of IO workload are *trouble*. They
keep things localized, instead of putting workload components into the
system immediately. We'll have performance bugs nobody has seen before.
bhs have one very nice property: they are simple, modularized. I think
this is like CISC vs. RISC: CISC designs ended up splitting 'fat
instructions' up into RISC-like instructions.

fragmented skbs are a different matter: they are simply a somewhat more
generic abstraction of a 'memory buffer'. Clear goal, clear solution. I do
not think kiobufs have clear goals.

and i do not buy the performance arguments. In 2.4.1 we improved block-IO
performance dramatically by fixing high-load IO scheduling. Write
performance suddenly improved dramatically, there is a 30-40% improvement
in dbench performance. To put it another way: *we needed 5 years to fix a
serious IO-subsystem performance bug*. Block IO was already too complex -
and Alex & Andrea have done a nice job streamlining and cleaning it up for
2.4. We should simplify it further - and optimize the components, instead
of bringing in yet another *big* complication into the API.

and what is the goal of having multi-page kiobufs? To avoid having to do
multiple function calls via a simpler interface? Shouldn't we optimize that
codepath instead?

	Ingo


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 18:49                               ` Stephen C. Tweedie
  2001-02-05 19:04                                 ` Alan Cox
@ 2001-02-05 19:09                                 ` Linus Torvalds
  1 sibling, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-02-05 19:09 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, Manfred Spraul, Christoph Hellwig, Steve Lord,
	linux-kernel, kiobuf-io-devel



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> > Thats true for _block_ disk devices but if we want a generic kiovec then
> > if I am going from video capture to network I dont need to force anything more
> > than 4 byte align
> 
> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary.  They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset.  Every video framebuffer
> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Stop this idiocy, Stephen. You're _this_ close to being the first person I
ever blacklist from my mailbox.

Network. Packets. Fragmentation. Or just non-page-sized MTU's. 

It is _not_ a "series of contiguous pages". Never has been. Never will be.
So stop making excuses.

Also, think of protocols that may want to gather stuff from multiple
places, where the boundaries have little to do with pages but are
specified some other way. Imagine doing "writev()" style operations to
disk, gathering stuff from multiple sources into one operation.

Think of GART remappings - you can have multiple pages that show up as one
"linear" chunk to the graphics device behind the AGP bridge, but that are
_not_ contiguous in real memory.

There just is NO excuse for the "linear series of pages" view. And if you
cannot realize that, then I don't know what's wrong with you. Your
arguments are obviously crap, and the stuff you seem unable to argue
against (like networking) you decide to just ignore. Get your act
together.

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 16:36                     ` Linus Torvalds
@ 2001-02-05 19:08                       ` Stephen C. Tweedie
  0 siblings, 0 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox

Hi,

On Mon, Feb 05, 2001 at 08:36:31AM -0800, Linus Torvalds wrote:

> Have you ever thought about other things, like networking, special
> devices, stuff like that? They can (and do) have packet boundaries that
> have nothing to do with pages what-so-ever. They can have such notions as
> packets that contain multiple streams in one packet, where it ends up
> being split up into several pieces. Where neither the original packet
> _nor_ the final pieces have _anything_ to do with "pages".
> 
> THERE IS NO PAGE ALIGNMENT.

And kiobufs don't require IO to be page aligned, and they have never
done.  The only page alignment they assume is that if a *single*
scatter-gather element spans multiple pages, then the joins between
those pages occur on page boundaries.

Remember, a kiobuf is only designed to represent one scatter-gather
fragment, not a full sg list.  That was the whole reason for having a
kiovec as a separate concept: if you have more than one independent
fragment in the sg-list, you need more than one kiobuf.

And the reason why we created sg fragments which can span pages was so
that we can encode IOs which interact with the VM: any arbitrary
virtually-contiguous user data buffer can be mapped into a *single*
kiobuf for a write() call, so it's a generic way of supporting things
like O_DIRECT without the IO layers having to know anything about VM
(and Ben's async IO patches also use kiobufs in this way to allow
read()s to write to the user's data buffer once the IO completes,
without having to have a context switch back into that user's
context.)  Similarly, any extent of a file in the page cache can be
encoded in a single kiobuf.
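
(For reference, this is roughly how that map_user_kiobuf() path gets used;
the call signatures follow the 2.4-era kiobuf code as I recall them, so
treat the details as approximate:)

	struct kiobuf *iobuf;
	int err;

	err = alloc_kiovec(1, &iobuf);
	if (err)
		return err;
	/* pin the user's (possibly unaligned, multi-page) buffer */
	err = map_user_kiobuf(rw, iobuf, (unsigned long) user_buf, count);
	if (!err) {
		/* ... hand iobuf to the block layers, wait for completion ... */
		unmap_kiobuf(iobuf);
	}
	free_kiovec(1, &iobuf);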

And no, the simpler networking-style sg-list does not cut it for block
device IO, because for block devices, we want to have separate
completion status made available for each individual sg fragment in
the IO.  *That* is why the kiobuf is more heavyweight than the
networking variant: each fragment [kiobuf] in the scatter-gather list
[kiovec] has its own completion information.  

If we have a bunch of separate data buffers queued for sequential disk
IO as a single request, then we still want things like readahead and
error handling to work.  That means that we want the first kiobuf in
the chain to get its completion wakeup as soon as that segment of the
IO is complete, without having to wait for the remaining sectors of
the IO to be transferred.  It also means that if we've done something
like split the IO over a raid stripe, then when an error occurs, we
still want to know which of the callers' buffers succeeded and which
failed.

Yes, I agree that the original kiovec mechanism of using a *kiobuf[]
array to assemble the scatter-gather fragments sucked.  But I don't
believe that just throwing away the concept of kiobuf as an sg-fragment
will work either when it comes to disk IOs: the need for per-fragment
completion is too compelling.  I'd rather shift to allowing kiobufs to
be assembled into linked lists for IO to avoid *kiobuf[] vectors, in
just the same way that we currently chain buffer_heads for IO.  
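
A sketch of what that chaining might look like (the 'next' field does not
exist in the current kiobuf, the helper is made up, and the errno /
wait_queue field names are from memory of the 2.4 structure):

	/* hypothetical addition to struct kiobuf */
	struct kiobuf	*next;		/* next sg fragment in this request */

	/* submission walks the chain, much like chained buffer_heads */
	for (iobuf = first; iobuf; iobuf = iobuf->next)
		submit_fragment(q, iobuf);	/* hypothetical helper */

	/* completion stays per fragment */
	static void fragment_end_io(struct kiobuf *iobuf, int error)
	{
		iobuf->errno = error;		/* per-fragment status */
		wake_up(&iobuf->wait_queue);	/* independent wakeup */
	}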

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 18:49                               ` Stephen C. Tweedie
@ 2001-02-05 19:04                                 ` Alan Cox
  2001-02-05 19:09                                 ` Linus Torvalds
  1 sibling, 0 replies; 76+ messages in thread
From: Alan Cox @ 2001-02-05 19:04 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary.  They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset.  Every video framebuffer

start/length per page ?

> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Video is non-contiguous ranges. In fact, if you are blitting to a card with
tiled memory it gets very interesting in its video lists.


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 17:29                             ` Alan Cox
@ 2001-02-05 18:49                               ` Stephen C. Tweedie
  2001-02-05 19:04                                 ` Alan Cox
  2001-02-05 19:09                                 ` Linus Torvalds
  0 siblings, 2 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 18:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, Manfred Spraul, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, Feb 05, 2001 at 05:29:47PM +0000, Alan Cox wrote:
> > 
> > _All_ drivers would have to do that in the degenerate case, because
> > none of our drivers can deal with a dma boundary in the middle of a
> > sector, and even in those places where the hardware supports it in
> > theory, you are still often limited to word-alignment.
> 
> Thats true for _block_ disk devices but if we want a generic kiovec then
> if I am going from video capture to network I dont need to force anything more
> than 4 byte align

Kiobufs have never, ever required the IO to be aligned on any
particular boundary.  They simply make the assumption that the
underlying buffered object can be described in terms of pages with
some arbitrary (non-aligned) start/offset.  Every video framebuffer
I've ever seen satisfies that, so you can easily map an arbitrary
contiguous region of the framebuffer with a kiobuf already.

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 17:20                           ` Stephen C. Tweedie
@ 2001-02-05 17:29                             ` Alan Cox
  2001-02-05 18:49                               ` Stephen C. Tweedie
  0 siblings, 1 reply; 76+ messages in thread
From: Alan Cox @ 2001-02-05 17:29 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, Stephen C. Tweedie, Manfred Spraul, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

> > 	kiovec_align(kiovec, 512);
> > and have it do the bounce buffers ?
> 
> _All_ drivers would have to do that in the degenerate case, because
> none of our drivers can deal with a dma boundary in the middle of a
> sector, and even in those places where the hardware supports it in
> theory, you are still often limited to word-alignment.

That's true for _block_ disk devices but if we want a generic kiovec then
if I am going from video capture to network I don't need to force anything
more than 4 byte align


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 15:19                         ` Alan Cox
@ 2001-02-05 17:20                           ` Stephen C. Tweedie
  2001-02-05 17:29                             ` Alan Cox
  0 siblings, 1 reply; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 17:20 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, Manfred Spraul, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Mon, Feb 05, 2001 at 03:19:09PM +0000, Alan Cox wrote:
> > Yes, it's the sort of thing that you would hope should work, but in
> > practice it's not reliable.
> 
> So the less smart devices need to call something like
> 
> 	kiovec_align(kiovec, 512);
> 
> and have it do the bounce buffers ?

_All_ drivers would have to do that in the degenerate case, because
none of our drivers can deal with a dma boundary in the middle of a
sector, and even in those places where the hardware supports it in
theory, you are still often limited to word-alignment.

--Stephen


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify  + callback chains
  2001-02-05 12:00                     ` Manfred Spraul
  2001-02-05 15:03                       ` Stephen C. Tweedie
@ 2001-02-05 16:56                       ` Linus Torvalds
  1 sibling, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-02-05 16:56 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox



On Mon, 5 Feb 2001, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> > 
> > You simply cannot do physical disk IO on
> > non-sector-aligned memory or in chunks which aren't a multiple of
> > sector size.
> 
> Why not?
> 
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

In fact, regular IDE DMA allows arbitrary scatter-gather at least in
theory. Linux has never used it, so I don't know how well it works in
practice - I would not be surprised if it ends up causing no end of nasty 
corner-cases that have bugs. It's not as if IDE controllers always follow 
the documentation ;)

The _total_ length of the buffers has to be a multiple of the sector
size, and there are some alignment issues (each scatter-gather area has to
be at least 16-bit aligned both in physical memory and in length, and
apparently many controllers need 32-bit alignment). And I'd almost be
surprised if there wouldn't be hardware that wanted cache alignment
because they always expect to burst. 

But despite a lot of likely practical reasons why it won't work for
arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it
wouldn't. And there are bound to be better controllers that could handle
it.
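
(Spelling those constraints out as a check - the sg element layout here is
assumed, not taken from any driver:)

	struct sg_elem {
		unsigned long	addr;	/* bus/physical address */
		unsigned int	len;
	};

	/* could an IDE-style DMA engine take this list as-is? */
	static int sg_list_ok(struct sg_elem *sg, int nr, unsigned int align)
	{
		unsigned int total = 0;
		int i;

		for (i = 0; i < nr; i++) {
			if ((sg[i].addr | sg[i].len) & (align - 1))
				return 0;	/* entry not 16/32-bit aligned */
			total += sg[i].len;
		}
		return (total % 512) == 0;	/* whole sectors only */
	}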

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 11:03                   ` Stephen C. Tweedie
  2001-02-05 12:00                     ` Manfred Spraul
@ 2001-02-05 16:36                     ` Linus Torvalds
  2001-02-05 19:08                       ` Stephen C. Tweedie
  1 sibling, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2001-02-05 16:36 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> 
> On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
> > 
> > Neither the read nor the write are page-aligned. I don't know where you
> > got that idea. It's obviously not true even in the common case: it depends
> > _entirely_ on what the file offsets are, and expecting the offset to be
> > zero is just being stupid. It's often _not_ zero. With networking it is in
> > fact seldom zero, because the network packets are seldom aligned either in
> > size or in location.
> 
> The underlying buffer is.  The VFS (and the current kiobuf code) is
> already happy about IO happening at odd offsets within a page.

Stephen. 

Don't bother even talking about this. You're so damn hung up about the
page cache that it's not funny.

Have you ever thought about other things, like networking, special
devices, stuff like that? They can (and do) have packet boundaries that
have nothing to do with pages what-so-ever. They can have such notions as
packets that contain multiple streams in one packet, where it ends up
being split up into several pieces. Where neither the original packet
_nor_ the final pieces have _anything_ to do with "pages".

THERE IS NO PAGE ALIGNMENT.

So stop blathering about it.

Of _course_ the current kiobuf code has page-alignment assumptions. You
_designed_ it that way. So bringing it up as an example is a circular
argument. And a really stupid one at that, as that's the thing I've been
quoting as the single biggest design bug in all of kiobufs. It's the thing
that makes them entirely useless for things like describing "struct
msghdr" etc. 

We should get _away_ from this page-alignment fallacy. It's not true. It's
not necessarily even true for the page cache - which has no real
fundamental reasons any more for not being able to be a "variable-size"
cache some time in the future (ie it might be a per-address-space decision
on whether the granularity is 1, 2, 4 or more pages).

Anything that designs for "everything is a page" will automatically be
limited for cases where you might sometimes have 64kB chunks of data.

Instead, just face the realization that "everything is a bunch of ranges",
and leave it at that. It's true _already_ - think about fragmented IP
packets. We may not handle it that way completely yet, but the zero-copy
networking is going in this direction.

And as long as you keep on harping about page alignment, you're not going
to play in this game. End of story. 

		Linus


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 15:03                       ` Stephen C. Tweedie
@ 2001-02-05 15:19                         ` Alan Cox
  2001-02-05 17:20                           ` Stephen C. Tweedie
  2001-02-05 22:09                         ` Ingo Molnar
  1 sibling, 1 reply; 76+ messages in thread
From: Alan Cox @ 2001-02-05 15:19 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Manfred Spraul, Stephen C. Tweedie, Linus Torvalds,
	Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox

> Yes, it's the sort of thing that you would hope should work, but in
> practice it's not reliable.

So the less smart devices need to call something like

	kiovec_align(kiovec, 512);

and have it do the bounce buffers ?
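
Something like this, perhaps? kiovec_align() doesn't exist, the bounce path
is only indicated, and the offset/length fields follow the 2.4 kiobuf
layout from memory:

	int kiovec_align(int nr, struct kiobuf *iovec[], int align)
	{
		int i;

		for (i = 0; i < nr; i++) {
			struct kiobuf *iobuf = iovec[i];

			if (((iobuf->offset | iobuf->length) & (align - 1)) == 0)
				continue;	/* already fine for this device */
			/* copy through an aligned bounce buffer and retarget
			   the page list at it (not shown) */
			if (kiobuf_setup_bounce(iobuf, align))	/* hypothetical */
				return -ENOMEM;
		}
		return 0;
	}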



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-05 12:00                     ` Manfred Spraul
@ 2001-02-05 15:03                       ` Stephen C. Tweedie
  2001-02-05 15:19                         ` Alan Cox
  2001-02-05 22:09                         ` Ingo Molnar
  2001-02-05 16:56                       ` Linus Torvalds
  1 sibling, 2 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 15:03 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen C. Tweedie, Linus Torvalds, Christoph Hellwig,
	Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

Hi,

On Mon, Feb 05, 2001 at 01:00:51PM +0100, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> > 
> > You simply cannot do physical disk IO on
> > non-sector-aligned memory or in chunks which aren't a multiple of
> > sector size.
> 
> Why not?
> 
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

But there are.  Many controllers just break down and corrupt things
silently if you don't align the data buffers (Jeff Merkey found this
by accident when he started generating unaligned IOs within page
boundaries in his NWFS code).  And a lot of controllers simply cannot
break a sector dma over a page boundary (at least not without some
form of IOMMU remapping).

Yes, it's the sort of thing that you would hope should work, but in
practice it's not reliable.

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
       [not found] <CA2569EA.00506BBC.00@d73mta05.au.ibm.com>
@ 2001-02-05 15:01 ` Stephen C. Tweedie
  0 siblings, 0 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 15:01 UTC (permalink / raw)
  To: bsuparna
  Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox,
	Christoph Hellwig, Andi Kleen

Hi,

On Mon, Feb 05, 2001 at 08:01:45PM +0530, bsuparna@in.ibm.com wrote:
> 
> >It's the very essence of readahead that we wake up the earlier buffers
> >as soon as they become available, without waiting for the later ones
> >to complete, so we _need_ this multiple completion concept.
> 
> I can understand this in principle, but when we have a single request going
> down to the device that actually fills in multiple buffers, do we get
> notified (interrupted) by the device before all the data in that request
> got transferred ?

It depends on the device driver.  Different controllers will have
different maximum transfer size.  For IDE, for example, we get wakeups
all over the place.  For SCSI, it depends on how many scatter-gather
entries the driver can push into a single on-the-wire request.  Exceed
that limit and the driver is forced to open a new scsi mailbox, and
you get independent completion signals for each such chunk.

> >Which is exactly why we have one kiobuf per higher-level buffer, and
> >we chain together kiobufs when we need to for a long request, but we
> >still get the independent completion notifiers.
> 
> As I mentioned above, the alternative is to have the i/o completion related
> linkage information within the wakeup structures instead. That way, it
> doesn't matter to the lower level driver what higher level structure we
> have above (maybe buffer heads, may be page cache structures, may be
> kiobufs). We only chain together memory descriptors for the buffers during
> the io.

You forgot IO failures: it is essential, once the IO completes, to
know exactly which higher-level structures completed successfully and
which did not.  The low-level drivers have to have access to the
independent completion notifications for this to work.

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait  /notify + callback chains
@ 2001-02-05 14:31 bsuparna
  0 siblings, 0 replies; 76+ messages in thread
From: bsuparna @ 2001-02-05 14:31 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: linux-kernel, kiobuf-io-devel, Alan Cox, Christoph Hellwig, Andi Kleen



>Hi,
>
>On Sun, Feb 04, 2001 at 06:54:58PM +0530, bsuparna@in.ibm.com wrote:
>>
>> Can't we define a kiobuf structure as just this ? A combination of a
>> frag_list and a page_list ?
>

>Then all code which needs to accept an arbitrary kiobuf needs to be
>able to parse both --- ugh.
>

Making this a little more explicit to help analyse tradeoffs:

/* Memory descriptor portion of a kiobuf - this is something that may get
passed around between layers and subsystems */
struct kio_mdesc {
     int nr_frags;            /* number of <offset, len> fragments */
     struct frag *frag_list;  /* fragments, interpreted w.r.t. the page list */
     int nr_pages;            /* number of pages backing those fragments */
     struct page **page_list;
     /* list follows */
};

For block i/o requiring #1 type descriptors, the list could have allocated
extra space for:
struct kio_type1_ext {
     struct frag frag;
     struct page *pages[NUM_STATIC_PAGES];
}

For network i/o or cases requiring #2 type descriptors, the list could have
allocated extra space for:

struct kio_type2_ext {
     struct frag frags[NUM_STATIC_FRAGS];
     struct page *page[NUM_STATIC_FRAGS];
}


struct  kiobuf {
     int            status;
     wait_queue_head_t   waitq;
     struct kio_mdesc    mdesc;
     /* list follows - leaves room for allocation for mem descs, completion
sub structs etc */
}

Code that accepts an arbitrary kiobuf needs to do the following:
     process the fragments one by one
          - type #1 case: only one fragment would typically be there, but
processing it would involve crossing all pages in the page list.
               So extra processing vs a kiobuf with a single <offset, len>
pair involves the following:
                    dereferencing the frag_list pointer
                    checking the nr_frags field
          - type #2 case: the number of fragments would be equal to or
greater than the number of pages, so processing will typically go over each
fragment and thus cross each page in the list one by one.
               So extra processing vs a kiobuf with per-page <offset, len>
pairs involves:
                    dereferencing the page list entry (involves computing the
page index in the page_list from the offset value)
                    checking that offset+len doesn't fall outside the page


This boils down to approximately one extra dereference and one comparison
per kiobuf for the common cases (have I missed something critical?) vs the
most optimized choice of descriptors for those cases.
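
For what it's worth, the per-fragment walk described above might look like
this (struct frag isn't spelled out in this note, so its <offset, len>
layout here is an assumption):

static int kio_mdesc_walk(struct kio_mdesc *md)
{
	int i;

	for (i = 0; i < md->nr_frags; i++) {	/* frag_list deref + nr_frags check */
		unsigned long pos = md->frag_list[i].offset;
		unsigned long end = pos + md->frag_list[i].len;

		while (pos < end) {
			int pgidx = pos >> PAGE_SHIFT;	/* page index from offset */
			unsigned long pgoff = pos & (PAGE_SIZE - 1);
			unsigned long chunk = end - pos;

			if (chunk > PAGE_SIZE - pgoff)
				chunk = PAGE_SIZE - pgoff;
			if (pgidx >= md->nr_pages)
				return -EINVAL;	/* offset+len outside the page_list */
			/* ... do io on md->page_list[pgidx] at pgoff for chunk bytes ... */
			pos += chunk;
		}
	}
	return 0;
}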

In terms of resource consumption (extra bytes taken up), two fields extra
per kiobuf chain (e.g. nr_frags and frag_list pointer when it comes to #1),
i.e. a total of 8 bytes, for the common cases vs the most optimized choice
of structures for those cases.

This seems to be more uniformly balanced across #1 and #2 cases, than an
<offset, len> for every page, as well as an overall <offset, len>. But,
then, come to think of it, since the need for lightweight structures is
greater in the case of #2, should the point of balance (if at all we want
to find one) be tilted towards #2 ?

On the other hand, since having a common structure does involve extra bytes
and cycles, if there are very few situations where we need both #1 and #2,
then converting only at subsystem boundaries (the way i2o does) may turn out
to be better.

Oh well ...


>> BTW, We could have a higher level io container that includes a <status>
>> field and a <wait_queue_head> to take care of i/o completion
>
>IO completion requirements are much more complex.  Think of disk
>readahead: we can create a single request struct for an IO of a
>hundred buffer heads, and as the device driver satisfies that request,
>it wakes up the buffer heads as it goes.  There is a separate
>completion notification for every single buffer head in the chain.
>
I understand the requirement of independent completion notifiers for higher
level buffers/other structures, since they are indeed independently usable
structures. That was one aspect that I thought I was able to address
in the cev_wait design based on wait_queue wakeup functions.
The way it would work is that there would be multiple wakeup functions
registered on the container for the big request, each wakeup function being
responsible for waking up a higher level buffer. This way, the linkage
information is actually external to the buffer structures (which seems
reasonable, since it is only required while the i/o is happening, unless
there is another reason to keep a more lasting association).

>It's the very essence of readahead that we wake up the earlier buffers
>as soon as they become available, without waiting for the later ones
>to complete, so we _need_ this multiple completion concept.
>

I can understand this in principle, but when we have a single request going
down to the device that actually fills in multiple buffers, do we get
notified (interrupted) by the device before all the data in that request
got transferred ? I mean, how do we know that some buffers have become
available until the overall device request has completed (unless of course
the request actually gets broken up at this level and completed bit by
bit).


>Which is exactly why we have one kiobuf per higher-level buffer, and
>we chain together kiobufs when we need to for a long request, but we
>still get the independent completion notifiers.

As I mentioned above, the alternative is to have the i/o completion related
linkage information within the wakeup structures instead. That way, it
doesn't matter to the lower level driver what higher level structure we
have above (maybe buffer heads, may be page cache structures, may be
kiobufs). We only chain together memory descriptors for the buffers during
the io.

>
>Cheers,
> Stephen




* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-02 12:02                 ` Christoph Hellwig
@ 2001-02-05 12:19                   ` Stephen C. Tweedie
  2001-02-05 21:28                     ` Ingo Molnar
  0 siblings, 1 reply; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 12:19 UTC (permalink / raw)
  To: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox, Linus Torvalds

Hi,

On Fri, Feb 02, 2001 at 01:02:28PM +0100, Christoph Hellwig wrote:
> 
> > I may still be persuaded that we need the full scatter-gather list
> > fields throughout, but for now I tend to think that, at least in the
> > disk layers, we may get cleaner results by allow linked lists of
> > page-aligned kiobufs instead.  That allows for merging of kiobufs
> > without having to copy all of the vector information each time.
> 
> But it will have the same problems as the array soloution: there will
> be one complete kio structure for each kiobuf, with it's own end_io
> callback, etc.

And what's the problem with that?

You *need* this.  You have to have that multiple-completion concept in
the disk layers.  Think about chains of buffer_heads being sent to
disk as a single IO --- you need to know which buffers make it to disk
successfully and which had IO errors.

And no, the IO success is *not* necessarily sequential from the start
of the IO: if you are doing IO to raid0, for example, and the IO gets
striped across two disks, you might find that the first disk gets an
error so the start of the IO fails but the rest completes.  It's the
completion code which notifies the caller of what worked and what did
not.

And for readahead, you want to notify the caller as early as possible
about completion for the first part of the IO, even if the device
driver is still processing the rest.

Multiple completions are a necessary feature of the current block
device interface.  Removing that would be a step backwards.

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
       [not found] <CA2569E9.004A4E23.00@d73mta05.au.ibm.com>
@ 2001-02-05 12:09 ` Stephen C. Tweedie
  0 siblings, 0 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 12:09 UTC (permalink / raw)
  To: bsuparna
  Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox,
	Christoph Hellwig, Andi Kleen

Hi,

On Sun, Feb 04, 2001 at 06:54:58PM +0530, bsuparna@in.ibm.com wrote:
> 
> Can't we define a kiobuf structure as just this ? A combination of a
> frag_list and a page_list ?

Then all code which needs to accept an arbitrary kiobuf needs to be
able to parse both --- ugh.

> BTW, We could have a higher level io container that includes a <status>
> field and a <wait_queue_head> to take care of i/o completion

IO completion requirements are much more complex.  Think of disk
readahead: we can create a single request struct for an IO of a
hundred buffer heads, and as the device driver satisfies that request,
it wakes up the buffer heads as it goes.  There is a separate
completion notification for every single buffer head in the chain.

It's the very essence of readahead that we wake up the earlier buffers
as soon as they become available, without waiting for the later ones
to complete, so we _need_ this multiple completion concept.

Which is exactly why we have one kiobuf per higher-level buffer, and
we chain together kiobufs when we need to for a long request, but we
still get the independent completion notifiers.

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify  + callback chains
  2001-02-05 11:03                   ` Stephen C. Tweedie
@ 2001-02-05 12:00                     ` Manfred Spraul
  2001-02-05 15:03                       ` Stephen C. Tweedie
  2001-02-05 16:56                       ` Linus Torvalds
  2001-02-05 16:36                     ` Linus Torvalds
  1 sibling, 2 replies; 76+ messages in thread
From: Manfred Spraul @ 2001-02-05 12:00 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Christoph Hellwig, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox

"Stephen C. Tweedie" wrote:
> 
> You simply cannot do physical disk IO on
> non-sector-aligned memory or in chunks which aren't a multiple of
> sector size.

Why not?

Obviously the disk access itself must be sector aligned and the total
length must be a multiple of the sector length, but there shouldn't be
any restrictions on the data buffers.

I remember that even Windoze 95 has scatter-gather support for physical
disk IO with arbitrary buffer chunks. (If the hardware supports it;
otherwise the io subsystem will copy the data into a contiguous
temporary buffer.)

--
	Manfred

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-03 20:28                 ` Linus Torvalds
@ 2001-02-05 11:03                   ` Stephen C. Tweedie
  2001-02-05 12:00                     ` Manfred Spraul
  2001-02-05 16:36                     ` Linus Torvalds
  0 siblings, 2 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-05 11:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Christoph Hellwig, Steve Lord, linux-kernel,
	kiobuf-io-devel, Alan Cox

Hi,

On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
> 
> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> > 
> Neither the read nor the write are page-aligned. I don't know where you
> got that idea. It's obviously not true even in the common case: it depends
> _entirely_ on what the file offsets are, and expecting the offset to be
> zero is just being stupid. It's often _not_ zero. With networking it is in
> fact seldom zero, because the network packets are seldom aligned either in
> size or in location.

The underlying buffer is.  The VFS (and the current kiobuf code) is
already happy about IO happening at odd offsets within a page.
However, the more general case --- doing zero-copy IO on arbitrary
unaligned buffers --- simply won't work if you expect to be able to
push those buffers to disk without a copy.  

The splice case you talked about is fine because it's doing the normal
prepare/commit logic where the underlying buffer is page aligned, even
if the splice IO is not to a page aligned location.  That's _exactly_
what kiobufs were intended to support.  The prepare_read/prepare_write/
pull/push cycle lets the caller tell the pull() function where to
store its data, because there are alignment constraints which just
can't be ignored: you simply cannot do physical disk IO on
non-sector-aligned memory or in chunks which aren't a multiple of
sector size.  (The buffer address alignment can sometimes be relaxed
--- obviously if you're doing PIO then it doesn't matter --- but the
length granularity is rigidly enforced.)
 
Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 22:07               ` Stephen C. Tweedie
  2001-02-02 12:02                 ` Christoph Hellwig
@ 2001-02-03 20:28                 ` Linus Torvalds
  2001-02-05 11:03                   ` Stephen C. Tweedie
  1 sibling, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2001-02-03 20:28 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox



On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> 
> On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
> 
> > I think you want the whole kio concept only for disk-like IO.  
> 
> No.  I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache --- all of which are page-aligned.  
> 
> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
> 
> 	len = prepare_read();
> 	prepare_write();
> 	pull_fd();
> 	commit_write();
> 
> in which the read is being pulled into a known location in the page
> cache -- it's page-aligned, again.

Wrong.

Neither the read nor the write are page-aligned. I don't know where you
got that idea. It's obviously not true even in the common case: it depends
_entirely_ on what the file offsets are, and expecting the offset to be
zero is just being stupid. It's often _not_ zero. With networking it is in
fact seldom zero, because the network packets are seldom aligned either in
size or in location.

Also, there are many reasons why "page" may have different meaning. We
will eventually have a page-cache where the pagecache granularity is not
the same as the user-level visible one. User-level may do mmap at 4kB
boundaries, even if the page cache itself uses 8kB or 16kB pages.

THERE IS NO PAGE-ALIGNMENT. And anything that even _mentions_ the word
page-aligned is going into my trash-can faster than you can say "bug".

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait  /notify + callback chains
@ 2001-02-02 15:31 bsuparna
  0 siblings, 0 replies; 76+ messages in thread
From: bsuparna @ 2001-02-02 15:31 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Ben LaHaise, linux-kernel, kiobuf-io-devel


>Hi,
>
>On Thu, Feb 01, 2001 at 01:28:33PM +0530, bsuparna@in.ibm.com wrote:
>>
>> Here's a second pass attempt, based on Ben's wait queue extensions:
> Does this sound any better ?
>
>It's a mechanism, all right, but you haven't described what problems
>it is trying to solve, and where it is likely to be used, so it's hard
>to judge it. :)

Hmm .. I thought I had done that in my first posting, but obviously I
must not have done a good job of expressing it, so let me take another stab
at conveying why I started on this.

There are certain specific situations that I have in mind right now, but
the very nature of the abstraction makes it quite likely that there are
uses in other situations which I may not have thought of yet, or just do
not understand well enough to vouch for at this point. What those
situations could be, and the associated issues involved (especially
performance related), is something that I hope other people on this forum
can help pinpoint, based on their experience and areas of expertise.

I do realize that something generic, yet simple and performance-optimal in
all kinds of situations, is a really difficult (if not impossible :-) )
thing to achieve, but even then, wouldn't it be nice to at least abstract
out the uniformity in patterns across situations, in a way which can be
tweaked/tuned for each specific class of situations ?

And the nice thing I see about Ben's wait queue extensions is that they
give us a route to try to do that ...

Some needs considered (and associated problems):

a. Stacking of completion events - asynchronously, through multiple layers
     - layered drivers  (encryption, conversion)
     - filter filesystems
    Key aspects:
     1. It should be possible to pass the same (original) i/o container
structure all the way down (no copies/clones should need to happen, unless
actual i/o splitting, or extra buffer space or multiple sub-ios are
involved)
     2. Transparency: Neither the upper layer nor the layer below it should
need to have any specific knowledge about the existence/absence of an
intermediate filter layer (the mechanism should hide all that)
     3. LIFO ordering of completion actions
     4. The i/o structure should be marked as up-to-date only after all the
completion actions are done.
     5. Preferably have waiters on the i/o structure woken up only after
all completion actions are through (to avoid spurious/redundant wakeups
since the data won't be ready for use)
     6. Possible to have completion actions execute later in task context

b. Co-relation between multiple completion events and their associated
operations and data structures
     -  (bottom up aspect) merging results of split i/o requests, and
marking the completion of the compound i/o through multiple such layers
(tree), e.g
          - lvm
          - md / raid
          - evms aggregator features
     - (top down aspect) cascading down i/o cancellation requests /
sub-event waits , monitoring sub-io status etc
      Some aspects:
     1. Result of collation of sub-i/os may be driver specific  (In some
situations like lvm  - each sub i/o maps to a particular portion of a
buffer; with software raid or some other kind of scheme the collation may
involve actually interpreting the data read)
     2. Re-start/retries of sub-ios (in case of errors) can be handled.
     3. Transparency: Neither the upper layer nor the layer below it
should need to have any specific knowledge about the existence/absence of
an intermediate layer (that sends out multiple sub i/os)
     4. The system should be devised to avoid extra logic/fields in the
generic i/o structures being passed around, in situations where no compound
i/o is involved (i.e. in the simple i/o cases and most common situations).
As far as possible it is desirable to keep the linkage information outside
of the i/o structure for this reason.
     5. Possible to have collation/completion actions execute later in task
context


Ben LaHaise's wait queue extensions take care of most of the aspects of
(a), if used with a little care to ensure a(4).
[This just means that the function that marks the i/o structure as
up-to-date should be put in the completion queue first.]
With this, we don't even need an explicit end_io() in bh/kiobufs etc. Just
the wait queue would do.

Only a(5) needs some thought since cache efficiency is upset by changing
the ordering of waits.

But (b) needs a little more work, as a higher level construct/mechanism
that latches on to the wait queue extensions. That is what the cev_wait
structure was designed for.
It keeps the chaining information outside of the i/o structures by default
(they can be allocated together if desired, anyway).
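
To give a rough feel for what I mean (purely hypothetical - the field and
function names below are made up for illustration and are not the actual
proposal):

	struct cev_wait {
		wait_queue_t            wait;    /* hooks onto the sub-i/o's wait queue */
		struct cev_wait *       parent;  /* compound event this one reports to  */
		atomic_t                pending; /* sub-events still outstanding        */
		void                    (*done)(struct cev_wait *, int status);
		void *                  data;    /* caller private data                 */
	};

A layer that splits an i/o would set up one parent cev_wait with pending =
number of sub-ios, register one child per sub-io on that sub-io's wait
queue, and let the last child that fires call parent->done() - all without
adding any linkage fields to the generic i/o structures themselves.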

Is this still too much in the air ? Maybe I should describe the flow in a
specific scenario to illustrate ?

Regards
Suparna


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-02 11:51                 ` Christoph Hellwig
@ 2001-02-02 14:04                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-02 14:04 UTC (permalink / raw)
  To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

Hi,

On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
> > 
> > If I have a page vector with a single offset/length pair, I can build
> > a new header with the same vector and modified offset/length to split
> > the vector in two without copying it.
> 
> You just say in the higher-level structure ignore from x to y even if
> they have an offset in their own vector.

Exactly --- and so you end up with something _much_ uglier, because
you end up with all sorts of combinations of length/offset fields all
over the place.

This is _precisely_ the mess I want to avoid.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-02  4:18           ` bcrl
@ 2001-02-02 12:12             ` Christoph Hellwig
  0 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-02 12:12 UTC (permalink / raw)
  To: bcrl
  Cc: Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel,
	kiobuf-io-devel

On Thu, Feb 01, 2001 at 11:18:56PM -0500, bcrl@redhat.com wrote:
> On Thu, 1 Feb 2001, Christoph Hellwig wrote:
> 
> > A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
> > used for caching data, a kiobuf not.
> 
> Go measure the cost of a distant cache miss, then complain about having
> everything in one structure.  Also, 1 kiobuf maps 16-128 times as much
> data as a single buffer head.

I'd never dispute that.  It was just an answer to Stephen's "a kiobuf is
already smaller".

> > enum kio_flags {
> > 	KIO_LOANED,     /* the calling subsystem wants this buf back    */
> > 	KIO_GIFTED,     /* thanks for the buffer, man!                  */
> > 	KIO_COW         /* copy on write (XXX: not yet)                 */
> > };
> 
> This is a Really Bad Idea.  Having semantics depend on a subtle flag
> determined by a caller is a sure way to

The semantics aren't different for the using subsystem.  LOANED vs GIFTED
is an issue for the free function; COW will probably be a page-level mm
thing - though I haven't thought a lot about it yet and am not sure whether
it actually makes sense.

> 
> >
> >
> > struct kio {
> > 	struct kiovec *         kio_data;       /* our kiovecs          */
> > 	int                     kio_ndata;      /* # of kiovecs         */
> > 	int                     kio_flags;      /* loaned or gifted?    */
> > 	void *                  kio_priv;       /* caller private data  */
> > 	wait_queue_head_t       kio_wait;	/* wait queue           */
> > };
> >
> > makes it a lot simpler for the subsytems to integrate.
> 
> Keep in mind that using distant memory allocations for kio_data will incur
> additional cache misses.

It could also be a [0] array at the end, allowing for a single allocation,
but that looks more like an implementation detail than a design problem to me.
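
Something like this, just to illustrate the single-allocation variant
(reusing the struct kiovec above; illustrative only, not a worked-out
proposal):

	struct kio {
		int                     kio_ndata;      /* # of kiovecs         */
		int                     kio_flags;      /* loaned or gifted?    */
		void *                  kio_priv;       /* caller private data  */
		wait_queue_head_t       kio_wait;       /* wait queue           */
		struct kiovec           kio_data[0];    /* vecs in same alloc   */
	};

	/* one allocation then covers both the header and the vector: */
	struct kio *kio = kmalloc(sizeof(*kio) + ndata * sizeof(struct kiovec),
	                          GFP_KERNEL);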

> The atomic count is probably going to be widely
> used; I see it being applicable to the network stack, block io layers and
> others.

Hmm.  Currently it is used only for the multiple buffer_head's per iobuf
cruft, and I don't see why multiple outstanding IOs should be noted in a
kiobuf.

> Also, how is information about io completion status passed back
> to the caller?

Yes, there needs to be a kio_errno field - though I wanted to get rid of
it, I had to re-add it in later versions of my design.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 22:07               ` Stephen C. Tweedie
@ 2001-02-02 12:02                 ` Christoph Hellwig
  2001-02-05 12:19                   ` Stephen C. Tweedie
  2001-02-03 20:28                 ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-02 12:02 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox, Linus Torvalds

On Thu, Feb 01, 2001 at 10:07:44PM +0000, Stephen C. Tweedie wrote:
> No.  I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache --- all of which are page-aligned.

Yes.

> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
> 
> 	len = prepare_read();
> 	prepare_write();
> 	pull_fd();
> 	commit_write();

Yepp.

> in which the read is being pulled into a known location in the page
> cache -- it's page-aligned, again.  I'm perfectly willing to accept
> that there may be a need for scatter-gather boundaries including
> non-page-aligned fragments in this model, but I can't see one if
> you're using the page cache as a mediator, nor if you're doing it
> through a user mmapped buffer.

True.

> The only reason you need finer scatter-gather boundaries --- and it
> may be a compelling reason --- is if you are merging multiple IOs
> together into a single device-level IO.  That makes perfect sense for
> the zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
> doesn't help the existing SGI kiobuf block device code, because that
> performs its merging in the filesystem layers and the block device
> code just squirts the IOs to the wire as-is,

Yes - but that is no solution for a generic model.  AFAICS even XFS
falls back to buffer_heads for small requests.

> but if we want to start
> merging those kiobuf-based IOs within make_request() then the block
> device layer may want it too.

Yes.

> And Linus is right, the old way of using a *kiobuf[] for that was
> painful, but the solution of adding start/length to every entry in
> the page vector just doesn't sit right with many components of the
> block device environment either.

What do you think is the alternative?

> I may still be persuaded that we need the full scatter-gather list
> fields throughout, but for now I tend to think that, at least in the
> disk layers, we may get cleaner results by allow linked lists of
> page-aligned kiobufs instead.  That allows for merging of kiobufs
> without having to copy all of the vector information each time.

But it will have the same problems as the array solution: there will
be one complete kio structure for each kiobuf, with its own end_io
callback, etc.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 21:25               ` Stephen C. Tweedie
@ 2001-02-02 11:51                 ` Christoph Hellwig
  2001-02-02 14:04                   ` Stephen C. Tweedie
  0 siblings, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-02 11:51 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 09:25:08PM +0000, Stephen C. Tweedie wrote:
> > No.  Just allow passing multiples of the device's blocksize over
> > ll_rw_block.
> 
> That was just one example: you need the sub-ios just as much when
> you split up an IO over stripe boundaries in LVM or raid0, for
> example.

IIRC that's why you designed (and I thought of independently) clone-kiobufs.

> Secondly, ll_rw_block needs to die anyway: you can expand
> the blocksize up to PAGE_SIZE but not beyond, whereas something like
> ll_rw_kiobuf can submit a much larger IO atomically (and we have
> devices which don't start to deliver good throughput until you use
> IO sizes of 1MB or more).

Completely agreed.

> If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
> to split it in two, I have to make two new vectors (page X, offset 0,
> length n) and (page X, offset n, length PAGE_SIZE-n).  That implies
> copying both vectors.
> 
> If I have a page vector with a single offset/length pair, I can build
> a new header with the same vector and modified offset/length to split
> the vector in two without copying it.

You just say in the higher-level structure ignore from x to y even if
they have an offset in their own vector.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:14         ` Christoph Hellwig
  2001-02-01 18:25           ` Alan Cox
  2001-02-01 19:32           ` Stephen C. Tweedie
@ 2001-02-02  4:18           ` bcrl
  2001-02-02 12:12             ` Christoph Hellwig
  2 siblings, 1 reply; 76+ messages in thread
From: bcrl @ 2001-02-02  4:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Christoph Hellwig wrote:

> A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
> used for caching data, a kiobuf not.

Go measure the cost of a distant cache miss, then complain about having
everything in one structure.  Also, 1 kiobuf maps 16-128 times as much
data as a single buffer head.

> enum kio_flags {
> 	KIO_LOANED,     /* the calling subsystem wants this buf back    */
> 	KIO_GIFTED,     /* thanks for the buffer, man!                  */
> 	KIO_COW         /* copy on write (XXX: not yet)                 */
> };

This is a Really Bad Idea.  Having semantics depend on a subtle flag
determined by a caller is a sure way to

>
>
> struct kio {
> 	struct kiovec *         kio_data;       /* our kiovecs          */
> 	int                     kio_ndata;      /* # of kiovecs         */
> 	int                     kio_flags;      /* loaned or gifted?    */
> 	void *                  kio_priv;       /* caller private data  */
> 	wait_queue_head_t       kio_wait;	/* wait queue           */
> };
>
> makes it a lot simpler for the subsytems to integrate.

Keep in mind that using distant memory allocations for kio_data will incur
additional cache misses.  The atomic count is probably going to be widely
used; I see it being applicable to the network stack, block io layers and
others.  Also, how is information about io completion status passed back
to the caller?  That information is required across layers so that io can
be properly aborted or proceed with the partial amount of io.  Add those
back in and we're right back to the original kiobuf structure.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:33             ` Christoph Hellwig
  2001-02-01 20:56               ` Steve Lord
  2001-02-01 21:44               ` Stephen C. Tweedie
@ 2001-02-01 22:07               ` Stephen C. Tweedie
  2001-02-02 12:02                 ` Christoph Hellwig
  2001-02-03 20:28                 ` Linus Torvalds
  2 siblings, 2 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 22:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel,
	Alan Cox, Linus Torvalds

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:

> I think you want the whole kio concept only for disk-like IO.  

No.  I want something good for zero-copy IO in general, but a lot of
that concerns the problem of interacting with the user, and the basic
center of that interaction in 99% of the interesting cases is either a
user VM buffer or the page cache --- all of which are page-aligned.  

If you look at the sorts of models being proposed (even by Linus) for
splice, you get

	len = prepare_read();
	prepare_write();
	pull_fd();
	commit_write();

in which the read is being pulled into a known location in the page
cache -- it's page-aligned, again.  I'm perfectly willing to accept
that there may be a need for scatter-gather boundaries including
non-page-aligned fragments in this model, but I can't see one if
you're using the page cache as a mediator, nor if you're doing it
through a user mmapped buffer.

The only reason you need finer scatter-gather boundaries --- and it
may be a compelling reason --- is if you are merging multiple IOs
together into a single device-level IO.  That makes perfect sense for
the zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
doesn't help the existing SGI kiobuf block device code, because that
performs its merging in the filesystem layers and the block device
code just squirts the IOs to the wire as-is, but if we want to start
merging those kiobuf-based IOs within make_request() then the block
device layer may want it too.

And Linus is right, the old way of using a *kiobuf[] for that was
painful, but the solution of adding start/length to every entry in
the page vector just doesn't sit right with many components of the
block device environment either.

I may still be persuaded that we need the full scatter-gather list
fields throughout, but for now I tend to think that, at least in the
disk layers, we may get cleaner results by allowing linked lists of
page-aligned kiobufs instead.  That allows for merging of kiobufs
without having to copy all of the vector information each time.

The killer, however, is what happens if you want to split such a
merged kiobuf.  Right now, that's something that I can only imagine
happening in the block layers if we start encoding buffer_head chains
as kiobufs, but if we do that in the future, or if we start merging
genuine kiobuf requests, then doing that split later on (for
raid0 etc) may require duplicating whole chains of kiobufs.  At that
point, just doing scatter-gather lists is cleaner.

But for now, the way to picture what I'm trying to achieve is that
kiobufs are a bit like buffer_heads --- they represent the physical
pages of some VM object that a higher layer has constructed, such as
the page cache or a user VM buffer.  You can chain these objects
together for IO, but that doesn't stop the individual objects from
being separate entities with independent IO completion callbacks to be
honoured.  

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:33             ` Christoph Hellwig
  2001-02-01 20:56               ` Steve Lord
@ 2001-02-01 21:44               ` Stephen C. Tweedie
  2001-02-01 22:07               ` Stephen C. Tweedie
  2 siblings, 0 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 21:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel, Alan Cox

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
> 
> > On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks).  The data which the user
> > started with is the data sent out on the wire.  You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer. 
> 
> Though your description is right, I don't think the case is very common:
> Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.

On raid0 stripes, it's common to have stripes of between 16k and 64k,
so it's rather more common there than you'd like.  In any case, you
need the code to handle it, and I don't want to make the code paths
any more complex than necessary.

> In raid1 you need some kind of clone iobuf, which should work with both
> cases.  In raid0 you need a complete new pagelist anyway

No you don't.  You take the existing one, specify which region of it
is going to the current stripe, and send it off.  Nothing more.

> > In that case, having offset/len as the kiobuf limit markers is ideal:
> > you can clone a kiobuf header using the same page vector as the
> > parent, narrow down the start/end points, and continue down the stack
> > without having to copy any part of the page list.  If you had the
> > offset/len data encoded implicitly into each entry in the sglist, you
> > would not be able to do that.
> 
> Sure you could: you embedd that information in a higher-level structure.

What's the point in a common data container structure if you need
higher-level information to make any sense out of it?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:46             ` Christoph Hellwig
@ 2001-02-01 21:25               ` Stephen C. Tweedie
  2001-02-02 11:51                 ` Christoph Hellwig
  0 siblings, 1 reply; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 21:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 09:46:27PM +0100, Christoph Hellwig wrote:

> > Right now we can take a kiobuf and turn it into a bunch of
> > buffer_heads for IO.  The io_count lets us track all of those sub-IOs
> > so that we know when all submitted IO has completed, so that we can
> > pass the completion callback back up the chain without having to
> > allocate yet more descriptor structs for the IO.
> 
> > Again, remove this and the IO becomes more heavyweight because we need
> > to create a separate struct for the info.
> 
> No.  Just allow passing multiples of the device's blocksize over
> ll_rw_block.

That was just one example: you need the sub-ios just as much when
you split up an IO over stripe boundaries in LVM or raid0, for
example.  Secondly, ll_rw_block needs to die anyway: you can expand
the blocksize up to PAGE_SIZE but not beyond, whereas something like
ll_rw_kiobuf can submit a much larger IO atomically (and we have
devices which don't start to deliver good throughput until you use
IO sizes of 1MB or more).

> >> and the lack of
> >> scatter gather in one kiobuf struct (you always need an array)
> 
> > Again, _all_ data being sent down through the block device layer is
> > either in buffer heads or is page aligned.
> 
> That's the point.  You are always talking about the block-layer only.

I'm talking about why the minimal, generic solution doesn't provide
what the block layer needs.


> > Obviously, extra code will be needed to scan kiobufs if we do that,
> > and unless we have both per-page _and_ per-kiobuf start/offset pairs
> > (adding even further to the complexity), those scatter-gather lists
> > would prevent us from carving up a kiobuf into smaller sub-ios without
> > copying the whole (expanded) vector.
> 
> No.  I think I explained that in my last mail.

How?

If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
to split it in two, I have to make two new vectors (page X, offset 0,
length n) and (page X, offset n, length PAGE_SIZE-n).  That implies
copying both vectors.

If I have a page vector with a single offset/length pair, I can build
a new header with the same vector and modified offset/length to split
the vector in two without copying it.
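
Roughly like this, to spell it out (alloc_kiobuf_header() is a made-up
helper and error handling is omitted - the point is just that the page
array itself is shared, never copied):

	struct kiobuf *split_kiobuf(struct kiobuf *parent, int split_at)
	{
		struct kiobuf *rest = alloc_kiobuf_header();

		rest->maplist  = parent->maplist;       /* same page vector  */
		rest->nr_pages = parent->nr_pages;
		rest->offset   = parent->offset + split_at;
		rest->length   = parent->length - split_at;

		parent->length = split_at;              /* parent keeps head */
		return rest;
	}

and the two headers can then be pushed down the stack independently.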

> > Possibly, but I remain to be convinced, because you may end up with a
> > mechanism which is generic but is not well-tuned for any specific
> > case, so everything goes slower.
> 
> As kiobufs are widely used for real IO, just as containers, this is
> better than nothing.

Surely having all of the subsystems working fast is better still?

> And IMHO a nice generic concept that lets different subsystems work
> together is a _lot_ better than a bunch of over-optimized, rather isolated
> subsystems.  The IO-Lite people have done nice research on the effect of
> a unified IO-caching system vs. the typical isolated systems.

I know, and IO-Lite has some major problems (the close integration of
that code into the cache, for example, makes it harder to expose the
zero-copy to user-land).

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:59                 ` Christoph Hellwig
@ 2001-02-01 21:17                   ` Steve Lord
  0 siblings, 0 replies; 76+ messages in thread
From: Steve Lord @ 2001-02-01 21:17 UTC (permalink / raw)
  To: Steve Lord, Stephen C . Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox

> On Thu, Feb 01, 2001 at 02:56:47PM -0600, Steve Lord wrote:
> > And if you are writing to a striped volume via a filesystem which can do
> > its own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
> > is striped on 64K boundaries.
> 
> But usually I want to have pages 0-63, 128-191, etc together, because they are
> contiguous on disk, or?

I was just giving an example of how kiobufs might need splitting up more often
than you think; crossing a stripe boundary is one obvious case. Yes, you do
want to keep the pages which are contiguous on disk together, but you will
often get requests which cover multiple stripes; otherwise you don't really
get much out of striping and may as well just concatenate drives.

Ideally the file is striped across the various disks in the volume, and one
large write (direct or from the cache) gets scattered across the disks. All
the I/O's run in parallel (and on different controllers if you have the 
budget).

Steve

> 
> 	Christoph
> 
> -- 
> Of course it doesn't work. We've performed a software upgrade.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:56               ` Steve Lord
@ 2001-02-01 20:59                 ` Christoph Hellwig
  2001-02-01 21:17                   ` Steve Lord
  0 siblings, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 20:59 UTC (permalink / raw)
  To: Steve Lord; +Cc: Stephen C . Tweedie, linux-kernel, kiobuf-io-devel, Alan Cox

On Thu, Feb 01, 2001 at 02:56:47PM -0600, Steve Lord wrote:
> And if you are writing to a striped volume via a filesystem which can do
> its own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
> is striped on 64K boundaries.

But usually I want to have pages 0-63, 128-191, etc together, because they are
contiguous on disk, or?

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 20:33             ` Christoph Hellwig
@ 2001-02-01 20:56               ` Steve Lord
  2001-02-01 20:59                 ` Christoph Hellwig
  2001-02-01 21:44               ` Stephen C. Tweedie
  2001-02-01 22:07               ` Stephen C. Tweedie
  2 siblings, 1 reply; 76+ messages in thread
From: Steve Lord @ 2001-02-01 20:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: "Stephen C. Tweedie",
	Steve Lord, linux-kernel,
	kiobuf-io-devel@lists.sourceforge.net Alan Cox

> In article <20010201174946.B11607@redhat.com> you wrote:
> > Hi,
> 
> > On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks).  The data which the user
> > started with is the data sent out on the wire.  You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer. 
> 
> Though your description is right, I don't think the case is very common:
> Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.


And if you are writing to a striped volume via a filesystem which can do
its own I/O clustering, e.g. I throw 500 pages at LVM in one go and LVM
is striped on 64K boundaries.
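
(For scale, assuming 4K pages: 500 pages is 2000K of data, which covers a
little over 31 chunks of 64K - so a single clustered write like that
crosses a stripe boundary roughly 31 times.)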

Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 19:32           ` Stephen C. Tweedie
@ 2001-02-01 20:46             ` Christoph Hellwig
  2001-02-01 21:25               ` Stephen C. Tweedie
  0 siblings, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 20:46 UTC (permalink / raw)
  To: "Stephen C. Tweedie"; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

In article <20010201193221.D11607@redhat.com> you wrote:
> Buffer_heads are _sometimes_ used for caching data.

Actually they are mostly used, but that shouldn't have any bearing on the
discussion...

> That's one of the
> big problems with them, they are too overloaded, being both IO
> descriptors _and_ cache descriptors.

Agreed.

> If you've got 128k of data to
> write out from user space, do you want to set up one kiobuf or 256
> buffer_heads?  Buffer_heads become really very heavy indeed once you
> start doing non-trivial IO.

Sure - I was never arguing in favor of buffer_head's ...

>> > What is so heavyweight in the current kiobuf (other than the embedded
>> > vector, which I've already noted I'm willing to cut)?
>> 
>> array_len

> kiobufs can be reused after IO.  You can depopulate a kiobuf,
> repopulate it with new pages and submit new IO without having to
> deallocate the kiobuf.  You can't do this without knowing how big the
> data vector is.  Removing that functionality will prevent reuse,
> making them _more_ heavyweight.

>> io_count,

> Right now we can take a kiobuf and turn it into a bunch of
> buffer_heads for IO.  The io_count lets us track all of those sub-IOs
> so that we know when all submitted IO has completed, so that we can
> pass the completion callback back up the chain without having to
> allocate yet more descriptor structs for the IO.

> Again, remove this and the IO becomes more heavyweight because we need
> to create a separate struct for the info.

No.  Just allow passing multiples of the device's blocksize over
ll_rw_block.  XFS is doing that and it just needs an audit of the lesser
used block drivers.

>> and the lack of
>> scatter gather in one kiobuf struct (you always need an array)

> Again, _all_ data being sent down through the block device layer is
> either in buffer heads or is page aligned.

That's the point.  You are always talking about the block-layer only.
And I think it should be generic instead.
Looks like that is the major point.

> You want us to triple the
> size of the "heavyweight" kiobuf's data vector for what gain, exactly?

double.

> Obviously, extra code will be needed to scan kiobufs if we do that,
> and unless we have both per-page _and_ per-kiobuf start/offset pairs
> (adding even further to the complexity), those scatter-gather lists
> would prevent us from carving up a kiobuf into smaller sub-ios without
> copying the whole (expanded) vector.

No.  I think I explained that in my last mail.

> That's a _lot_ of extra complexity in the disk IO layers.

> Possibly, but I remain to be convinced, because you may end up with a
> mechanism which is generic but is not well-tuned for any specific
> case, so everything goes slower.

As kiobufs are widely used for real IO, just as containers, this is
better than nothing.
And IMHO a nice generic concept that lets different subsystems work
together is a _lot_ better than a bunch of over-optimized, rather isolated
subsystems.  The IO-Lite people have done nice research on the effect of
a unified IO-caching system vs. the typical isolated systems.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:49           ` Stephen C. Tweedie
  2001-02-01 17:09             ` Chaitanya Tumuluri
@ 2001-02-01 20:33             ` Christoph Hellwig
  2001-02-01 20:56               ` Steve Lord
                                 ` (2 more replies)
  1 sibling, 3 replies; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 20:33 UTC (permalink / raw)
  To: "Stephen C. Tweedie"
  Cc: Steve Lord, linux-kernel, kiobuf-io-devel@lists.sourceforge.net Alan Cox

In article <20010201174946.B11607@redhat.com> you wrote:
> Hi,

> On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks).  The data which the user
> started with is the data sent out on the wire.  You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer. 

Though your description is right, I don't think the case is very common:
Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.

In raid1 you need some kind of clone iobuf, which should work with both
cases.  In raid0 you need a complete new pagelist anyway, same for raid5.


> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list.  If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

Sure you could: you embedd that information in a higher-level structure.
I think you want the whole kio concept only for disk-like IO.  Then many
of the things you do are completely right and I don't see many problems
(besides thinking that some things may go away - but that's no major point).

With a generic object that is used over subsytem boundaries things are
different.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:41       ` Stephen C. Tweedie
  2001-02-01 18:14         ` Christoph Hellwig
@ 2001-02-01 20:04         ` Chaitanya Tumuluri
  1 sibling, 0 replies; 76+ messages in thread
From: Chaitanya Tumuluri @ 2001-02-01 20:04 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > > 
> > > > No, and with the current kiobufs it would not make sense, because they
> > > > are to heavy-weight.
> > > 
> > > Really?  In what way?  
> > 
> > We can't allocate a huge kiobuf structure just for requesting one page of
> > IO.  It might get better with VM-level IO clustering though.
> 
> A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> allocate a buffer_head per block for all IO.
> 
> A kiobuf contains enough embedded page vector space for 16 pages by
> default, but I'm happy enough to remove that if people want.  However,
> note that that memory is not initialised, so there is no memory access
> cost at all for that empty space.  Remove that space and instead of
> one memory allocation per kiobuf, you get two, so the cost goes *UP*
> for small IOs.
> 
> > > > With page,length,offsett iobufs this makes sense
> > > > and is IMHO the way to go.
> > > 
> > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > > lean enough to do the job??
> > 
> > No.  I was speaking about the light-weight kiobuf Linus & me discussed on
> > lkml some time ago (though I'd much rather call it kiovec, analogous
> > to BSD iovecs).
> 
> What is so heavyweight in the current kiobuf (other than the embedded
> vector, which I've already noted I'm willing to cut)?


Hi,

It'd seem that "array_len", "locked", "bounced", "io_count" and "errno" 
are the fields that need to go away (apart from the "maplist").

The field "nr_pages" would reincarnate in the kiovec struct (which is
not a plain array anymore) as the field "nbufs". See below.

Based on what I've seen fly by on the lists here's my understanding of 
the proposed new kiobuf/kiovec structures:

===========================================================================
/*
 * a simple page,offset,length tuple like Linus wants it
 */
struct kiobuf {
	struct page *   page;   /* The page itself               */
	u_16       	offset; /* Offset to start of valid data */
	u_16       	length; /* Number of valid bytes of data */
};

struct kiovec {
	int             nbufs;          /* Kiobufs actually referenced */
	struct kiobuf * bufs;
}

/*
 * the name is just plain stupid, but that shouldn't matter
 */
struct vfs_kiovec {
        struct kiovec * iov;

        /* private data, mostly for the callback */
        void * private;

        /* completion callback */
        void (*end_io)  (struct vfs_kiovec *);
        wait_queue_head_t wait_queue;
};
===========================================================================
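
If I've read the proposal right, a caller would then do something roughly
like the following (the submission function is hypothetical and the
allocation/locking details are glossed over - this is just to check my
understanding):

	struct kiobuf bufs[2];
	struct kiovec iov;
	struct vfs_kiovec vio;

	bufs[0].page = page0; bufs[0].offset = 512; bufs[0].length = 3584;
	bufs[1].page = page1; bufs[1].offset = 0;   bufs[1].length = 4096;

	iov.nbufs = 2;
	iov.bufs  = bufs;

	vio.iov     = &iov;
	vio.private = my_cookie;
	vio.end_io  = my_end_io;                /* completion callback */
	init_waitqueue_head(&vio.wait_queue);

	ll_rw_kiovec(WRITE, &vio);              /* hypothetical entry point */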

Is this correct? 

If so, I have a few questions/clarifications:

	- The [ll_rw_blk, scsi/ide request-functions, scsi/ide 
	  I/O completion handling] functions would be handed the 
	  "X_kiovec" struct, correct?

	- So, the soft-RAID / LVM layers need to construct their 
	  own "lvm_kiovec" structs to handle request splits and
	  the partial completions, correct? 

	- Then, what are the semantics of request-merges containing 
	  the "X_kiovec" structs in the block I/O queueing layers?
	  Do we add "X_kiovec->next", "X_kiovec->prev" etc. fields?

	  It will also require a re-allocation of a new and longer
	  kiovec->bufs array, correct?
	  
	- How are I/O error codes to be propagated back to the 
	  higher (calling) layers? I think that needs to be added
	  into the "X_kiovec" struct, no?

	- How is bouncing to be handled with this setup? (some state 
	  is needed to (a) determine that bouncing occurred, (b) find 
	  out which pages have been bounced and, (c) find out the 
	  bounce-page for each of these bounced pages).

Cheers,
-Chait.






-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:49           ` Christoph Hellwig
  2001-02-01 17:58             ` Alan Cox
@ 2001-02-01 19:33             ` Stephen C. Tweedie
  1 sibling, 0 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 19:33 UTC (permalink / raw)
  To: Alan Cox, Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 06:49:50PM +0100, Christoph Hellwig wrote:
> 
> > Adding tons of base/limit pairs to kiobufs makes it worse not better
> 
> For disk I/O it makes the handling a little easier for the cost of the
> additional offset/length fields.

Umm, actually, no, it makes it much worse for many of the cases.  

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:14         ` Christoph Hellwig
  2001-02-01 18:25           ` Alan Cox
@ 2001-02-01 19:32           ` Stephen C. Tweedie
  2001-02-01 20:46             ` Christoph Hellwig
  2001-02-02  4:18           ` bcrl
  2 siblings, 1 reply; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 19:32 UTC (permalink / raw)
  To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 07:14:03PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote:
> > > 
> > > We can't allocate a huge kiobuf structure just for requesting one page of
> > > IO.  It might get better with VM-level IO clustering though.
> > 
> > A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> > allocate a buffer_head per block for all IO.
> 
> A kiobuf is 124 bytes,

... the vast majority of which is room for the page vector to expand
without having to be copied.  You don't touch that in the normal case.

> a buffer_head 96.  And a buffer_head is additionally
> used for caching data, a kiobuf not.

Buffer_heads are _sometimes_ used for caching data.  That's one of the
big problems with them, they are too overloaded, being both IO
descriptors _and_ cache descriptors.  If you've got 128k of data to
write out from user space, do you want to set up one kiobuf or 256
buffer_heads?  Buffer_heads become really very heavy indeed once you
start doing non-trivial IO.

> > What is so heavyweight in the current kiobuf (other than the embedded
> > vector, which I've already noted I'm willing to cut)?
> 
> array_len

kiobufs can be reused after IO.  You can depopulate a kiobuf,
repopulate it with new pages and submit new IO without having to
deallocate the kiobuf.  You can't do this without knowing how big the
data vector is.  Removing that functionality will prevent reuse,
making them _more_ heavyweight.

> io_count,

Right now we can take a kiobuf and turn it into a bunch of
buffer_heads for IO.  The io_count lets us track all of those sub-IOs
so that we know when all submitted IO has completed, so that we can
pass the completion callback back up the chain without having to
allocate yet more descriptor structs for the IO.

Again, remove this and the IO becomes more heavyweight because we need
to create a separate struct for the info.
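
Schematically, each per-buffer_head completion just does something like
this (simplified, and the names are approximate):

	static void end_kiobuf_bh(struct buffer_head *bh, int uptodate)
	{
		struct kiobuf *iobuf = bh->b_private;

		if (!uptodate)
			iobuf->errno = -EIO;
		if (atomic_dec_and_test(&iobuf->io_count))
			iobuf->end_io(iobuf);   /* last sub-IO completes the kiobuf */
	}

so the only per-sub-IO state needed is the buffer_head we had to allocate
anyway.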

> the presence of wait_queue AND end_io,

That's fine, I'm happy scrapping the wait queue: people can always use
the kiobuf private data field to refer to a wait queue if they want
to.

> and the lack of
> scatter gather in one kiobuf struct (you always need an array)

Again, _all_ data being sent down through the block device layer is
either in buffer heads or is page aligned.  You want us to triple the
size of the "heavyweight" kiobuf's data vector for what gain, exactly?
Obviously, extra code will be needed to scan kiobufs if we do that,
and unless we have both per-page _and_ per-kiobuf start/offset pairs
(adding even further to the complexity), those scatter-gather lists
would prevent us from carving up a kiobuf into smaller sub-ios without
copying the whole (expanded) vector.

That's a _lot_ of extra complexity in the disk IO layers.

I'm all for a fast kiobuf_to_sglist converter.  But I haven't seen any
evidence that such scatter-gather lists will do anything in the block
device case except complicate the code and decrease performance.

> S.th. like:
...
> makes it a lot simpler for the subsytems to integrate.

Possibly, but I remain to be convinced, because you may end up with a
mechanism which is generic but is not well-tuned for any specific
case, so everything goes slower.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:57               ` Alan Cox
@ 2001-02-01 19:00                 ` Christoph Hellwig
  0 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 19:00 UTC (permalink / raw)
  To: Alan Cox; +Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 06:57:41PM +0000, Alan Cox wrote:
> Not for raw I/O. Although for the drivers that can't cope then going via
> the page cache is certainly the next best alternative

True - but raw-io has its own alignment issues anyway.

> Yes. You also need a way to describe it in terms of page * in order to do
> mm locking for raw I/O (like the video capture stuff wants)

Right. (That's why we have the struct page * always as part of the structure)

> Certainly having the lightweight one a subset of the heavyweight one is a good
> target. 

Yes, I'm trying to address that...

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:32               ` Rik van Riel
@ 2001-02-01 18:59                 ` yodaiken
  0 siblings, 0 replies; 76+ messages in thread
From: yodaiken @ 2001-02-01 18:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, Steve Lord,
	linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 04:32:48PM -0200, Rik van Riel wrote:
> On Thu, 1 Feb 2001, Alan Cox wrote:
> 
> > > Sure.  But Linus saying that he doesn't want more of that (shit, crap,
> > > I don't remember what he said exactly) in the kernel is a very good reason
> > > for thinking a little more about it.
> > 
> > No. Linus is not a God, Linus is fallible, regularly makes mistakes and
> > frequently opens his mouth and says stupid things when he is far too busy.
> 
> People may remember Linus saying a resolute no to SMP
> support in Linux ;)

And perhaps he was right!

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:48             ` Christoph Hellwig
@ 2001-02-01 18:57               ` Alan Cox
  2001-02-01 19:00                 ` Christoph Hellwig
  0 siblings, 1 reply; 76+ messages in thread
From: Alan Cox @ 2001-02-01 18:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, bsuparna,
	linux-kernel, kiobuf-io-devel

> It doesn't really matter that much, because we write to the pagecache
> first anyway.

Not for raw I/O. Although for the drivers that can't cope then going via
the page cache is certainly the next best alternative

> The real thing is that we want to have some common data structure for
> describing physical memory used for IO.  We could either use special

Yes. You also need a way to describe it in terms of page * in order to do
mm locking for raw I/O (like the video capture stuff wants)

> by Larry McVoy's splice paper) should allow just that, nothing more and
> nothing less.  For use in disk-io and networking or v4l there are probably
> other primary data structures needed, and that's ok.

Certainly having the lightweight one a subset of the heavyweight one is a good
target. 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:34         ` Alan Cox
  2001-02-01 17:49           ` Stephen C. Tweedie
  2001-02-01 17:49           ` Christoph Hellwig
@ 2001-02-01 18:51           ` bcrl
  2 siblings, 0 replies; 76+ messages in thread
From: bcrl @ 2001-02-01 18:51 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel

On Thu, 1 Feb 2001, Alan Cox wrote:

> Linus' list of reasons, like the amount of state, is more interesting

The state is required, not optional, if we are to have a decent basis for
building asynchronous io into the kernel.

> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse not better

I'm still not seeing what I consider valid arguments from the networking
people regarding the use of kiobufs as the interface they present to the
VFS for asynchronous/bulk io.  I agree with their need for a lightweight
mechanism for getting small io requests from userland, and even the need
for using lightweight scatter gather lists within the network layer
itself.  If the statement is that map_user_kiobuf is too heavy for use on
every single io, sure.  But that is a separate issue.

		-ben


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:25           ` Alan Cox
  2001-02-01 18:39             ` Rik van Riel
@ 2001-02-01 18:48             ` Christoph Hellwig
  2001-02-01 18:57               ` Alan Cox
  1 sibling, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 18:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel,
	kiobuf-io-devel

On Thu, Feb 01, 2001 at 06:25:16PM +0000, Alan Cox wrote:
> > array_len, io_count, the presence of wait_queue AND end_io, and the lack of
> > scatter gather in one kiobuf struct (you always need an array), and AFAICS
> > that is what the networking guys dislike.
> 
> You need a completion pointer. Its arguable whether you want the wait_queue
> in the default structure or as part of whatever its contained in and handled
> by the completion pointer.

I personally think that Ben's function-pointer-on-wakeup work is the
alternative in this area.

> And I've actually bothered to talk to the networking people and they dont have
> a problem with the completion pointer.

I have never said that they don't like it - but having both the waitqueue and the
completion handler in the kiobuf makes it bigger.

> > Now one could say: just let the networkers use their own kind of buffers
> (and that's exactly what is done in the zerocopy patches), but that again leads
> > to inefficient buffer passing and ungeneric IO handling.
> 
> Careful.  This is the line of reasoning which also says
> 
> Aeroplanes are good for travelling long distances
> Cars are better for getting to my front door
> Therefore everyone should drive a 747 home

Hehe ;)

> It is quite possible that the right thing to do is to do conversions in the
> cases it happens.

Yes, this would be THE alternative to my suggestion.

> That might seem a good reason for having offset/length
> pairs on each block, because streaming from the network to disk you may well
> get a collection of partial pages of data you need to write to disk. 
> Unfortunately the reality of DMA support on almost (but not quite) all
> disk controllers is that you don't get that degree of scatter gather.
> 
> My I2O controllers and I think the fusion controllers could indeed benefit
> and cope with being given a pile of randomly located 1480 byte chunks of 
> data and being asked to put them on disk.

It doesn't really matter that much, because we write to the pagecache
first anyway.

The real thing is that we want to have some common data structure for
describing physical memory used for IO.  We could either use special
structures in every subsystem and then copy between them or pass
struct page * and lose meta information.  Or we could try to find a
structure that holds enough information to make passing it from one
subsystem to another useful.  The cut-down kio design (heavily inspired
by Larry McVoy's splice paper) should allow just that, nothing more and
nothing less.  For use in disk-io and networking or v4l there are probably
other primary data structures needed, and that's ok.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:25           ` Alan Cox
@ 2001-02-01 18:39             ` Rik van Riel
  2001-02-01 18:48             ` Christoph Hellwig
  1 sibling, 0 replies; 76+ messages in thread
From: Rik van Riel @ 2001-02-01 18:39 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, bsuparna, linux-kernel,
	kiobuf-io-devel

On Thu, 1 Feb 2001, Alan Cox wrote:

> > Now one could say: just let the networkers use their own kind of buffers
> > (and that's exactly what is done in the zerocopy patches), but that again leds
> > to inefficient buffer passing and ungeneric IO handling.

	[snip]
> It is quite possible that the right thing to do is to do
> conversions in the cases it happens.

OTOH, somehow a zero-copy system which converts the zero-copy
metadata every time the buffer is handed to another subsystem
just doesn't sound right ...

(well, maybe it _is_, but it looks quite inefficient at first
glance)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:58             ` Alan Cox
@ 2001-02-01 18:32               ` Rik van Riel
  2001-02-01 18:59                 ` yodaiken
  0 siblings, 1 reply; 76+ messages in thread
From: Rik van Riel @ 2001-02-01 18:32 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel

On Thu, 1 Feb 2001, Alan Cox wrote:

> > Sure.  But Linus saying that he doesn't want more of that (shit, crap,
> > I don't remember what he said exactly) in the kernel is a very good reason
> > for thinking a little more about it.
> 
> No. Linus is not a God, Linus is fallible, regularly makes mistakes and
> frequently opens his mouth and says stupid things when he is far too busy.

People may remember Linus saying a resolute no to SMP
support in Linux ;)

In my experience, when Linus says "NO" to a certain
idea, he's usually objecting to bad design decisions
in the proposed implementation of the idea and the
lack of a nice alternative solution ...

... but as soon as a clean, efficient and maintainable
alternative to the original bad idea surfaces, it seems
to be quite easy to convince Linus to include it.

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 18:14         ` Christoph Hellwig
@ 2001-02-01 18:25           ` Alan Cox
  2001-02-01 18:39             ` Rik van Riel
  2001-02-01 18:48             ` Christoph Hellwig
  2001-02-01 19:32           ` Stephen C. Tweedie
  2001-02-02  4:18           ` bcrl
  2 siblings, 2 replies; 76+ messages in thread
From: Alan Cox @ 2001-02-01 18:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

> array_len, io_count, the presence of wait_queue AND end_io, and the lack of
> scatter gather in one kiobuf struct (you always need an array), and AFAICS
> that is what the networking guys dislike.

You need a completion pointer. Its arguable whether you want the wait_queue
in the default structure or as part of whatever its contained in and handled
by the completion pointer.

And I've actually bothered to talk to the networking people and they dont have
a problem with the completion pointer.

> Now one could say: just let the networkers use their own kind of buffers
> (and that's exactly what is done in the zerocopy patches), but that again leads
> to inefficient buffer passing and ungeneric IO handling.

Careful.  This is the line of reasoning which also says

Aeroplanes are good for travelling long distances
Cars are better for getting to my front door
Therefore everyone should drive a 747 home

It is quite possible that the right thing to do is to do conversions in the
cases it happens. That might seem a good reason for having offset/length
pairs on each block, because streaming from the network to disk you may well
get a collection of partial pages of data you need to write to disk. 
Unfortunately the reality of DMA support on almost (but not quite) all
disk controllers is that you don't get that degree of scatter gather.

My I2O controllers and I think the fusion controllers could indeed benefit
and cope with being given a pile of randomly located 1480 byte chunks of 
data and being asked to put them on disk.

I do seriously doubt there are any real world situations where this is useful.





* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:41       ` Stephen C. Tweedie
@ 2001-02-01 18:14         ` Christoph Hellwig
  2001-02-01 18:25           ` Alan Cox
                             ` (2 more replies)
  2001-02-01 20:04         ` Chaitanya Tumuluri
  1 sibling, 3 replies; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 18:14 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > > 
> > > > No, and with the current kiobufs it would not make sense, because they
> > > > are too heavy-weight.
> > > 
> > > Really?  In what way?  
> > 
> > We can't allocate a huge kiobuf structure just for requesting one page of
> > IO.  It might get better with VM-level IO clustering though.
> 
> A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> allocate a buffer_head per block for all IO.

A kiobuf is 124 bytes, a buffer_head 96.  And a buffer_head is additionally
used for caching data, whereas a kiobuf is not.

> 
> A kiobuf contains enough embedded page vector space for 16 pages by
> default, but I'm happy enough to remove that if people want.  However,
> note that that memory is not initialised, so there is no memory access
> cost at all for that empty space.  Remove that space and instead of
> one memory allocation per kiobuf, you get two, so the cost goes *UP*
> for small IOs.

You could still embed it into a surrounding structure, even if there are cases
where an additional memory allocation is needed, yes.

> 
> > > > With page,length,offset iobufs this makes sense
> > > > and is IMHO the way to go.
> > > 
> > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > > lean enough to do the job??
> > 
> > No.  I was speaking about the light-weight kiobuf Linux & Me discussed on
> > lkml some time ago (though I'd much more like to call it kiovec analogous
> > to BSD iovecs).
> 
> What is so heavyweight in the current kiobuf (other than the embedded
> vector, which I've already noted I'm willing to cut)?

array_len, io_count, the presence of wait_queue AND end_io, and the lack of
scatter gather in one kiobuf struct (you always need an array), and AFAICS
that is what the networking guys dislike.

They often just want multiple buffers in one physical page, and an array of
those.

Now one could say: just let the networkers use their own kind of buffers
(and that's exactly what is done in the zerocopy patches), but that again leads
to inefficient buffer passing and ungeneric IO handling.

Something like:

struct kiovec {
	struct page *           kv_page;        /* physical page        */
	u_short                 kv_offset;      /* offset into page     */
	u_short                 kv_length;      /* data length          */
};
			 
enum kio_flags {
	KIO_LOANED,     /* the calling subsystem wants this buf back    */
	KIO_GIFTED,     /* thanks for the buffer, man!                  */
	KIO_COW         /* copy on write (XXX: not yet)                 */
};


struct kio {
	struct kiovec *         kio_data;       /* our kiovecs          */
	int                     kio_ndata;      /* # of kiovecs         */
	int                     kio_flags;      /* loaned or gifted?    */
	void *                  kio_priv;       /* caller private data  */
	wait_queue_head_t       kio_wait;	/* wait queue           */
};

makes it a lot simpler for the subsystems to integrate.
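
A rough usage sketch (illustrative only - the helper below is made up and
just builds on the structs above) of filling one of these in for two small
fragments that share a single physical page:

 /* Sketch: describe two sub-page fragments of one page as a single kio.
  * Error handling beyond the allocation check is omitted. */
 static struct kio *build_two_frag_kio(struct page *page,
                                       u_short off1, u_short len1,
                                       u_short off2, u_short len2)
 {
     struct kio *kio;

     kio = kmalloc(sizeof(*kio) + 2 * sizeof(struct kiovec), GFP_KERNEL);
     if (!kio)
         return NULL;

     kio->kio_data  = (struct kiovec *)(kio + 1); /* vectors follow header */
     kio->kio_ndata = 2;
     kio->kio_flags = KIO_LOANED;                 /* we want the bufs back */
     kio->kio_priv  = NULL;
     init_waitqueue_head(&kio->kio_wait);

     kio->kio_data[0].kv_page   = page;
     kio->kio_data[0].kv_offset = off1;
     kio->kio_data[0].kv_length = len1;
     kio->kio_data[1].kv_page   = page;
     kio->kio_data[1].kv_offset = off2;
     kio->kio_data[1].kv_length = len2;
     return kio;
 }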

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:49           ` Christoph Hellwig
@ 2001-02-01 17:58             ` Alan Cox
  2001-02-01 18:32               ` Rik van Riel
  2001-02-01 19:33             ` Stephen C. Tweedie
  1 sibling, 1 reply; 76+ messages in thread
From: Alan Cox @ 2001-02-01 17:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Cox, Christoph Hellwig, Stephen C. Tweedie, Steve Lord,
	linux-kernel, kiobuf-io-devel

> > Linus basically designed the original kiobuf scheme of course so I guess
> > he's allowed to dislike it. Linus disliking something however doesn't mean
> > it's wrong. It's not a technically valid basis for argument.
> 
> Sure.  But Linus saying that he doesn't want more of that (shit, crap,
> I don't remember what he said exactly) in the kernel is a very good reason
> for thinking a little more about it.

No. Linus is not a God, Linus is fallible, regularly makes mistakes and
frequently opens his mouth and says stupid things when he is far too busy.

> Especially if most arguments look right to one after thinking more about
> it...

I agree with the issues about networking wanting lightweight objects; I'm
unconvinced, however, that the existing setup for networking is sanely
applicable to real world applications in other spaces.

Take video capture. I want to stream 60Mbytes/second in multi-megabyte
chunks between my capture cards and a high end raid array. The array wants
1Mbyte or larger blocks per I/O to reach 60Mbytes/second performance.
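
(To put rough numbers on that: 60 Mbytes/second in 1 Mbyte requests is only
about 60 I/Os per second to set up and complete, while the same stream
chopped into 1480-byte network-sized fragments would be over 40,000 of them
per second.)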

This btw isn't benchmark crap like most of the zero copy networking; this is
a real world application.

The current buffer head stuff is already heavier than the kio stuff. The
networking stuff isn't oriented to that kind of I/O and would end up
needing to do tons of extra processing.

> For disk I/O it makes the handling a little easier for the cost of the
> additional offset/length fields.

I remain to be convinced by that. However you do get 64 bytes/cacheline on
a real processor nowadays, so if you touch any of that 64-byte block,
filling the rest is practically zero cost.

Alan


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:34         ` Alan Cox
  2001-02-01 17:49           ` Stephen C. Tweedie
@ 2001-02-01 17:49           ` Christoph Hellwig
  2001-02-01 17:58             ` Alan Cox
  2001-02-01 19:33             ` Stephen C. Tweedie
  2001-02-01 18:51           ` bcrl
  2 siblings, 2 replies; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 17:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel

On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > > I'm in the middle of some parts of it, and am actively soliciting
> > > feedback on what cleanups are required.  
> > 
> > The real issue is that Linus dislikes the current kiobuf scheme.
> > I do not like everything he proposes, but lots of things makes sense.
> 
> Linus basically designed the original kiobuf scheme of course so I guess
> he's allowed to dislike it. Linus disliking something however doesn't mean
> it's wrong. It's not a technically valid basis for argument.

Sure.  But Linus saying that he doesn't want more of that (shit, crap,
I don't remember what he said exactly) in the kernel is a very good reason
for thinking a little more about it.

Especially if most arguments look right to one after thinking more about
it...

> Linus list of reasons like the amount of state are more interesting

True.  The argument that they are too heavyweight also.
That they should allow scatter gather without an array of structs also.


> > > So, what are the benefits in the disk IO stack of adding length/offset
> > > pairs to each page of the kiobuf?
> > 
> > I don't see any real advantage for disk IO.  The real advantage is that
> > we can have a generic structure that is also usefull in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> 
> Networking wants something lighter rather than heavier.

Right.  That's what the new design was about, besides adding an offset and
length to every page instead of the page array, something also wanted by
the networking in the first place.
Look at the skb_frag struct in the zero-copy patch for what networking
thinks it needs for physical page based buffers.

> Adding tons of base/limit pairs to kiobufs makes it worse not better

From looking at the networking code and listening to Dave and Ingo it looks
like it makes the thing better for networking, although I cannot verify
this due to my lack of familiarity with the networking code.

For disk I/O it makes the handling a little easier for the cost of the
additional offset/length fields.

	Christoph

P.S. The tuple thing is also what Larry had in his initial splice paper.
-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:34         ` Alan Cox
@ 2001-02-01 17:49           ` Stephen C. Tweedie
  2001-02-01 17:09             ` Chaitanya Tumuluri
  2001-02-01 20:33             ` Christoph Hellwig
  2001-02-01 17:49           ` Christoph Hellwig
  2001-02-01 18:51           ` bcrl
  2 siblings, 2 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 17:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Christoph Hellwig, Stephen C. Tweedie, Steve Lord, linux-kernel,
	kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > 
> > I don't see any real advantage for disk IO.  The real advantage is that
> > we can have a generic structure that is also useful in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> 
> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse not better

Networking has fundamentally different requirements.  In a network
stack, you want the ability to add fragments to unaligned chunks of
data to represent headers at any point in the stack.

In the disk IO case, you basically don't get that (the only thing
which comes close is raid5 parity blocks).  The data which the user
started with is the data sent out on the wire.  You do get some
interesting cases such as soft raid and LVM, or even in the scsi stack
if you run out of mailbox space, where you need to send only a
sub-chunk of the input buffer.  

In that case, having offset/len as the kiobuf limit markers is ideal:
you can clone a kiobuf header using the same page vector as the
parent, narrow down the start/end points, and continue down the stack
without having to copy any part of the page list.  If you had the
offset/len data encoded implicitly into each entry in the sglist, you
would not be able to do that.
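
A minimal sketch of that cloning step (illustrative only; it assumes the
kiobuf keeps its maplist/offset/length fields, and the helper name is
invented):

 /* Sketch: a clone shares the parent's page vector and only narrows the
  * offset/length window it covers; nothing in the page list is copied. */
 static void kiobuf_clone_narrow(struct kiobuf *clone, struct kiobuf *parent,
                                 int offset, int length)
 {
     clone->maplist  = parent->maplist;          /* shared page vector   */
     clone->nr_pages = parent->nr_pages;
     clone->offset   = parent->offset + offset;  /* narrowed start point */
     clone->length   = length;                   /* narrowed extent      */
     /* completion/status fields of the clone are set up separately */
 }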

--Stephen


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:05     ` Christoph Hellwig
  2001-02-01 17:09       ` Christoph Hellwig
@ 2001-02-01 17:41       ` Stephen C. Tweedie
  2001-02-01 18:14         ` Christoph Hellwig
  2001-02-01 20:04         ` Chaitanya Tumuluri
  1 sibling, 2 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 17:41 UTC (permalink / raw)
  To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > 
> > > No, and with the current kiobufs it would not make sense, because they
> > > are too heavy-weight.
> > 
> > Really?  In what way?  
> 
> We can't allocate a huge kiobuf structure just for requesting one page of
> IO.  It might get better with VM-level IO clustering though.

A kiobuf is *much* smaller than, say, a buffer_head, and we currently
allocate a buffer_head per block for all IO.

A kiobuf contains enough embedded page vector space for 16 pages by
default, but I'm happy enough to remove that if people want.  However,
note that that memory is not initialised, so there is no memory access
cost at all for that empty space.  Remove that space and instead of
one memory allocation per kiobuf, you get two, so the cost goes *UP*
for small IOs.

> > > With page,length,offset iobufs this makes sense
> > > and is IMHO the way to go.
> > 
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > lean enough to do the job??
> 
> No.  I was speaking about the light-weight kiobuf Linux & Me discussed on
> lkml some time ago (though I'd much more like to call it kiovec analogous
> to BSD iovecs).

What is so heavyweight in the current kiobuf (other than the embedded
vector, which I've already noted I'm willing to cut)?

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:02       ` Christoph Hellwig
@ 2001-02-01 17:34         ` Alan Cox
  2001-02-01 17:49           ` Stephen C. Tweedie
                             ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Alan Cox @ 2001-02-01 17:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen C. Tweedie, Steve Lord, linux-kernel, kiobuf-io-devel

> > I'm in the middle of some parts of it, and am actively soliciting
> > feedback on what cleanups are required.  
> 
> The real issue is that Linus dislikes the current kiobuf scheme.
> I do not like everything he proposes, but lots of things makes sense.

Linus basically designed the original kiobuf scheme of course so I guess
he's allowed to dislike it. Linus disliking something however doesn't mean
it's wrong. It's not a technically valid basis for argument.

Linus' list of reasons, like the amount of state, is more interesting.

> > So, what are the benefits in the disk IO stack of adding length/offset
> > pairs to each page of the kiobuf?
> 
> I don't see any real advantage for disk IO.  The real advantage is that
> we can have a generic structure that is also useful in e.g. networking
> and can lead to a unified IO buffering scheme (a little like IO-Lite).

Networking wants something lighter rather than heavier. Adding tons of
base/limit pairs to kiobufs makes it worse not better


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:05     ` Christoph Hellwig
@ 2001-02-01 17:09       ` Christoph Hellwig
  2001-02-01 17:41       ` Stephen C. Tweedie
  1 sibling, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 17:09 UTC (permalink / raw)
  To: Stephen C. Tweedie, bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > lean enough to do the job??
> 
> No.  I was speaking abou the light-weight kiobuf Linux & Me discussed on
						   ^^^^^ Linus ...
> lkml some time ago (though I'd much more like to call it kiovec analogous
> to BSD iovecs).
> 
> And a page,offset,length tuple is pretty cheap compared to a current kiobuf.

	Christoph (slapping himself for the stupid typo and selfreply ...)

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 17:49           ` Stephen C. Tweedie
@ 2001-02-01 17:09             ` Chaitanya Tumuluri
  2001-02-01 20:33             ` Christoph Hellwig
  1 sibling, 0 replies; 76+ messages in thread
From: Chaitanya Tumuluri @ 2001-02-01 17:09 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, Christoph Hellwig, Steve Lord, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > > 
> > > I don't see any real advantage for disk IO.  The real advantage is that
> > > we can have a generic structure that is also useful in e.g. networking
> > > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> > 
> > Networking wants something lighter rather than heavier. Adding tons of
> > base/limit pairs to kiobufs makes it worse not better
> 
> Networking has fundamentally different requirements.  In a network
> stack, you want the ability to add fragments to unaligned chunks of
> data to represent headers at any point in the stack.
> 
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks).  The data which the user
> started with is the data sent out on the wire.  You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer.  

Or the case of BSD-style UIO implementing the readv() and writev() calls.
This may or may not align perfectly, so address-length lists per page could
be helpful.

I did try an implementation of this for rawio and found that I had to
restrict the a-len lists coming in via the user iovecs to be aligned.

> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list.  If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

This would solve the issue with UIO, yes. Also, I think Martin Peterson
(mkp) had taken a stab at doing "clone-kiobufs" for LVM at some point.

Martin?

Cheers,
-Chait.


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 16:16   ` Stephen C. Tweedie
@ 2001-02-01 17:05     ` Christoph Hellwig
  2001-02-01 17:09       ` Christoph Hellwig
  2001-02-01 17:41       ` Stephen C. Tweedie
  0 siblings, 2 replies; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 17:05 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: bsuparna, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> > > 
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > > page *  . That's not the case right now, is it ?
> > 
> > No, and with the current kiobufs it would not make sense, because they
> > are too heavy-weight.
> 
> Really?  In what way?  

We can't allocate a huge kiobuf structure just for requesting one page of
IO.  It might get better with VM-level IO clustering though.

> 
> > With page,length,offset iobufs this makes sense
> > and is IMHO the way to go.
> 
> What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> lean enough to do the job??

No.  I was speaking about the light-weight kiobuf Linux & Me discussed on
lkml some time ago (though I'd much more like to call it kiovec analogous
to BSD iovecs).

And a page,offset,length tuple is pretty cheap compared to a current kiobuf.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 16:49     ` Stephen C. Tweedie
@ 2001-02-01 17:02       ` Christoph Hellwig
  2001-02-01 17:34         ` Alan Cox
  0 siblings, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 17:02 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Steve Lord, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 04:49:58PM +0000, Stephen C. Tweedie wrote:
> > Enquiring minds would like to know if you are working towards this 
> > revamp of the kiobuf structure at the moment, you have been very quiet
> > recently. 
> 
> I'm in the middle of some parts of it, and am actively soliciting
> feedback on what cleanups are required.  

The real issue is that Linus dislikes the current kiobuf scheme.
I do not like everything he proposes, but lots of things make sense.

> I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and
> have started doing some of the cleanups needed --- removing the
> embedded page vector, and adding support for lightweight stacking of
> kiobufs for completion callback chains.

Ok, great.

> However, filesystem IO is almost *always* page aligned: O_DIRECT IO
> comes from VM pages, and internal filesystem IO comes from page cache
> pages.  Buffer cache IOs are the only exception, and kiobufs only fail
> for such IOs once you have multiple buffer_heads being merged into
> single requests.
> 
> So, what are the benefits in the disk IO stack of adding length/offset
> pairs to each page of the kiobuf?

I don't see any real advantage for disk IO.  The real advantage is that
we can have a generic structure that is also useful in e.g. networking
and can lead to a unified IO buffering scheme (a little like IO-Lite).

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 16:08   ` Steve Lord
@ 2001-02-01 16:49     ` Stephen C. Tweedie
  2001-02-01 17:02       ` Christoph Hellwig
  0 siblings, 1 reply; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 16:49 UTC (permalink / raw)
  To: Steve Lord; +Cc: hch, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 10:08:45AM -0600, Steve Lord wrote:
> Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> > > 
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > > page *  . That's not the case right now, is it ?
> > 
> > No, and with the current kiobufs it would not make sense, because they
> > are too heavy-weight.  With page,length,offset iobufs this makes sense
> > and is IMHO the way to go.
> 
> Enquiring minds would like to know if you are working towards this 
> revamp of the kiobuf structure at the moment, you have been very quiet
> recently. 

I'm in the middle of some parts of it, and am actively soliciting
feedback on what cleanups are required.  

I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and
have started doing some of the cleanups needed --- removing the
embedded page vector, and adding support for lightweight stacking of
kiobufs for completion callback chains.

However, filesystem IO is almost *always* page aligned: O_DIRECT IO
comes from VM pages, and internal filesystem IO comes from page cache
pages.  Buffer cache IOs are the only exception, and kiobufs only fail
for such IOs once you have multiple buffer_heads being merged into
single requests.

So, what are the benefits in the disk IO stack of adding length/offset
pairs to each page of the kiobuf?  Basically, the only advantage right
now is that it would allow us to merge requests together without
having to chain separate kiobufs.  However, chaining kiobufs in this
case is actually much better than merging them if the original IOs
came in as kiobufs: merging kiobufs requires us to reallocate a new,
longer (page/offset/len) vector, whereas chaining kiobufs is just a
list operation.
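
As a sketch of the difference (the chain structure below is hypothetical,
purely to illustrate the point - only the list primitives are real):

 /* Hypothetical chaining: each kiobuf is linked onto a chain (an
  * initialised list_head) instead of being merged into a longer vector. */
 struct kiobuf_link {
     struct list_head    list;
     struct kiobuf       *iobuf;
 };

 static inline void chain_kiobuf(struct list_head *chain,
                                 struct kiobuf_link *link,
                                 struct kiobuf *iobuf)
 {
     link->iobuf = iobuf;
     list_add_tail(&link->list, chain);   /* O(1), no vector reallocation */
 }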

Having true scatter-gather lists in the kiobuf would let us represent
arbitrary lists of buffer_heads as a single kiobuf, though, and that
_is_ a big win if we can avoid using buffer_heads below the
ll_rw_block layer at all.  (It's not clear that this is really
possible, though, since we still need to propagate completion
information back up into each individual buffer head's status and wait
queue.)

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 12:19 ` Stephen C. Tweedie
@ 2001-02-01 16:30   ` Chaitanya Tumuluri
  0 siblings, 0 replies; 76+ messages in thread
From: Chaitanya Tumuluri @ 2001-02-01 16:30 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: bsuparna, lord, linux-kernel, kiobuf-io-devel

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
> 
> On Thu, Feb 01, 2001 at 10:25:22AM +0530, bsuparna@in.ibm.com wrote:
> > 
> > Being able to track the children of a kiobuf would help with I/O
> > cancellation (e.g. to pull sub-ios off their request queues if I/O
> > cancellation for the parent kiobuf was issued). Not essential, I guess, in
> > general, but useful in some situations.
> 
> What exactly is the justification for IO cancellation?  It really
> upsets the normal flow of control through the IO stack to have
> voluntary cancellation semantics.
> 
XFS does something called a "forced shutdown" of the filesystem in which
it requires outstanding I/Os issued against file data to be cancelled. 
This is triggered by (among other things) errors in writing out file 
metadata. I'm cc'ing Steve Lord so he can provide more information.

Of course, I was thinking along the lines of an API flushing the requests
out of the elevator at that time .... didn't get too far with it though.

Cheers,
-Chait.


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 15:09 ` Christoph Hellwig
  2001-02-01 16:08   ` Steve Lord
@ 2001-02-01 16:16   ` Stephen C. Tweedie
  2001-02-01 17:05     ` Christoph Hellwig
  1 sibling, 1 reply; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 16:16 UTC (permalink / raw)
  To: bsuparna, Stephen C. Tweedie, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> > 
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page *  . That's not the case right now, is it ?
> 
> No, and with the current kiobufs it would not make sense, because they
> are too heavy-weight.

Really?  In what way?  

> With page,length,offset iobufs this makes sense
> and is IMHO the way to go.

What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
lean enough to do the job??

Cheers,
 Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 15:09 ` Christoph Hellwig
@ 2001-02-01 16:08   ` Steve Lord
  2001-02-01 16:49     ` Stephen C. Tweedie
  2001-02-01 16:16   ` Stephen C. Tweedie
  1 sibling, 1 reply; 76+ messages in thread
From: Steve Lord @ 2001-02-01 16:08 UTC (permalink / raw)
  To: hch, linux-kernel, kiobuf-io-devel

Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> > 
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page *  . That's not the case right now, is it ?
> 
> No, and with the current kiobufs it would not make sense, because they
> are too heavy-weight.  With page,length,offset iobufs this makes sense
> and is IMHO the way to go.
> 
> 	Christoph
> 

Enquiring minds would like to know if you are working towards this
revamp of the kiobuf structure at the moment; you have been very quiet
recently.

Steve



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01 14:44 bsuparna
@ 2001-02-01 15:09 ` Christoph Hellwig
  2001-02-01 16:08   ` Steve Lord
  2001-02-01 16:16   ` Stephen C. Tweedie
  0 siblings, 2 replies; 76+ messages in thread
From: Christoph Hellwig @ 2001-02-01 15:09 UTC (permalink / raw)
  To: bsuparna; +Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel

On Thu, Feb 01, 2001 at 08:14:58PM +0530, bsuparna@in.ibm.com wrote:
> 
> >Hi,
> >
> >On Thu, Feb 01, 2001 at 10:25:22AM +0530, bsuparna@in.ibm.com wrote:
> >>
> >> >We _do_ need the ability to stack completion events, but as far as the
> >> >kiobuf work goes, my current thoughts are to do that by stacking
> >> >lightweight "clone" kiobufs.
> >>
> >> Would that work with stackable filesystems ?
> >
> >Only if the filesystems were using VFS interfaces which used kiobufs.
> >Right now, the only filesystem using kiobufs is XFS, and it only
> >passes them down to the block device layer, not to other filesystems.
> 
> That would require the vfs interfaces themselves (address space
> readpage/writepage ops) to take kiobufs as arguments, instead of struct
> page *  . That's not the case right now, is it ?

No, and with the current kiobufs it would not make sense, because they
are too heavy-weight.  With page,length,offset iobufs this makes sense
and is IMHO the way to go.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait  /notify + callback chains
@ 2001-02-01 14:44 bsuparna
  2001-02-01 15:09 ` Christoph Hellwig
  0 siblings, 1 reply; 76+ messages in thread
From: bsuparna @ 2001-02-01 14:44 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel, kiobuf-io-devel


>Hi,
>
>On Thu, Feb 01, 2001 at 10:25:22AM +0530, bsuparna@in.ibm.com wrote:
>>
>> >We _do_ need the ability to stack completion events, but as far as the
>> >kiobuf work goes, my current thoughts are to do that by stacking
>> >lightweight "clone" kiobufs.
>>
>> Would that work with stackable filesystems ?
>
>Only if the filesystems were using VFS interfaces which used kiobufs.
>Right now, the only filesystem using kiobufs is XFS, and it only
>passes them down to the block device layer, not to other filesystems.

That would require the vfs interfaces themselves (address space
readpage/writepage ops) to take kiobufs as arguments, instead of struct
page *  . That's not the case right now, is it ?
A filter filesystem would be layered over XFS to take this example.
So right now a filter filesystem only sees the struct page * and passes
this along. Any completion event stacking has to be applied with reference
to this.


>> Being able to track the children of a kiobuf would help with I/O
>> cancellation (e.g. to pull sub-ios off their request queues if I/O
>> cancellation for the parent kiobuf was issued). Not essential, I guess,
in
>> general, but useful in some situations.
>
>What exactly is the justification for IO cancellation?  It really
>upsets the normal flow of control through the IO stack to have
>voluntary cancellation semantics.

One reason that I saw is that if the results of an i/o are no longer
required due to some condition (e.g. aio cancellation, or the process that
issued the i/o getting killed), then cancellation avoids the unnecessary
disk i/o, if the request hadn't been scheduled as yet.

Too remote a requirement ? If the capability/support doesn't exist at the
driver level I guess it's difficult.

--Stephen


* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait  /notify + callback chains
@ 2001-02-01 13:20 bsuparna
  0 siblings, 0 replies; 76+ messages in thread
From: bsuparna @ 2001-02-01 13:20 UTC (permalink / raw)
  To: mjacob, dank; +Cc: linux-kernel, kiobuf-io-devel


sct wrote:
>> >
>> > Thanks for mentioning this. I didn't know about it earlier. I've been
>> > going through the 4/00 kqueue patch on freebsd ...
>>
>> Linus has already denounced them as massively over-engineered...
>
>That shouldn't stop anyone from looking at them and learning, though.
>There might be a good idea or two hiding in there somewhere.
>- Dan
>

There is always scope to learn from a different approach to a problem of a
similar nature - both from good ideas and from over-engineered ones -
sometimes more from the latter :-)

As far as I have understood so far from looking at the original kevent
patch and notes (which perhaps isn't enough, and maybe out of date as well),
the concept of knotes and filter ops, and the event queuing mechanism in
itself, is interesting and generic.  But most of it seems to have been
designed with linkage to user-mode issuable event waits in mind - like
poll/select/aio/signal etc - at least as it appears from the way it's been
used in the kernel.  That is a little different from what I had in mind,
though it's perhaps possible to use it otherwise.  But maybe I've just not
thought about it enough or understood it.

Regards
Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : bsuparna@in.ibm.com
  Phone : 91-80-5267117, Extn : 2525



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01  7:58 bsuparna
@ 2001-02-01 12:39 ` Stephen C. Tweedie
  0 siblings, 0 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 12:39 UTC (permalink / raw)
  To: bsuparna; +Cc: Stephen C. Tweedie, Ben LaHaise, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 01:28:33PM +0530, bsuparna@in.ibm.com wrote:
> 
> Here's a second pass attempt, based on Ben's wait queue extensions:
> Does this sound any better ?

It's a mechanism, all right, but you haven't described what problems
it is trying to solve, and where it is likely to be used, so it's hard
to judge it. :)

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
  2001-02-01  4:55 bsuparna
@ 2001-02-01 12:19 ` Stephen C. Tweedie
  2001-02-01 16:30   ` Chaitanya Tumuluri
  0 siblings, 1 reply; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-02-01 12:19 UTC (permalink / raw)
  To: bsuparna; +Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel

Hi,

On Thu, Feb 01, 2001 at 10:25:22AM +0530, bsuparna@in.ibm.com wrote:
> 
> >We _do_ need the ability to stack completion events, but as far as the
> >kiobuf work goes, my current thoughts are to do that by stacking
> >lightweight "clone" kiobufs.
> 
> Would that work with stackable filesystems ?

Only if the filesystems were using VFS interfaces which used kiobufs.
Right now, the only filesystem using kiobufs is XFS, and it only
passes them down to the block device layer, not to other filesystems.

> Being able to track the children of a kiobuf would help with I/O
> cancellation (e.g. to pull sub-ios off their request queues if I/O
> cancellation for the parent kiobuf was issued). Not essential, I guess, in
> general, but useful in some situations.

What exactly is the justification for IO cancellation?  It really
upsets the normal flow of control through the IO stack to have
voluntary cancellation semantics.

--Stephen

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait  /notify + callback chains
@ 2001-02-01  7:58 bsuparna
  2001-02-01 12:39 ` Stephen C. Tweedie
  0 siblings, 1 reply; 76+ messages in thread
From: bsuparna @ 2001-02-01  7:58 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Ben LaHaise, linux-kernel, kiobuf-io-devel


Here's a second pass attempt, based on Ben's wait queue extensions:
Does this sound any better ?

[This doesn't require any changes to the existing wait_queue_head based i/o
structures or to existing drivers, and the constructs mentioned come into
the picture only when compound events are actually required]

The key aspects are:
1.  Just using an extended wait queue now instead of the callbackq for
completion (this can take care of layered callbacks, and aggregation via
wakeup functions)
2. The io structures don't need to change - they already have a
wait_queue_head embedded anyway (e.g. b_wait); in fact io completion happens
simply by waking up the waiters in the wait queue, just as it happens now.
3. Instead, all correlation information is maintained in the wait_queue
entries that involve compound events
4. No cancel callback queue any more.

(a) For simple layered callbacks (as in encryption filesystems/drivers):
     Intermediate layers simply use add_wait_queue(_lifo) to add their
callbacks to the object's wait queue as wakeup functions. The wakeup
function can access fields in the object associated with the wait queue,
using the wait_queue_head address since the wait_queue_head is embedded in
the object.
     If the wakeup function has to be associated with any other private
data, then an embedding structure is required, e.g:
/* Layered event structure */
 struct lev {
     wait_queue_t        wait;
     void                *data;
 };

or, maybe something like the work_todo structure that Ben had stated as an
example (if callback actions have to be delayed to task context). Actually
in that case, we might like to have the wakeup function return 1 if it
needs to do some work later, and that work needs to be completed before the
remaining waiters are woken up.
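
A small sketch of (a), just to fix ideas (the func field name and the exact
wakeup function signature depend on Ben's patch and on the changes proposed
in my last posting, so treat these names as assumptions):

 /* Sketch: a filter layer hooks a wakeup function onto the object's queue */
 static int lev_wakeup(wait_queue_head_t *head, wait_queue_t *wait)
 {
     struct lev *l = (struct lev *) wait;   /* wait is the first member */

     /* post-process the completed object using l->data here, before
      * ordinary sleepers on the same queue get woken */
     return 0;                              /* keep walking the queue */
 }

 static void lev_hook(wait_queue_head_t *head, struct lev *l, void *data)
 {
     l->data      = data;
     l->wait.func = lev_wakeup;             /* field name assumed       */
     add_wait_queue_lifo(head, &l->wait);   /* proposed LIFO variant    */
 }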

(b) For compound events:

/* Compound event structure */
 struct cev_wait {
     wait_queue_t        wait;
     wait_queue_head_t * parent;
     unsigned int        flags;      /* optional */
     struct list_head         cev_list;  /* links to siblings or child
cev_waits as applicable*/
     wait_queue_head_t   *head;    /* head of wait queue on which this
is/was queued  - optional ? */
  };

In this case, for each child:
wait.func() is set to a routine that performs any necessary
transfer/status/count updates from the child to the parent object and issues
a wakeup on the parent's wait queue (it also removes itself from the child's
wait queue, and optionally from the parent's cev_list too).
It is this update step that will be situation/subsystem specific; it would
also have a return value to indicate whether to detach from the parent or not.

And for the parent queue, a cev_wait would be registered at the beginning,
with its wait.func() set up to collate the ios and let completion proceed if
the relevant criteria are met. It can reach all the child cev_waits through
the cev_list links, which is useful for aggregating data from all children.
During i/o cancellation, the status of the parent object is set to indicate
cancellation and a wakeup is issued on its wait queue. The parent cev_wait's
wakeup function, if it recognizes the cancel, would then cancel all the
sub-events.
(Is there a nice way to access the object's status from the wakeup function
that doesn't involve subsystem-specific code ?)

So, it is the step of collating ios and deciding whether to proceed which
is situation/subsystem specific. Similarly, the actual operation
cancellation logic (e.g. cancelling the underlying io request) is also
non-generic.

For this reason, I was toying with the option of introducing two function
pointers - complete() and cancel() - in the cev_wait structure, so that the
rest of the logic in the wakeup function can be kept common. Does that make
sense ?
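
A rough sketch of what a child's wakeup function could look like under this
scheme (assuming the wait queue changes from my last posting - head passed
in, non-zero return value - plus the complete() pointer mentioned above;
locking details glossed over):

 static int cev_child_wakeup(wait_queue_head_t *head, wait_queue_t *wait)
 {
     struct cev_wait *child = (struct cev_wait *) wait;

     child->complete(child);                  /* subsystem-specific collation */
     __remove_wait_queue(head, &child->wait); /* done with the child's queue  */
     wake_up(child->parent);                  /* let the parent cev_wait run  */
     return 1;                                /* non-zero: detach, stop here  */
 }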

Need to define routines for initializing and setting up parent-child
cev_waits.

Right now this assumes that the changes suggested in my last posting can be
made. So still need to think if there is a way to address the cache
efficiency issue (that's a little hard).

Regards
Suparna

  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : bsuparna@in.ibm.com
  Phone : 91-80-5267117, Extn : 2525



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
@ 2001-02-01  4:55 bsuparna
  2001-02-01 12:19 ` Stephen C. Tweedie
  0 siblings, 1 reply; 76+ messages in thread
From: bsuparna @ 2001-02-01  4:55 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel, kiobuf-io-devel, Stephen Tweedie



>My first comment is that this looks very heavyweight indeed.  Isn't it
>just over-engineered?

Yes, I know it is, in its current form (sigh !).

But at the same time, I do not want to give up (not yet, at least) on
trying to arrive at something that can serve the objectives, and yet be
simple in principle and lightweight too. I feel the need may  grow as we
have more filter layers coming in, and as async i/o and even i/o
cancellation usage increases. And it may not be just with kiobufs ...

I took a second pass attempt at it last night based on Ben's wait queue
extensions. Will write that up in a separate note after this. Do let me
know if it seems like any improvement at all.

>We _do_ need the ability to stack completion events, but as far as the
>kiobuf work goes, my current thoughts are to do that by stacking
>lightweight "clone" kiobufs.

Would that work with stackable filesystems ?

>
>The idea is that completion needs to pass upwards (a)
>bytes-transferred, and (b) errno, to satisfy the caller: everything
>else, including any private data, can be hooked by the caller off the
>kiobuf private data (or in fact the caller's private data can embed
>the clone kiobuf).
>
>A clone kiobuf is a simple header, nothing more, nothing less: it
>shares the same page vector as its parent kiobuf.  It has private
>length/offset fields, so (for example) a LVM driver can carve the
>parent kiobuf into multiple non-overlapping children, all sharing the
>same page list but each one actually referencing only a small region
>of the whole.
>
>That ought to clean up a great deal of the problems of passing kiobufs
>through soft raid, LVM or loop drivers.
>
>I am tempted to add fields to allow the children of a kiobuf to be
>tracked and identified, but I'm really not sure it's necessary so I'll
>hold off for now.  We already have the "io-count" field which
>enumerates sub-ios, so we can define each child to count as one such
>sub-io; and adding a parent kiobuf reference to each kiobuf makes a
>lot of sense if we want to make it easy to pass callbacks up the
>stack.  More than that seems unnecessary for now.

Being able to track the children of a kiobuf would help with I/O
cancellation (e.g. to pull sub-ios off their request queues if I/O
cancellation for the parent kiobuf was issued). Not essential, I guess, in
general, but useful in some situations.
With clone kiobufs there is no direct way to reach a clone given the
original kiobuf (without adding some indexing scheme).

>
>--Stephen




* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait  /notify + callback chains
@ 2001-02-01  3:59 bsuparna
  0 siblings, 0 replies; 76+ messages in thread
From: bsuparna @ 2001-02-01  3:59 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Ben LaHaise, linux-kernel, kiobuf-io-devel



>Hi,
>
>On Wed, Jan 31, 2001 at 07:28:01PM +0530, bsuparna@in.ibm.com wrote:
>>
>> Do the following modifications to your wait queue extension sound
>> reasonable ?
>>
>> 1. Change add_wait_queue to add elements to the end of queue (fifo, by
>> default) and instead have an add_wait_queue_lifo() routine that adds to
the
>> head of the queue ?
>
>Cache efficiency: you wake up the task whose data set is most likely
>to be in L1 cache by waking it before its triggering event is flushed
>from cache.
>
>--Stephen

Valid point.



* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
       [not found] <CA2569E5.004D51A7.00@d73mta05.au.ibm.com>
@ 2001-01-31 23:32 ` Stephen C. Tweedie
  0 siblings, 0 replies; 76+ messages in thread
From: Stephen C. Tweedie @ 2001-01-31 23:32 UTC (permalink / raw)
  To: bsuparna; +Cc: Ben LaHaise, Stephen C. Tweedie, linux-kernel, kiobuf-io-devel

Hi,

On Wed, Jan 31, 2001 at 07:28:01PM +0530, bsuparna@in.ibm.com wrote:
> 
> Do the following modifications to your wait queue extension sound
> reasonable ?
> 
> 1. Change add_wait_queue to add elements to the end of queue (fifo, by
> default) and instead have an add_wait_queue_lifo() routine that adds to the
> head of the queue ?

Cache efficiency: you wake up the task whose data set is most likely
to be in L1 cache by waking it before its triggering event is flushed
from cache.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
@ 2001-01-31 13:58 bsuparna
  0 siblings, 0 replies; 76+ messages in thread
From: bsuparna @ 2001-01-31 13:58 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Stephen C. Tweedie, linux-kernel, kiobuf-io-devel


>The waitqueue extension below is a minimalist approach for providing
>kernel support for fully asynchronous io.  The basic idea is that a
>function pointer is added to the wait queue structure that is called
>during wake_up on a wait queue head.  (The patch below also includes
>support for exclusive lifo wakeups, which isn't crucial/perfect, but just
>happened to be part of the code.)  No data other than the function pointer
>is added to the wait queue structure.  Rather, users are expected to make use
>of it by embedding the wait queue structure within their own data
>structure that contains all needed info for running the state machine.

>I suspect that chaining of events should be built on top of the
>primitives, which should be kept as simple as possible.  Comments?

Do the following modifications to your wait queue extension sound
reasonable ?

1. Change add_wait_queue to add elements to the end of queue (fifo, by
default) and instead have an add_wait_queue_lifo() routine that adds to the
head of the queue ?
  [This will help avoid the problem of waiters getting woken up before LIFO
wakeup functions have run, just because the wait happened to have been
issued after the LIFO callbacks were registered, for example, while an IO
is going on]
   Or is there a reason why add_wait_queue adds elements to the head by
default ?

2. Pass the wait_queue_head pointer as a parameter to the wakeup function
(in addition to wait queue entry pointer).
[This will make it easier for the wakeup function to access the structure
in which the wait queue is embedded, i.e. the object the wait queue is
associated with. Without this, we might have to store a pointer to that
object in each element linked on the wait queue. This never was a problem
with sleeping waiters, because a reference to the object being waited on
would have been on the waiter's stack/context, but with wakeup functions
there is no such context]

3. Have __wake_up_common break out of the loop if the wakeup function
returns 1 (or some other value) ?
[This makes it possible to abort the loop based on conditional logic in the
wakeup function ]
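
To make the three suggestions concrete, roughly (an untested sketch against
your patch; add_wait_queue_lifo and the two-argument wakeup signature are
only what I am proposing, not existing code):

/* 1. fifo by default, with an explicit lifo variant */
void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
{
	unsigned long flags;

	wq_write_lock_irqsave(&q->lock, flags);
	wait->flags = 0;
	__add_wait_queue_tail(q, wait);		/* append => fifo */
	wq_write_unlock_irqrestore(&q->lock, flags);
}

void add_wait_queue_lifo(wait_queue_head_t *q, wait_queue_t *wait)
{
	unsigned long flags;

	wq_write_lock_irqsave(&q->lock, flags);
	wait->flags = 0;
	__add_wait_queue(q, wait);		/* prepend => lifo */
	wq_write_unlock_irqrestore(&q->lock, flags);
}

/* 2. pass the wait queue head to the wakeup function as well, and
 * 3. let a non-zero return value abort the wakeup loop; the
 *    corresponding fragment of __wake_up_common would then become:
 */
typedef int (*wait_queue_func_t)(wait_queue_head_t *q, wait_queue_t *wait);

		func = curr->func;
		if (func) {
			unsigned flags = curr->flags;
			if (func(q, curr))	/* non-zero => stop the loop */
				break;
			if (flags & WQ_FLAG_EXCLUSIVE && !--nr_exclusive)
				break;
			continue;
		}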


Regards
Suparna


  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : bsuparna@in.ibm.com
  Phone : 91-80-5267117, Extn : 2525




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
@ 2001-01-30 14:09 bsuparna
  0 siblings, 0 replies; 76+ messages in thread
From: bsuparna @ 2001-01-30 14:09 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: linux-kernel, kiobuf-io-devel


Ben,

This indeed looks neat and simple !
I'd avoided touching the wait queue structure as I suspected that you might
already have something like this in place :-)
and was hoping that you'd see this posting and comment.
I agree entirely that it makes sense to have chaining of events built over
simple minimalist primitives. That's what was making me uncomfortable with
the cev design I had.

So now I'm thinking how to do this using the wait queues extension you
have. Some things to consider:
     - Since non-exclusive waiters are always added to the head of the
queue (unless we use a tq in a wtd kind of structure), ordering of
layered callbacks might still be a problem. (e.g. with an encryption filter
fs, we want the decrypt callback to run before any waiter gets woken up;
irrespective of whether the wait was issued before or after the decrypt
callback was added by the filter layer)
     - The wait_queue_func gets only a pointer to the wait structure as an
argument, with no other means to pass any state about the sub-event that
caused it (could that be a problem with event chaining ... ? every
encapsulating structure will have to maintain a pointer to the related
sub-event ...  ? )
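
As an illustration of what I mean by every encapsulating structure having to
maintain its own link to the sub-event, something like this (names made up,
just a sketch of the embedding pattern, no locking shown):

/* A compound event with two sub-events; each sub-event embeds a
 * wait_queue_t so the wakeup function can recover both the sub-event
 * (by casting, since the wait entry is the first member) and the
 * enclosing compound (via an explicit back pointer).
 */
struct sub_event {
	wait_queue_t		wait;	/* must be first so the cast works */
	struct compound_event	*owner;	/* the extra state we have to carry */
	int			index;	/* which sub-event fired */
};

struct compound_event {
	struct sub_event	sub[2];
	int			pending;
	void			(*done)(struct compound_event *);
};

static void sub_event_waiter(wait_queue_t *wait)
{
	struct sub_event *se = (struct sub_event *)wait;
	struct compound_event *ce = se->owner;

	if (--ce->pending == 0)		/* sketch only: no locking */
		ce->done(ce);
}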

Regards
Suparna


  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : bsuparna@in.ibm.com
  Phone : 91-80-5267117, Extn : 2525


Ben LaHaise <bcrl@redhat.com> on 01/30/2001 10:59:46 AM

Please respond to Ben LaHaise <bcrl@redhat.com>

To:   Suparna Bhattacharya/India/IBM@IBMIN
cc:   linux-kernel@vger.kernel.org, kiobuf-io-devel@lists.sourceforge.net
Subject:  Re: [Kiobuf-io-devel] RFC:  Kernel mechanism: Compound event
      wait/notify + callback chains




On Tue, 30 Jan 2001 bsuparna@in.ibm.com wrote:

>
> Comments, suggestions, advise, feedback solicited !
>
> If this seems like something that might (after some refinements) be a
> useful abstraction to have, then I need some help in straightening out
the
> design. I am not very satisfied with it in its current form.

Here's my first bit of feedback, from the point of view of "this is what my
code currently does and why".

The waitqueue extension below is a minimalist approach for providing
kernel support for fully asynchronous io.  The basic idea is that a
function pointer is added to the wait queue structure that is called
during wake_up on a wait queue head.  (The patch below also includes
support for exclusive lifo wakeups, which isn't crucial/perfect, but just
happened to be part of the code.)  No data other than the function pointer
is added to the wait queue structure.  Rather, users are expected to make use
of it by embedding the wait queue structure within their own data
structure that contains all needed info for running the state machine.

Here's a snippet of code which demonstrates a non blocking lock of a page
cache page:

struct worktodo {
     wait_queue_t            wait;
     struct tq_struct        tq;
     void *data;
};

static void __wtd_lock_page_waiter(wait_queue_t *wait)
{
        struct worktodo *wtd = (struct worktodo *)wait;
        struct page *page = (struct page *)wtd->data;

        if (!TryLockPage(page)) {
                __remove_wait_queue(&page->wait, &wtd->wait);
                wtd_queue(wtd);
        } else {
                schedule_task(&run_disk_tq);
        }
}

void wtd_lock_page(struct worktodo *wtd, struct page *page)
{
        if (TryLockPage(page)) {
                int raced = 0;
                wtd->data = page;
                init_waitqueue_func_entry(&wtd->wait, __wtd_lock_page_waiter);
                add_wait_queue_cond(&page->wait, &wtd->wait,
                                    TryLockPage(page), raced = 1);

                if (!raced) {
                        run_task_queue(&tq_disk);
                        return;
                }
        }

        wtd->tq.routine(wtd->tq.data);
}
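
For example, a caller could drive a two-stage state machine with it roughly
like so (illustrative only; the wtd_queue/schedule_task plumbing is as in the
wtd helpers above, and error handling is glossed over):

static void my_read_stage2(void *data)
{
        struct worktodo *wtd = data;
        struct page *page = (struct page *)wtd->data;

        /* the page is locked at this point; do the real work here */
        UnlockPage(page);
        kfree(wtd);
}

static int my_read_start(struct page *page)
{
        struct worktodo *wtd = kmalloc(sizeof(*wtd), GFP_KERNEL);

        if (!wtd)
                return -ENOMEM;
        wtd->data = page;
        wtd->tq.sync = 0;
        INIT_LIST_HEAD(&wtd->tq.list);
        wtd->tq.routine = my_read_stage2;
        wtd->tq.data = wtd;

        /* runs stage2 immediately if the lock was free, or queues it from
         * the wakeup function once the page becomes unlocked */
        wtd_lock_page(wtd, page);
        return 0;
}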


Wakeup functions are also useful for waking a specific reader or writer in
the rw_sems, for making semaphores avoid spurious wakeups, etc.

I suspect that chaining of events should be built on top of the
primitives, which should be kept as simple as possible.  Comments?

          -ben


diff -urN v2.4.1pre10/include/linux/mm.h work/include/linux/mm.h
--- v2.4.1pre10/include/linux/mm.h Fri Jan 26 19:03:05 2001
+++ work/include/linux/mm.h   Fri Jan 26 19:14:07 2001
@@ -198,10 +198,11 @@
  */
 #define UnlockPage(page)     do { \
                         smp_mb__before_clear_bit(); \
+                        if (!test_bit(PG_locked, &(page)->flags)) { printk("last: %p\n", (page)->last_unlock); BUG(); } \
+                        (page)->last_unlock = current_text_addr(); \
                         if (!test_and_clear_bit(PG_locked, &(page)->flags)) BUG(); \
                         smp_mb__after_clear_bit(); \
-                        if (waitqueue_active(&page->wait)) \
-                             wake_up(&page->wait); \
+                        wake_up(&page->wait); \
                    } while (0)
 #define PageError(page)      test_bit(PG_error, &(page)->flags)
 #define SetPageError(page)   set_bit(PG_error, &(page)->flags)
diff -urN v2.4.1pre10/include/linux/sched.h work/include/linux/sched.h
--- v2.4.1pre10/include/linux/sched.h    Fri Jan 26 19:03:05 2001
+++ work/include/linux/sched.h     Fri Jan 26 19:14:07 2001
@@ -751,6 +751,7 @@

 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_exclusive_lifo(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));

 #define __wait_event(wq, condition)                         \
diff -urN v2.4.1pre10/include/linux/wait.h work/include/linux/wait.h
--- v2.4.1pre10/include/linux/wait.h     Thu Jan  4 17:50:46 2001
+++ work/include/linux/wait.h Fri Jan 26 19:14:06 2001
@@ -43,17 +43,20 @@
 } while (0)
 #endif

+typedef struct __wait_queue wait_queue_t;
+typedef void (*wait_queue_func_t)(wait_queue_t *wait);
+
 struct __wait_queue {
     unsigned int flags;
 #define WQ_FLAG_EXCLUSIVE    0x01
     struct task_struct * task;
     struct list_head task_list;
+    wait_queue_func_t func;
 #if WAITQUEUE_DEBUG
     long __magic;
     long __waker;
 #endif
 };
-typedef struct __wait_queue wait_queue_t;

 /*
  * 'dual' spinlock architecture. Can be switched between spinlock_t and
@@ -110,7 +113,7 @@
 #endif

 #define __WAITQUEUE_INITIALIZER(name,task) \
-    { 0x0, task, { NULL, NULL } __WAITQUEUE_DEBUG_INIT(name)}
+    { 0x0, task, { NULL, NULL }, NULL __WAITQUEUE_DEBUG_INIT(name)}
 #define DECLARE_WAITQUEUE(name,task) \
     wait_queue_t name = __WAITQUEUE_INITIALIZER(name,task)

@@ -144,6 +147,22 @@
 #endif
     q->flags = 0;
     q->task = p;
+    q->func = NULL;
+#if WAITQUEUE_DEBUG
+    q->__magic = (long)&q->__magic;
+#endif
+}
+
+static inline void init_waitqueue_func_entry(wait_queue_t *q,
+                        wait_queue_func_t func)
+{
+#if WAITQUEUE_DEBUG
+    if (!q || !func)
+         WQ_BUG();
+#endif
+    q->flags = 0;
+    q->task = NULL;
+    q->func = func;
 #if WAITQUEUE_DEBUG
     q->__magic = (long)&q->__magic;
 #endif
@@ -200,6 +219,19 @@
 #endif
     list_del(&old->task_list);
 }
+
+#define add_wait_queue_cond(q, wait, cond, fail) \
+    do {                                \
+         unsigned long flags;                     \
+         wq_write_lock_irqsave(&(q)->lock, flags);     \
+         (wait)->flags = 0;                  \
+         if (cond)                      \
+              __add_wait_queue((q), (wait));           \
+         else {                              \
+              fail;                          \
+         }                              \
+         wq_write_unlock_irqrestore(&(q)->lock, flags);     \
+    } while (0)

 #endif /* __KERNEL__ */

diff -urN v2.4.1pre10/kernel/fork.c work/kernel/fork.c
--- v2.4.1pre10/kernel/fork.c Fri Jan 26 19:03:05 2001
+++ work/kernel/fork.c   Fri Jan 26 19:06:29 2001
@@ -44,6 +44,16 @@
     wq_write_unlock_irqrestore(&q->lock, flags);
 }

+void add_wait_queue_exclusive_lifo(wait_queue_head_t *q, wait_queue_t * wait)
+{
+    unsigned long flags;
+
+    wq_write_lock_irqsave(&q->lock, flags);
+    wait->flags = WQ_FLAG_EXCLUSIVE;
+    __add_wait_queue(q, wait);
+    wq_write_unlock_irqrestore(&q->lock, flags);
+}
+
 void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait)
 {
     unsigned long flags;
diff -urN v2.4.1pre10/kernel/sched.c work/kernel/sched.c
--- v2.4.1pre10/kernel/sched.c     Fri Jan 26 19:03:05 2001
+++ work/kernel/sched.c  Fri Jan 26 19:10:19 2001
@@ -714,12 +714,22 @@
     while (tmp != head) {
          unsigned int state;
                wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
+         wait_queue_func_t func;

          tmp = tmp->next;

 #if WAITQUEUE_DEBUG
          CHECK_MAGIC(curr->__magic);
 #endif
+         func = curr->func;
+         if (func) {
+              unsigned flags = curr->flags;
+              func(curr);
+              if (flags & WQ_FLAG_EXCLUSIVE && !--nr_exclusive)
+                   break;
+              continue;
+         }
+
          p = curr->task;
          state = p->state;
          if (state & mode) {


_______________________________________________
Kiobuf-io-devel mailing list
Kiobuf-io-devel@lists.sourceforge.net
http://lists.sourceforge.net/lists/listinfo/kiobuf-io-devel



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2001-02-06 14:08 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-02-04 13:24 [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains bsuparna
  -- strict thread matches above, loose matches on Subject: below --
2001-02-06 13:50 bsuparna
2001-02-06 14:07 ` Jens Axboe
     [not found] <CA2569EA.00506BBC.00@d73mta05.au.ibm.com>
2001-02-05 15:01 ` Stephen C. Tweedie
2001-02-05 14:31 bsuparna
     [not found] <CA2569E9.004A4E23.00@d73mta05.au.ibm.com>
2001-02-05 12:09 ` Stephen C. Tweedie
2001-02-02 15:31 bsuparna
2001-02-01 14:44 bsuparna
2001-02-01 15:09 ` Christoph Hellwig
2001-02-01 16:08   ` Steve Lord
2001-02-01 16:49     ` Stephen C. Tweedie
2001-02-01 17:02       ` Christoph Hellwig
2001-02-01 17:34         ` Alan Cox
2001-02-01 17:49           ` Stephen C. Tweedie
2001-02-01 17:09             ` Chaitanya Tumuluri
2001-02-01 20:33             ` Christoph Hellwig
2001-02-01 20:56               ` Steve Lord
2001-02-01 20:59                 ` Christoph Hellwig
2001-02-01 21:17                   ` Steve Lord
2001-02-01 21:44               ` Stephen C. Tweedie
2001-02-01 22:07               ` Stephen C. Tweedie
2001-02-02 12:02                 ` Christoph Hellwig
2001-02-05 12:19                   ` Stephen C. Tweedie
2001-02-05 21:28                     ` Ingo Molnar
2001-02-05 22:58                       ` Stephen C. Tweedie
2001-02-05 23:06                         ` Alan Cox
2001-02-05 23:16                           ` Stephen C. Tweedie
2001-02-06  0:19                         ` Manfred Spraul
2001-02-03 20:28                 ` Linus Torvalds
2001-02-05 11:03                   ` Stephen C. Tweedie
2001-02-05 12:00                     ` Manfred Spraul
2001-02-05 15:03                       ` Stephen C. Tweedie
2001-02-05 15:19                         ` Alan Cox
2001-02-05 17:20                           ` Stephen C. Tweedie
2001-02-05 17:29                             ` Alan Cox
2001-02-05 18:49                               ` Stephen C. Tweedie
2001-02-05 19:04                                 ` Alan Cox
2001-02-05 19:09                                 ` Linus Torvalds
2001-02-05 22:09                         ` Ingo Molnar
2001-02-05 16:56                       ` Linus Torvalds
2001-02-05 16:36                     ` Linus Torvalds
2001-02-05 19:08                       ` Stephen C. Tweedie
2001-02-01 17:49           ` Christoph Hellwig
2001-02-01 17:58             ` Alan Cox
2001-02-01 18:32               ` Rik van Riel
2001-02-01 18:59                 ` yodaiken
2001-02-01 19:33             ` Stephen C. Tweedie
2001-02-01 18:51           ` bcrl
2001-02-01 16:16   ` Stephen C. Tweedie
2001-02-01 17:05     ` Christoph Hellwig
2001-02-01 17:09       ` Christoph Hellwig
2001-02-01 17:41       ` Stephen C. Tweedie
2001-02-01 18:14         ` Christoph Hellwig
2001-02-01 18:25           ` Alan Cox
2001-02-01 18:39             ` Rik van Riel
2001-02-01 18:48             ` Christoph Hellwig
2001-02-01 18:57               ` Alan Cox
2001-02-01 19:00                 ` Christoph Hellwig
2001-02-01 19:32           ` Stephen C. Tweedie
2001-02-01 20:46             ` Christoph Hellwig
2001-02-01 21:25               ` Stephen C. Tweedie
2001-02-02 11:51                 ` Christoph Hellwig
2001-02-02 14:04                   ` Stephen C. Tweedie
2001-02-02  4:18           ` bcrl
2001-02-02 12:12             ` Christoph Hellwig
2001-02-01 20:04         ` Chaitanya Tumuluri
2001-02-01 13:20 bsuparna
2001-02-01  7:58 bsuparna
2001-02-01 12:39 ` Stephen C. Tweedie
2001-02-01  4:55 bsuparna
2001-02-01 12:19 ` Stephen C. Tweedie
2001-02-01 16:30   ` Chaitanya Tumuluri
2001-02-01  3:59 bsuparna
     [not found] <CA2569E5.004D51A7.00@d73mta05.au.ibm.com>
2001-01-31 23:32 ` Stephen C. Tweedie
2001-01-31 13:58 bsuparna
2001-01-30 14:09 bsuparna

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).