linux-kernel.vger.kernel.org archive mirror
* [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
@ 2001-01-08  1:24 ` David S. Miller
  2001-01-08 10:39   ` Christoph Hellwig
                     ` (3 more replies)
  0 siblings, 4 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-08  1:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev


I've put a patch up for testing on the kernel.org mirrors:

/pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz

It provides a framework for zerocopy transmits and delayed
receive fragment coalescing.  TUX-1.01 uses this framework.

Zerocopy transmit requires some driver support; things run
as they did before for drivers which do not have the support
added.  Currently sg+csum driver support has been added to the
Acenic, 3c59x, sunhme, and loopback drivers.  We had eepro100
support coded at one point, but it was removed because we didn't know
how to identify the cards which support hw csum assist vs. the ones
which do not.
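
To make "driver support" concrete, here is a rough sketch of what the
sg+csum transmit path amounts to (the foo_* helpers are hypothetical;
skb_shinfo(), skb_frag_t and the NETIF_F_* feature flags follow the
patch's new skb API, though details may differ by revision):

	static int foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		int i;

		/* Linear part first: the protocol headers (and any
		 * unfragmented data) still live in skb->data;
		 * skb_headlen() is the length of that linear part. */
		foo_queue_buf(dev, skb->data, skb_headlen(skb));

		/* Then one descriptor per page fragment. */
		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
			skb_frag_t *f = &skb_shinfo(skb)->frags[i];

			foo_queue_page(dev, f->page, f->page_offset, f->size);
		}

		foo_kick_tx(dev);
		return 0;
	}

	/* ...and at init time the driver advertises the capability: */
	dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;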

I would like people to test this hard and report bugs they may
discover.  _PLEASE_ try to see if 2.4.0 without this patch produces
the same problem, and if so report it as a 2.4.0 bug, _not_ as a
bug in the zerocopy patch.  Thank you.

In particular, I am interested in hearing about any new breakage
caused by the zerocopy patches when using netfilter.  When reporting
bugs, please note which networking card you are using, since whether
the card is actually using hw csum assist and sg support is an
important data point.

Finally, regardless of networking card, there should be a measurable
performance boost for NFS clients with this patch due to the delayed
fragment coalescing.  KNFSD does not take full advantage of this
facility yet.

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 10:39   ` Christoph Hellwig
@ 2001-01-08 10:34     ` David S. Miller
  2001-01-08 18:05       ` Rik van Riel
  0 siblings, 1 reply; 119+ messages in thread
From: David S. Miller @ 2001-01-08 10:34 UTC (permalink / raw)
  To: hch; +Cc: netdev, linux-kernel

   Date: Mon, 8 Jan 2001 11:39:15 +0100
   From: Christoph Hellwig <hch@caldera.de>

   don't you think the writepage file operation is rather hackish?

Not at all, it's simply direct sendfile support.  It does
not try to be any fancier than that.

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08  1:24 ` David S. Miller
@ 2001-01-08 10:39   ` Christoph Hellwig
  2001-01-08 10:34     ` David S. Miller
  2001-01-08 21:48   ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 David S. Miller
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-08 10:39 UTC (permalink / raw)
  To: "David S. Miller"; +Cc: netdev, linux-kernel

In article <200101080124.RAA08134@pizda.ninka.net> you wrote:

> I've put a patch up for testing on the kernel.org mirrors:
>
> /pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz
>
> It provides a framework for zerocopy transmits and delayed
> receive fragment coalescing.  TUX-1.01 uses this framework.

Hi Dave,

don't you think the writepage file operation is rather hackish?
I'd much prefer Ben LaHaise's rw_kiovec [1] operation; it is more
generic (supports read and write) and should be easily usable for
zerocopy networking with plain old write (using map_user_kio).
Besides that, the FS crew thinks it should go in soon because of
aio anyway...

	Christoph


[1] for those that don't know yet, the prototype is:

	rw_kiovec(struct file * filp, int rw, int nr,
		struct kiobuf ** kiovec, int flags,
		size_t size, loff_t pos);
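
A hypothetical sketch of how a zerocopy write could sit on top of that
operation (alloc_kiovec/map_user_kiobuf/unmap_kiobuf/free_kiovec are the
existing 2.4 kiobuf calls; rw_kiovec as a file operation is only the
proposal above, and error handling is omitted):

	struct kiobuf *iobuf;

	alloc_kiovec(1, &iobuf);
	/* Pin the user buffer's pages into the kiobuf's maplist[]. */
	map_user_kiobuf(WRITE, iobuf, (unsigned long) user_buf, len);
	filp->f_op->rw_kiovec(filp, WRITE, 1, &iobuf, 0, len, filp->f_pos);
	unmap_kiobuf(iobuf);
	free_kiovec(1, &iobuf);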
-- 
Whip me.  Beat me.  Make me maintain AIX.

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 10:34     ` David S. Miller
@ 2001-01-08 18:05       ` Rik van Riel
  2001-01-08 21:07         ` David S. Miller
  2001-01-09 10:23         ` Ingo Molnar
  0 siblings, 2 replies; 119+ messages in thread
From: Rik van Riel @ 2001-01-08 18:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: hch, netdev, linux-kernel

On Mon, 8 Jan 2001, David S. Miller wrote:
>    From: Christoph Hellwig <hch@caldera.de>
> 
>    don't you think the writepage file operation is rather hackish?
> 
> Not at all, it's simply direct sendfile support.  It does
> not try to be any fancier than that.

I really think the zerocopy network stuff should be ported
to kiobuf proper.

The usefulness of the patch you posted is rather .. umm ..
limited. Having proper kiobuf support would make it possible
to, for example, do zerocopy network->disk data transfers
and lots of other things.

Furthermore, by using kiobuf for the network zerocopy stuff
there's a good chance the networking code will be integrated.
Otherwise we just might end up with a zero-copy-for-everything-
except-networking Linux 2.5 kernel ;)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 18:05       ` Rik van Riel
@ 2001-01-08 21:07         ` David S. Miller
  2001-01-09 10:23         ` Ingo Molnar
  1 sibling, 0 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-08 21:07 UTC (permalink / raw)
  To: riel; +Cc: hch, netdev, linux-kernel

   Date: Mon, 8 Jan 2001 16:05:23 -0200 (BRDT)
   From: Rik van Riel <riel@conectiva.com.br>

   I really think the zerocopy network stuff should be ported
   to kiobuf proper.

That is how it could be done in 2.5.x, sure.

But this patch is intended for 2.4.x, so "minimum impact"
applies.

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08  1:24 ` David S. Miller
  2001-01-08 10:39   ` Christoph Hellwig
@ 2001-01-08 21:48   ` David S. Miller
  2001-01-08 22:32     ` Jes Sorensen
  2001-01-08 22:36     ` David S. Miller
  2001-01-09 13:42   ` David S. Miller
  2001-01-09 13:52   ` Trond Myklebust
  3 siblings, 2 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-08 21:48 UTC (permalink / raw)
  To: jes; +Cc: linux-kernel, netdev

   From: Jes Sorensen <jes@linuxcare.com>
   Date: 08 Jan 2001 22:56:48 +0100

   I don't think it's too much to ask that one actually tries to
   communicate with the author of a piece of code before making such
   major changes and submitting them for inclusion in the
   kernel.

Jes, I have not submitted this for inclusion into the kernel.

This is the "everyone, including driver authors, take a look"
part of the development process.

We _had_ to change some drivers to show how to support this
new SKB api for transmit sg+csum support.  If you can think of
a way for us to effectively do this work without changing at least a
few drivers as examples (and proof of concept), please let us know.

In the process we hit real bugs in your driver, and tried to deal
with them as best we could so that we could continue testing and
debugging our own code.

As a side note, as much as you may hate some of Alexey's changes to
your driver, several of the things he does fix long-standing, real bugs
in the Acenic driver that you've been papering over with workarounds
for quite some time.  I would even go so far as to say that in many
regards Alexey understands the Acenic much better than you, and you
would be wise to work with Alexey and not against him.  Thanks.

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08  1:24 ` David S. Miller
@ 2001-01-08 21:56 Jes Sorensen
  2001-01-08  1:24 ` David S. Miller
  3 siblings, 1 reply; 119+ messages in thread
From: Jes Sorensen @ 2001-01-08 21:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> "David" == David S Miller <davem@redhat.com> writes:

David> I've put a patch up for testing on the kernel.org mirrors:

David> /pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz

David> It provides a framework for zerocopy transmits and delayed
David> receive fragment coalescing.  TUX-1.01 uses this framework.

David> Zerocopy transmit requires some driver support; things run as
David> they did before for drivers which do not have the support
David> added.  Currently sg+csum driver support has been added to the
David> Acenic, 3c59x, sunhme, and loopback drivers.  We had eepro100
David> support coded at one point, but it was removed because we
David> didn't know how to identify the cards which support hw csum
David> assist vs. the ones which do not.

I haven't had time to test this patch, but looking over the changes to
the Acenic driver I have to say that I am quite displeased with the
way the changes were done.  I can't comment on how the authors of the
other drivers which were changed feel about it.  However, I find it
highly annoying that someone goes off and makes major cosmetic and
structural changes to someone else's code without even consulting the
author who happens to maintain the code.  It doesn't help that the
patch reverts changes that should not have been reverted.

I don't think it's too much to ask that one actually tries to
communicate with the author of a piece of code before making such major
changes and submitting them for inclusion in the kernel.

Jes

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 21:48   ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 David S. Miller
@ 2001-01-08 22:32     ` Jes Sorensen
  2001-01-08 22:37       ` David S. Miller
  2001-01-08 22:43       ` Stephen Frost
  2001-01-08 22:36     ` David S. Miller
  1 sibling, 2 replies; 119+ messages in thread
From: Jes Sorensen @ 2001-01-08 22:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> "David" == David S Miller <davem@redhat.com> writes:

David> We _had_ to change some drivers to show how to support this new
David> SKB api for transmit sg+csum support.  If you can think of a
David> way for us to effectively do this work without changing at
David> least a few drivers as examples (and proof of concept), please
David> let us know.

Dave, I am not complaining about drivers having to be changed for this
to work; I am fully aware of this need.  My complaints are about how
this is being done, i.e. some people try to maintain drivers and have
certain ideas about how they structure their code etc.  If you had sent
me a short email saying "this is what we plan to do and this is what we
think should be done to your code, what's your opinion?", I would have
volunteered to help write the code and get the stuff integrated much
earlier, as well as given you my input on how I would like to see the
changes implemented.  Instead we now have a fairly large patch which
will take me a long time to merge into the driver version that I
maintain.

David> In the process we hit real bugs in your driver, and tried to
David> deal with them as best we could so that we could continue
David> testing and debugging our own code.

I would have appreciated a simple email saying "we found bug X in your
driver" with either a patch attached or a short note of your
observations.

David> As a side note, as much as you may hate some of Alexey's
David> changes to your driver, several of the things he does fix
David> long-standing, real bugs in the Acenic driver that you've been
David> papering over with workarounds for quite some time.  I would
David> even go so far as to say that in many regards Alexey
David> understands the Acenic much better than you, and you would be
David> wise to work with Alexey and not against him.  Thanks.

I don't question Alexey's skills and I have no intentions of working
against him. All I am asking is that someone lets me know if they make
major changes to my code so I can keep track of what's happening. It is
really hard to maintain code if you work on major changes while
someone else branches off in a different direction without you
knowing. It's simply a waste of everybody's time.

Thanks
Jes

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 21:48   ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 David S. Miller
  2001-01-08 22:32     ` Jes Sorensen
@ 2001-01-08 22:36     ` David S. Miller
  2001-01-09 12:12       ` Ingo Molnar
  1 sibling, 1 reply; 119+ messages in thread
From: David S. Miller @ 2001-01-08 22:36 UTC (permalink / raw)
  To: jes; +Cc: linux-kernel, netdev

   From: Jes Sorensen <jes@linuxcare.com>
   Date: 08 Jan 2001 23:32:48 +0100

   All I am asking is that someone lets me know if they make major
   changes to my code so I can keep track of what's happening.

We have not made any major changes to your code, in light of this
not being code which is actually being submitted yet.

If it bothers you that publicly someone has published changes to your
driver which you disagree with, oh well... :-)

This "please check things out" phase is precisely what you are
asking of us; it is how we are saying "here is what we need to
do with your driver, please comment".

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 22:32     ` Jes Sorensen
@ 2001-01-08 22:37       ` David S. Miller
  2001-01-08 22:43       ` Stephen Frost
  1 sibling, 0 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-08 22:37 UTC (permalink / raw)
  To: sfrost; +Cc: jes, linux-kernel, netdev

   Date: Mon, 8 Jan 2001 17:43:56 -0500
   From: Stephen Frost <sfrost@snowman.net>

	   Perhaps you missed it, but I believe Dave's intent is for
   this to only be a proof-of-concept idea at this time.

Thank you, Stephen; this is the point Jes continues to miss.

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 22:32     ` Jes Sorensen
  2001-01-08 22:37       ` David S. Miller
@ 2001-01-08 22:43       ` Stephen Frost
  1 sibling, 0 replies; 119+ messages in thread
From: Stephen Frost @ 2001-01-08 22:43 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: David S. Miller, linux-kernel, netdev

* Jes Sorensen (jes@linuxcare.com) wrote:
> >>>>> "David" == David S Miller <davem@redhat.com> writes:
> 
> I don't question Alexey's skills and I have no intentions of working
> against him. All I am asking is that someone lets me know if they make
> major changes to my code so I can keep track of whats happening. It is
> really hard to maintain code if you work on major changes while
> someone else branches off in a different direction without you
> knowing. It's simply a waste of everybody's time.

	Perhaps you missed it, but I believe Dave's intent is for this to
only be a proof-of-concept idea at this time.  These changes are not 
currently up for inclusion into the mainstream kernel.  I cannot imagine
that Dave would ever just step around a maintainer and submit a patch to
Linus for large changes.

	If many people test these and things work out well for them,
then I'm sure Dave will go back to the maintainers with the code and
the API and work with them to get it into the mainstream kernel,
soliciting ideas and suggestions on how to improve the API and the code
paths in the drivers to handle this new method most effectively.

		Stephen


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 18:05       ` Rik van Riel
  2001-01-08 21:07         ` David S. Miller
@ 2001-01-09 10:23         ` Ingo Molnar
  2001-01-09 10:31           ` David S. Miller
                             ` (3 more replies)
  1 sibling, 4 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 10:23 UTC (permalink / raw)
  To: Rik van Riel; +Cc: David S. Miller, hch, netdev, linux-kernel


On Mon, 8 Jan 2001, Rik van Riel wrote:

> I really think the zerocopy network stuff should be ported to kiobuf
> proper.

yep, we talked to Stephen Tweedie about this already, but it involves some
changes in kiovec support and we didn't want to touch too much code for
2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses vectors of
struct page *, offset, size entities), so transition to a finalized kiovec
framework (or whatever other mechanism) is trivial. Right now kiovecs are
*way* too bloated for the purposes of skb fragments.

> The usefulness of the patch you posted is rather .. umm .. limited.
> [...]

i violently disagree :-) The upcoming TUX release is based on David's and
Alexey's cleaned-up zerocopy framework. [thus TUX and zerocopy are
separated.] David's patch adds a *very* scalable implementation of
zerocopy sendfile() and zerocopy sendmsg(), the panacea of fileserver
(webserver) scalability - it can be used by Apache, Samba and other
fileservers. The new zerocopy networking code DMAs straight out of the
pagecache, and natively supports hardware checksumming, highmem zerocopy
(64-bit DMA on 32-bit systems) and multi-fragment DMA - no limitations.
We can saturate a gigabit link with TCP traffic, at about 20% CPU usage
on a 500 MHz x86 UP system. David and Alexey's patch is cool - check it
out!

> Having proper kiobuf support would make it possible to, for example,
> do zerocopy network->disk data transfers and lots of other things.

i used to think that this is useful, but these days it isn't. It's a waste
of PCI bandwidth resources, and it's much cheaper to keep a cache in RAM
instead of doing direct disk=>network DMA *every time* some resource is
requested.

> Furthermore, by using kiobuf for the network zerocopy stuff there's a
> good chance the networking code will be integrated.

David and Alexey are the TCP/IP networking code maintainers. So if you see
a 'test this' networking framework patch from them on l-k, it has a quite
high chance of being integrated into the networking code :-)

	Ingo


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 10:23         ` Ingo Molnar
@ 2001-01-09 10:31           ` David S. Miller
  2001-01-09 11:28             ` Christoph Hellwig
  2001-01-09 11:42             ` David S. Miller
  2001-01-09 10:31           ` Christoph Hellwig
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-09 10:31 UTC (permalink / raw)
  To: hch; +Cc: mingo, riel, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 11:31:45 +0100
   From: Christoph Hellwig <hch@caldera.de>

   Yuck.  A new file op just to get a few benchmarks right ...  I
   hope the writepages stuff will not be merged into Linus' tree
   (though I do wish the code behind it would be!)

It's a "I know how to send a page somewhere via this filedescriptor
all by myself" operation.  I don't see why people need to take
painkillers over this for 2.4.x.  I think f_op->write is stupid, such
a special case file operation just to get a few benchmarks right.
This is the kind of argument I am hearing.

Orthogonal to f_op->write being for specifying a low-level
implementation of sys_write, f_op->writepage is for specifying a
low-level implementation of sys_sendfile.  Can you grok that?
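
For reference, a sketch of the shape of the operation under discussion
(the signature here follows the form this op eventually took in mainline
as f_op->sendpage; the patch's writepage op may differ in detail):

	/* Low-level "push this page down this file descriptor"
	 * operation: one page, an offset/length within it, and a
	 * "more data is coming" hint for frame coalescing. */
	ssize_t (*writepage)(struct file *file, struct page *page,
			     int offset, size_t size, loff_t *ppos,
			     int more);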

Linus has already seen this.  Originally he had a gripe because an
older revision of the code allowed multiple pages to be passed
in an array to the writepage(s) operation.  He didn't like that, so I
made it take only one page as he requested.  He had no other major
objections to the infrastructure.

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 10:23         ` Ingo Molnar
  2001-01-09 10:31           ` David S. Miller
@ 2001-01-09 10:31           ` Christoph Hellwig
  2001-01-09 11:05             ` Ingo Molnar
  2001-01-09 14:18           ` Stephen C. Tweedie
  2001-01-10  2:56           ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) dean gaudet
  3 siblings, 1 reply; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-09 10:31 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Rik van Riel, David S. Miller, netdev, linux-kernel

On Tue, Jan 09, 2001 at 11:23:41AM +0100, Ingo Molnar wrote:
> 
> On Mon, 8 Jan 2001, Rik van Riel wrote:
> 
> > I really think the zerocopy network stuff should be ported to kiobuf
> > proper.
> 
> yep, we talked to Stephen Tweedie about this already, but it involves some
> changes in kiovec support and we didn't want to touch too much code for
> 2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses vectors of
> struct page *, offset, size entities),

Yep.  That is why I was so worried about the writepages file op.
It's rather hackish (only write, looks useful only for networking)
instead of the proposed rw_kiovec fop.

> 
> > The usefulness of the patch you posted is rather .. umm .. limited.
> > [...]
> 
> i violently disagree :-) The upcoming TUX release is based on David's and
> Alexey's cleaned-up zerocopy framework. [thus TUX and zerocopy are
> separated.] David's patch adds a *very* scalable implementation of
> zerocopy sendfile() and zerocopy sendmsg(), the panacea of fileserver
> (webserver) scalability - it can be used by Apache, Samba and other
> fileservers. The new zerocopy networking code DMAs straight out of the
> pagecache, and natively supports hardware checksumming, highmem zerocopy
> (64-bit DMA on 32-bit systems) and multi-fragment DMA - no
> limitations. We can saturate a gigabit link with TCP traffic, at about 20%
> CPU usage on a 500 MHz x86 UP system. David and Alexey's patch is cool -
> check it out!

Yuck.  A new file op just to get a few benchmarks right ...
I hope the writepages stuff will not be merged into Linus' tree
(though I do wish the code behind it would be!)

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 10:31           ` Christoph Hellwig
@ 2001-01-09 11:05             ` Ingo Molnar
  2001-01-09 18:27               ` Christoph Hellwig
  0 siblings, 1 reply; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 11:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Rik van Riel, David S. Miller, netdev, linux-kernel


On Tue, 9 Jan 2001, Christoph Hellwig wrote:

> > 2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses
> > vectors of struct page *, offset, size entities),

> Yep. That is why I was so worried about the writepages file op.

i believe you misunderstand. kiovecs (in their current form) are simply
too bloated for networking purposes. Due to its nature and non-persistency,
networking is very lightweight and memory-footprint-sensitive code (as
opposed to eg. block IO code). Right now a 'struct skb_shared_info'
[which is roughly equivalent to a kiovec] is 12+4*6 == 36 bytes, which
includes support for 6 distinct fragments (each fragment can be on any
page, any offset, any size). A *single* kiobuf (which is roughly
equivalent to an skb fragment) is 52+16*4 == 116 bytes. 6 of these would
be 696 bytes, for a single TCP packet (!!!). This is simply not something
to be used for lightweight zero-copy networking.
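
For reference, the structure being sized above looks roughly like this
(field names per the zerocopy patch; exact layout and byte counts vary
by revision and architecture):

	typedef struct skb_frag_struct {
		struct page	*page;		/* any page   */
		__u16		page_offset;	/* any offset */
		__u16		size;		/* any size   */
	} skb_frag_t;

	struct skb_shared_info {
		atomic_t	dataref;
		unsigned int	nr_frags;
		struct sk_buff	*frag_list;	/* for fragment chains */
		skb_frag_t	frags[MAX_SKB_FRAGS];	/* 6 in this patch */
	};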

so it's easy to say 'use kiovecs', but right now it's simply not
practical. kiobufs are a loaded concept, and i'm not sure whether it's
desirable at all to mix networking zero-copy concepts with
block-IO/filesystem zero-copy concepts. Just to make it even more clear:
although i do believe it to be desirable from an architectural point of
view, i'm not sure at all whether it's possible, based on the experience
we gathered while implementing TCP-zerocopy.

we talked (and are talking) to Stephen about this problem, but it's
clearly a 2.5 kernel issue. Merging into a finalized zero-copy framework will
be easy. (The overwhelming percentage of zero-copy code is in the
networking code itself and is insensitive to any kiovec issues.)

> It's rather hackish (only write, looks useful only for networking)
> instead of the proposed rw_kiovec fop.

i'm not sure what you are trying to say. You mean we should remove
sendfile() as well? It's only write, looks useful mostly for networking. A
substantial percentage of kernel code is useful only for networking :-)

> > zerocopy sendfile() and zerocopy sendmsg(), the panacea of fileserver
> > (webserver) scalability - it can be used by Apache, Samba and other
> > fileservers. The new zerocopy networking code DMAs straight out of the
> > pagecache, and natively supports hardware checksumming, highmem zerocopy
> > (64-bit DMA on 32-bit systems) and multi-fragment DMA - no
> > limitations. We can saturate a gigabit link with TCP traffic, at about
> > 20% CPU usage on a 500 MHz x86 UP system. David and Alexey's patch is
> > cool - check it out!

> Yuck. A new file op just to get a few benchmarks right ...

no. As David said, it's direct sendfile() support. It's completely
isolated, it's 20 lines of code, it does not impact filesystems, it only
shows up in sendfile(). So i truly don't understand your point. This
interface has gone through several iterations and was actually further
simplified.

	Ingo

ps1. "first they say it's impossible, then they ridicule you, then they
     oppose you, finally they say it's self-evident". Looks like, after
     many many years, zero-copy networking for Linux is now finally in
     phase III. :-)

ps2. i'm joking :-)


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 10:31           ` David S. Miller
@ 2001-01-09 11:28             ` Christoph Hellwig
  2001-01-09 12:04               ` Ingo Molnar
  2001-01-09 19:14               ` Linus Torvalds
  2001-01-09 11:42             ` David S. Miller
  1 sibling, 2 replies; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-09 11:28 UTC (permalink / raw)
  To: David S. Miller; +Cc: mingo, riel, netdev, linux-kernel

On Tue, Jan 09, 2001 at 02:31:13AM -0800, David S. Miller wrote:
>    Date: Tue, 9 Jan 2001 11:31:45 +0100
>    From: Christoph Hellwig <hch@caldera.de>
> 
>    Yuck.  A new file op just to get a few benchmarks right ...  I
>    hope the writepages stuff will not be merged into Linus' tree
>    (though I do wish the code behind it would be!)
> 
> It's a "I know how to send a page somewhere via this filedescriptor
> all by myself" operation.  I don't see why people need to take
> painkillers over this for 2.4.x.  I think f_op->write is stupid, such
> a special case file operation just to get a few benchmarks right.
> This is the kind of argument I am hearing.
> 
> Orthogonal to f_op->write being for specifying a low-level
> implementation of sys_write, f_op->writepage is for specifying a
> low-level implementation of sys_sendfile.  Can you grok that?

Sure.  But sendfile is not one of the fundamental UNIX operations...
If there were no alternative to this I would probably not have said
anything, but with the rw_kiovec file op just around the corner I don't
see any reason to add this _very_ specific file operation.

An alloc_kiovec before and a free_kiovec after the actual call
and the memory overhead of a kiobuf won't hurt so much that it stands
against a clean interface, IMHO.


> 
> Linus has already seen this.  Originally he had a gripe because in an
> older revision of the code used to allow multiple pages to be passed
> in an array to the writepage(s) operation.  He didn't like that, so I
> made it take only one page as he requested.  He had no other major
> objections to the infrastructure.

You get that multiple page call with kiobufs for free...

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 10:31           ` David S. Miller
  2001-01-09 11:28             ` Christoph Hellwig
@ 2001-01-09 11:42             ` David S. Miller
  1 sibling, 0 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-09 11:42 UTC (permalink / raw)
  To: hch; +Cc: mingo, riel, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 12:28:10 +0100
   From: Christoph Hellwig <hch@caldera.de>

   Sure.  But sendfile is not one of the fundamental UNIX operations...

It's a fundamental Linux interface and VFS-->networking interface.

   An alloc_kiovec before and a free_kiovec after the actual call
   and the memory overhead of a kiobuf won't hurt so much that it stands
   against a clean interface, IMHO.

This whole exercise is pointless unless it performs well.

The overhead _DOES_ matter; we've tested and profiled all of this with
full specweb99 runs, zerocopy ftp server loads, etc.  Removing one
word of information from anything involved in these code paths makes
an enormous difference.  Have you run such tests with your suggested
kiobuf scheme?

Know what I really hate?  People who are talking, "almost done", and
"designing" the "real solution" to a problem and have no code to show
for it.  I.e., a total working implementation.  Often they have not one
line of code to show.

Then the folks who actually get off their lazy asses and make
something real, which works, and in fact exceeded most of our personal
performance expectations, are the ones who are getting told that what
they did was crap.

What was the first thing out of people's mouths?  Not "nice work", but
"I think writepage is ugly and an eyesore, I hope nobody seriously
considers this code for inclusion."  Keep designing... like Linus
says, "show me the code".

Later,
David S. Miller
davem@redhat.com

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 11:28             ` Christoph Hellwig
@ 2001-01-09 12:04               ` Ingo Molnar
  2001-01-09 14:25                 ` Stephen C. Tweedie
  2001-01-09 21:13                 ` David S. Miller
  2001-01-09 19:14               ` Linus Torvalds
  1 sibling, 2 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 12:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: David S. Miller, riel, netdev, linux-kernel


On Tue, 9 Jan 2001, Christoph Hellwig wrote:

> Sure.  But sendfile is not one of the fundamental UNIX operations...

Neither were eg. kernel-based semaphores. So what? Unix wasn't perfect and
isn't perfect - but it was a (very) good starting point. If you are arguing
against the existence or importance of sendfile() you should re-think:
sendfile() is a unique (and important) interface because it enables moving
information between files (streams) without involving any interim
user-space memory buffer. No original Unix API did this AFAIK, so we
obviously had to add it. It's an important Linux API category.
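
(For the record, this is what the interface looks like from user space -
a minimal file-to-socket loop; a sketch assuming file_fd has been
fstat()ed into st and sock_fd is a connected TCP socket, error handling
omitted:)

	#include <sys/sendfile.h>

	off_t off = 0;

	while (off < st.st_size) {
		ssize_t n = sendfile(sock_fd, file_fd, &off,
				     st.st_size - off);
		if (n <= 0)
			break;
	}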

> If there were no alternative to this I would probably not have said
> anything, but with the rw_kiovec file op just around the corner I don't
> see any reason to add this _very_ specific file operation.

I do think that the kiovec code has to be rewritten substantially before
it can be used for networking zero-copy, so right now we do the least
damage if we do not increase the coverage of kiovec code.

> An alloc_kiovec before and a free_kiovec after the actual call and
> the memory overhead of a kiobuf won't hurt so much that it stands
> against a clean interface, IMHO.

please study the networking portions of the zerocopy patch and you'll see
why this is not desirable. An alloc_kiovec()/free_kiovec() is exactly the
thing we cannot afford in a sendfile() operation. sendfile() is
lightweight, the setup times of kiovecs are not.

basically the current kiovec design does not deal with the realities of
high-speed, featherweight networking. DO NOT talk in hypotheticals. The
code is there, do it, measure it. You might not care about performance, we
do.

another, more theoretical issue is that i think the kernel should not be
littered with multi-page interfaces, we should keep the one "struct page *
at a time" interfaces. Eg. check out how the new zerocopy code generates
perfect MTU sized frames via the ->writepage() interface. No interim
container objects are necessary.

	Ingo


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08 22:36     ` David S. Miller
@ 2001-01-09 12:12       ` Ingo Molnar
  0 siblings, 0 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 12:12 UTC (permalink / raw)
  To: David S. Miller; +Cc: jes, linux-kernel, netdev


On Mon, 8 Jan 2001, David S. Miller wrote:

>    All I am asking is that someone lets me know if they make major
>    changes to my code so I can keep track of what's happening.
>
> We have not made any major changes to your code, in light of this
> not being code which is actually being submitted yet.
>
> If it bothers you that publicly someone has published changes to your
> driver which you disagree with, oh well... :-)

i did tell Jes about our zerocopy work, months ago (and IIRC we even
exchanged emails about technical issues briefly). The changes were first
published in the TUX 1.0 source code last August, and subsequent cleanups
(more than 10 iterations) were published on Alexey's public FTP site:

	ftp://ftp.inr.ac.ru/ip-routing/

i think this whole issue got miscommunicated because Jes moved to Canada
exactly when we wrote the fragmented-API changes. I do believe Jes will
like most of our changes though, and i can surely tell that the elegant
and clean code of the Acenic driver made these changes so much easier.
Jes's Acenic driver was the first Linux networking driver in history to
support zero-copy TCP.

	Ingo


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08  1:24 ` David S. Miller
  2001-01-08 10:39   ` Christoph Hellwig
  2001-01-08 21:48   ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 David S. Miller
@ 2001-01-09 13:42   ` David S. Miller
  2001-01-09 21:19     ` David S. Miller
  2001-01-09 13:52   ` Trond Myklebust
  3 siblings, 1 reply; 119+ messages in thread
From: David S. Miller @ 2001-01-09 13:42 UTC (permalink / raw)
  To: trond.myklebust; +Cc: linux-kernel, netdev

   From: Trond Myklebust <trond.myklebust@fys.uio.no>
   Date: 09 Jan 2001 14:52:40 +0100

   I don't really want to be chiming in with another 'make it a kiobuf',
   but given that you already have written 'do_tcp_sendpages()' why did
   you make sock->ops->sendpage() take the single page as an argument
   rather than just have it take the 'struct page **'?

It was like that to begin with.  But to do it cleanly you have to pass
in not a vector of "pages" but a vector of "page+offset+len" triplets.

Linus hated it, and I understood why, so I reverted the API to be
single-page based.

   I would have thought one of the main interests of doing something
   like this would be to allow us to speed up large writes to the
   socket for ncpfs/knfsd/nfs/smbfs/...

This is what TCP_CORK/MSG_MORE et al. are all for: things get
coalesced perfectly.  Sending in a vector of pages seems nice, but
none of the page cache infrastructure works like this; all of the core
routines work on a page at a time.  It actually simplifies a lot.
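
(A minimal sketch of the pattern from user space - cork the socket,
write the small header, sendfile() the body, uncork; fd is assumed to be
a connected TCP socket and the other variables set up beforehand:)

	int on = 1, off = 0;

	setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
	write(fd, headers, hdr_len);		  /* small header write */
	sendfile(fd, file_fd, &offset, body_len); /* zerocopy body      */
	setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));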

The writepage interface optimizes large file writes to a socket just
fine.

Later,
David S. Miller
davem@redhat.com


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-08  1:24 ` David S. Miller
                     ` (2 preceding siblings ...)
  2001-01-09 13:42   ` David S. Miller
@ 2001-01-09 13:52   ` Trond Myklebust
  2001-01-09 15:27     ` Trond Myklebust
  3 siblings, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2001-01-09 13:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> " " == David S Miller <davem@redhat.com> writes:

     > I've put a patch up for testing on the kernel.org mirrors:

     > /pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz

.....

     > Finally, regardless of networking card, there should be a
     > measurable performance boost for NFS clients with this patch
     > due to the delayed fragment coalescing.  KNFSD does not take
     > full advantage of this facility yet.

Hi David,

I don't really want to be chiming in with another 'make it a kiobuf',
but given that you already have written 'do_tcp_sendpages()' why did
you make sock->ops->sendpage() take the single page as an argument
rather than just have it take the 'struct page **'?

I would have thought one of the main interests of doing something like
this would be to allow us to speed up large writes to the socket for
ncpfs/knfsd/nfs/smbfs/...
After all, in the case of both the client WRITE requests and the
server READ responses, we end up with a set of several pages that just
need to be pushed down the network without further ado. Unless I
misunderstood the code, it seems that do_tcp_sendpages() fits the bill
nicely...
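
(I.e., with the per-page op, a caller holding a page vector
pages[0..nr_pages-1] ends up doing something like the following - a
sketch, with the sendpage signature as I read it from the patch and
error handling omitted:)

	for (i = 0; i < nr_pages; i++) {
		int flags = (i < nr_pages - 1) ? MSG_MORE : 0;

		/* One call per page; MSG_MORE on all but the last so
		 * the stack can coalesce full-MTU frames. */
		sock->ops->sendpage(sock, pages[i], 0, PAGE_SIZE, flags);
	}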

Cheers,
  Trond

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 10:23         ` Ingo Molnar
  2001-01-09 10:31           ` David S. Miller
  2001-01-09 10:31           ` Christoph Hellwig
@ 2001-01-09 14:18           ` Stephen C. Tweedie
  2001-01-09 14:40             ` Ingo Molnar
  2001-01-10  2:56           ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) dean gaudet
  3 siblings, 1 reply; 119+ messages in thread
From: Stephen C. Tweedie @ 2001-01-09 14:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, David S. Miller, hch, netdev, linux-kernel,
	Stephen Tweedie

Hi,

On Tue, Jan 09, 2001 at 11:23:41AM +0100, Ingo Molnar wrote:
> 
> > Having proper kiobuf support would make it possible to, for example,
> > do zerocopy network->disk data transfers and lots of other things.
> 
> i used to think that this is useful, but these days it isn't. It's a waste
> of PCI bandwidth resources, and it's much cheaper to keep a cache in RAM
> instead of doing direct disk=>network DMA *every time* some resource is
> requested.

No.  I'm certain you're right when talking about things like web
serving, but it just doesn't apply when you look at some other
applications, such as streaming out video data or performing
fileserving in a high-performance compute cluster where you are
serving bulk data.  The multimedia and HPC worlds typically operate on
datasets which are far too large to cache, so you want to keep them in
memory as little as possible when you ship them over the wire.

Cheers,
 Stephen

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 12:04               ` Ingo Molnar
@ 2001-01-09 14:25                 ` Stephen C. Tweedie
  2001-01-09 14:33                   ` Alan Cox
  2001-01-09 15:00                   ` Ingo Molnar
  2001-01-09 21:13                 ` David S. Miller
  1 sibling, 2 replies; 119+ messages in thread
From: Stephen C. Tweedie @ 2001-01-09 14:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel,
	Stephen Tweedie

Hi,

On Tue, Jan 09, 2001 at 01:04:49PM +0100, Ingo Molnar wrote:
> 
> On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> 
> please study the networking portions of the zerocopy patch and you'll see
> why this is not desirable. An alloc_kiovec()/free_kiovec() is exactly the
> thing we cannot afford in a sendfile() operation. sendfile() is
> lightweight, the setup times of kiovecs are not.
> 
Right.  However, kiobufs can be kept around for as long as you want
and can be reused easily, and even if allocating and freeing them is
more work than you want, populating an existing kiobuf is _very_
cheap.

> another, more theoretical issue is that i think the kernel should not be
> littered with multi-page interfaces, we should keep the one "struct page *
> at a time" interfaces.

Bad bad bad.  We already have SCSI devices optimised for bandwidth
which don't approach decent performance until you are passing them 1MB
IOs, and even in networking the 1.5K packet limit kills us in some
cases and we need an interface capable of generating jumbograms.
Perhaps tcp can merge internal 4K requests, but if you're doing udp
jumbograms (or STP or VIA), you do need an interface which can give
the networking stack more than one page at once.
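
(From user space the requirement is easy to state: a single datagram
assembled from several pages in one call - a sketch with sendmsg(),
assuming udp_fd and the page_buf[] buffers are set up:)

	struct iovec iov[4];
	struct msghdr msg;
	int i;

	memset(&msg, 0, sizeof(msg));
	for (i = 0; i < 4; i++) {
		iov[i].iov_base = page_buf[i];	/* four page-sized pieces */
		iov[i].iov_len  = 4096;
	}
	msg.msg_iov = iov;
	msg.msg_iovlen = 4;
	sendmsg(udp_fd, &msg, 0);	/* one 16K jumbogram */

The kernel-side interface has to be handed all of those pages at once.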

--Stephen

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 14:25                 ` Stephen C. Tweedie
@ 2001-01-09 14:33                   ` Alan Cox
  2001-01-09 15:00                   ` Ingo Molnar
  1 sibling, 0 replies; 119+ messages in thread
From: Alan Cox @ 2001-01-09 14:33 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Christoph Hellwig, David S. Miller, riel, netdev,
	linux-kernel, Stephen Tweedie

> Bad bad bad.  We already have SCSI devices optimised for bandwidth
> which don't approach decent performance until you are passing them 1MB
> IOs, and even in networking the 1.5K packet limit kills us in some

Even low-end cheap RAID cards like the AMI MegaRAID dearly want 128K writes.
It's quite a difference on them.


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 14:18           ` Stephen C. Tweedie
@ 2001-01-09 14:40             ` Ingo Molnar
  2001-01-09 14:51               ` Alan Cox
                                 ` (3 more replies)
  0 siblings, 4 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 14:40 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Rik van Riel, David S. Miller, hch, netdev, linux-kernel


On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:

> > i used to think that this is useful, but these days it isn't. It's a waste
> > of PCI bandwidth resources, and it's much cheaper to keep a cache in RAM
> > instead of doing direct disk=>network DMA *every time* some resource is
> > requested.
>
> No.  I'm certain you're right when talking about things like web
> serving, [...]

yep, i was concentrating on fileserving load.

> but it just doesn't apply when you look at some other applications,
> such as streaming out video data or performing fileserving in a
> high-performance compute cluster where you are serving bulk data.
> The multimedia and HPC worlds typically operate on datasets which are
> far too large to cache, so you want to keep them in memory as little
> as possible when you ship them over the wire.

i'd love to first see these kinds of applications (under Linux) before
designing for them. Eg. if an IO operation (eg. streaming video webcast)
does a DMA from a camera card to an outgoing networking card, would it be
possible to access the packet data in case of a TCP retransmit? Basically
these applications are limited enough in scope to justify even temporary
'hacks' that enable them - and once we *see* things in action, we could
design for them. Not the other way around.

	Ingo


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 14:40             ` Ingo Molnar
@ 2001-01-09 14:51               ` Alan Cox
  2001-01-09 15:17               ` Stephen C. Tweedie
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 119+ messages in thread
From: Alan Cox @ 2001-01-09 14:51 UTC (permalink / raw)
  To: mingo
  Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev,
	linux-kernel

> designing for them. Eg. if an IO operation (eg. streaming video webcast)
> does a DMA from a camera card to an outgoing networking card, would it be

Most mpeg2 hardware isn't set up for that kind of use.  And webcast protocols
like h.263 tend to be software-implemented.

Capturing raw video for pre-processing is similar.  Right now that's best
done with mmap() on the ring buffer and O_DIRECT I/O, it seems.

Alan


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 14:25                 ` Stephen C. Tweedie
  2001-01-09 14:33                   ` Alan Cox
@ 2001-01-09 15:00                   ` Ingo Molnar
  2001-01-09 15:27                     ` Stephen C. Tweedie
  2001-01-09 15:38                     ` Benjamin C.R. LaHaise
  1 sibling, 2 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 15:00 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel


On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:

> > please study the networking portions of the zerocopy patch and you'll see
> > why this is not desirable. An alloc_kiovec()/free_kiovec() is exactly the
> > thing we cannot afford in a sendfile() operation. sendfile() is
> > lightweight, the setup times of kiovecs are not.
> >
> Right.  However, kiobufs can be kept around for as long as you want
> and can be reused easily, and even if allocating and freeing them is
> more work than you want, populating an existing kiobuf is _very_
> cheap.

we do have SLAB [which essentially caches structures, on a per-CPU basis]
which i did take into account, but still, initializing a 600+ byte kiovec
is probably more work than the rest of sending a packet! I mean i'd love
to eliminate the 200+ byte skb initialization as well; it shows up.
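
(the SLAB pattern in question - a 2.4-era sketch with a hypothetical
my_obj; allocation itself is cached and cheap, it's the per-object
initialization that costs:)

	static kmem_cache_t *my_cache;
	struct my_obj *obj;

	my_cache = kmem_cache_create("my_objs", sizeof(struct my_obj),
				     0, SLAB_HWCACHE_ALIGN, NULL, NULL);

	obj = kmem_cache_alloc(my_cache, GFP_ATOMIC);
	/* ... the object still has to be initialized field by field
	 * here, which is the cost SLAB cannot remove ... */
	kmem_cache_free(my_cache, obj);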

> > another, more theoretical issue is that i think the kernel should not be
> > littered with multi-page interfaces, we should keep the one "struct page *
> > at a time" interfaces.
>
> Bad bad bad.  We already have SCSI devices optimised for bandwidth
> which don't approach decent performance until you are passing them 1MB
> IOs, [...]

The fact that we're using single-page interfaces doesn't preclude us from
having nicely clustered requests; this is what IO-plugging is about!

> and even in networking the 1.5K packet limit kills us in some cases
> and we need an interface capable of generating jumbograms.

which cases?

> Perhaps tcp can merge internal 4K requests, [...]

yes, because depending on the application to send properly sized requests
is a futile act IMO. So we do have intelligent buffering and clustering in
basically every kernel subsystem - and we'll continue to have it because
we have no choice - most of Linux's user-visible IO APIs have byte
granularity (which is good btw.). Adding a multi-page interface will IMO
mostly just complicate the design and the implementation. Do you have
empirical (or theoretical) proof which shows that single-page interfaces
cannot perform well?

> but if you're doing udp jumbograms (or STP or VIA), you do need an
> interface which can give the networking stack more than one page at
> once.

nothing prevents the introduction of specialized interfaces - if they feel
like they can get enough traction. I was talking about the normal Linux IO
APIs, read()/write()/sendfile(), which have byte granularity and invoke an
almost mandatory buffering/clustering mechanism in every kernel subsystem
they deal with.

	Ingo


* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 14:40             ` Ingo Molnar
  2001-01-09 14:51               ` Alan Cox
@ 2001-01-09 15:17               ` Stephen C. Tweedie
  2001-01-09 15:37                 ` Ingo Molnar
  2001-01-09 22:25                 ` Linus Torvalds
  2001-01-09 15:25               ` Stephen Frost
  2001-01-09 21:18               ` David S. Miller
  3 siblings, 2 replies; 119+ messages in thread
From: Stephen C. Tweedie @ 2001-01-09 15:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev,
	linux-kernel

Hi,

On Tue, Jan 09, 2001 at 03:40:56PM +0100, Ingo Molnar wrote:
> 
> i'd love to first see these kinds of applications (under Linux) before
> designing for them.

Things like Beowulf have been around for a while now, and SGI have
been doing that sort of multimedia stuff for ages.  I don't think that
there's any doubt that there's a demand for this.
 
> Eg. if an IO operation (eg. streaming video webcast)
> does a DMA from a camera card to an outgoing networking card, would it be
> possible to access the packet data in case of a TCP retransmit? 

I'm not thinking about pci-to-pci as much as pci-to-memory-to-pci
with no memory-to-memory copies.  That's no different to writepage:
doing a zero-copy writepage on a page cache page still gives you the
problem of maintaining retransmit semantics if a user mmaps the file
or writes to it after your initial transmit.

And if you want other examples, we have applications such as Oracle
that want to do raw disk IO in chunks of at least 128K.  Going through
a page-by-page interface for large IOs is almost as bad as the
existing buffer_head-by-buffer_head interface, and we have already
demonstrated that to be a bottleneck in the block device layer.

Jes has also got hard numbers for the performance advantages of
jumbograms on some of the networks he's been using, and you ain't
going to get udp jumbograms through a page-by-page API, ever.

Cheers,
 Stephen

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 14:40             ` Ingo Molnar
  2001-01-09 14:51               ` Alan Cox
  2001-01-09 15:17               ` Stephen C. Tweedie
@ 2001-01-09 15:25               ` Stephen Frost
  2001-01-09 15:40                 ` Ingo Molnar
  2001-01-09 21:18               ` David S. Miller
  3 siblings, 1 reply; 119+ messages in thread
From: Stephen Frost @ 2001-01-09 15:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev,
	linux-kernel

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> 
> > but it just doesn't apply when you look at some other applications,
> > such as streaming out video data or performing fileserving in a
> > high-performance compute cluster where you are serving bulk data.
> > The multimedia and HPC worlds typically operate on datasets which are
> > far too large to cache, so you want to keep them in memory as little
> > as possible when you ship them over the wire.
> 
> i'd love to first see these kinds of applications (under Linux) before
> designing for them. Eg. if an IO operation (eg. streaming video webcast)
> does a DMA from a camera card to an outgoing networking card, would it be
> possible to access the packet data in case of a TCP retransmit? Basically
> these applications are limited enough in scope to justify even temporary
> 'hacks' that enable them - and once we *see* things in action, we could
> design for them. Not the other way around.

	Well, I know I for one use a system that you might have heard
of called 'MOSIX'.  It's a (kinda large) kernel patch with some user-space
tools, but it allows for migration of processes between machines without
modifying any code.  There are some limitations (threaded applications and
shared memory and whatnot) but it works very well for the rendering work
we use it for.  We use radiance which in general has pretty little inter-
process communication and what it has is done through the filesystem.

	Now, the interesting bit here is that the processes can grow to be
pretty large (200M+, up as high as 500M, higher if we let it ;) ) and what
happens with MOSIX is that entire processes get sent over the wire to 
other machines for work.  MOSIX will also attempt to rebalance the load on
all of the machines in the cluster and whatnot so it can often be moving
processes back and forth.

	So, anyhow, this is just an FYI if you weren't aware of it: I
believe more than a few people are using MOSIX these days for similar
applications, and it's available at http://www.mosix.org if you're
curious.

		Stephen

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:00                   ` Ingo Molnar
@ 2001-01-09 15:27                     ` Stephen C. Tweedie
  2001-01-09 16:16                       ` Ingo Molnar
  2001-01-09 15:38                     ` Benjamin C.R. LaHaise
  1 sibling, 1 reply; 119+ messages in thread
From: Stephen C. Tweedie @ 2001-01-09 15:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel,
	netdev, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 04:00:34PM +0100, Ingo Molnar wrote:
> 
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> 
> we do have SLAB [which essentially caches structures, on a per-CPU basis]
> which i did take into account, but still, initializing a 600+ byte kiovec
> is probably more work than the rest of sending a packet! I mean i'd love
> to eliminate the 200+ bytes skb initialization as well, it shows up.

Reusing a kiobuf for a request involves setting up the length, offset
and maybe errno fields, and writing the struct page *'s into the
maplist[].  Nothing more.
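
In 2.4 terms, that repopulation looks roughly like the sketch below (the
helper and its callers are hypothetical; only the field names - offset,
length, errno, nr_pages, maplist[] - come from the actual struct kiobuf):

#include <linux/iobuf.h>

/* Reuse an already-allocated kiobuf for a new request.  Assumes the
 * kiobuf was allocated with room for at least nr_pages entries. */
static void kiobuf_refill(struct kiobuf *iobuf, struct page **pages,
                          int nr_pages, int offset, int length)
{
        int i;

        iobuf->offset = offset;         /* start of valid data in page 0 */
        iobuf->length = length;         /* total bytes in this request */
        iobuf->errno  = 0;              /* clear stale completion status */

        for (i = 0; i < nr_pages; i++)
                iobuf->maplist[i] = pages[i];
        iobuf->nr_pages = nr_pages;
}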

> > Bad bad bad.  We already have SCSI devices optimised for bandwidth
> > which don't approach decent performance until you are passing them 1MB
> > IOs, [...]
> 
> The fact that we're using single-page interfaces doesn't preclude us from
> having nicely clustered requests, this is what IO-plugging is about!

We've already got measurements showing how insane this is.  Raw IO
requests, plus internal pagebuf contiguous requests from XFS, have to
get broken down into page-sized chunks by the current ll_rw_block()
API, only to get reassembled by the make_request code.  It's
*enormous* overhead, and the kiobuf-based disk IO code demonstrates
this clearly.  

We have already shown that the IO-plugging API sucks, I'm afraid.

> > and even in networking the 1.5K packet limit kills us in some cases
> > and we need an interface capable of generating jumbograms.
> 
> which cases?

Gig Ethernet, HIPPI...  It's not so bad with an intelligent
controller, admittedly.

> > but if you're doing udp jumbograms (or STP or VIA), you do need an
> > interface which can give the networking stack more than one page at
> > once.
> 
> nothing prevents the introduction of specialized interfaces - if they feel
> like they can get enough traction.

So you mean we'll introduce two separate APIs for general zero-copy,
just to get around the problems in the single-page-based one?

> I was talking about the normal Linux IO
> APIs, read()/write()/sendfile(), which are byte granularity and invoke an
> almost mandatory buffering/clustering mechanism in every kernel subsystem
> they deal with.

Only tcp and ll_rw_block.  ll_rw_block has already been fixed in the
SGI patches, and gets _much_ better performance as a result.  udp
doesn't do any such clustering.  That leaves tcp.

The presence of terrible performance in the old ll_rw_block code is
NOT a good excuse for perpetuating that model.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 13:52   ` Trond Myklebust
@ 2001-01-09 15:27     ` Trond Myklebust
  2001-01-10  9:21       ` Trond Myklebust
  0 siblings, 1 reply; 119+ messages in thread
From: Trond Myklebust @ 2001-01-09 15:27 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> David S Miller <davem@redhat.com> writes:

     >    I would have thought one of the main interests of doing
     >    something like this would be to allow us to speed up large
     >    writes to the socket for ncpfs/knfsd/nfs/smbfs/...

     > This is what TCP_CORK/MSG_MORE et al. are all for, things get
     > coalesced perfectly.  Sending in a vector of pages seems nice,
     > but none of the page cache infrastructure works like this, all
     > of the core routines work on a page at a time.  It actually
     > simplifies a lot.

     > The writepage interface optimizes large file writes to a socket
     > just fine.

OK, but can you eventually generalize it to non-stream protocols
(i.e. UDP)?
After all, it doesn't make sense to differentiate between zero-copy on
stream and non-stream sockets, and Linux NFS, at least, remains
heavily UDP-oriented...

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:17               ` Stephen C. Tweedie
@ 2001-01-09 15:37                 ` Ingo Molnar
  2001-01-09 22:25                 ` Linus Torvalds
  1 sibling, 0 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 15:37 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Rik van Riel, David S. Miller, hch, netdev, linux-kernel


On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:

> Jes has also got hard numbers for the performance advantages of
> jumbograms on some of the networks he's been using, and you ain't
> going to get udp jumbograms through a page-by-page API, ever.

i know the performance advantages of jumbograms (typically when it's over
a local network), it's undisputed. Still i don't see why it should be
impossible to do effective UDP via a single-page interface. Eg. buffering
of outgoing pages could be supported, and MSG_MORE in sendmsg() used to
indicate whether more data follows (its absence marking end of stream).
This is why ->writepage() has a 'more' flag (and tcp_sendpage() has a
flag as well).
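
as a rough user-space illustration of that scheme (assuming a connected
UDP socket, an arbitrary chunk size, and the MSG_MORE buffering semantics
described above):

#include <sys/types.h>
#include <sys/socket.h>

ssize_t send_jumbogram(int fd, const char *buf, size_t len, size_t chunk)
{
        size_t off = 0;

        while (off < len) {
                size_t n = (len - off > chunk) ? chunk : len - off;
                /* set MSG_MORE on every chunk except the last; the
                 * final send() flushes the whole datagram in one go */
                int flags = (off + n < len) ? MSG_MORE : 0;
                ssize_t r = send(fd, buf + off, n, flags);

                if (r < 0)
                        return -1;
                off += r;
        }
        return off;
}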

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:00                   ` Ingo Molnar
  2001-01-09 15:27                     ` Stephen C. Tweedie
@ 2001-01-09 15:38                     ` Benjamin C.R. LaHaise
  2001-01-09 16:40                       ` Ingo Molnar
  2001-01-09 17:53                       ` Christoph Hellwig
  1 sibling, 2 replies; 119+ messages in thread
From: Benjamin C.R. LaHaise @ 2001-01-09 15:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel,
	netdev, linux-kernel

On Tue, 9 Jan 2001, Ingo Molnar wrote:

> 
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> 
> > > please study the networking portions of the zerocopy patch and you'll see
> > > why this is not desirable. An alloc_kiovec()/free_kiovec() is exactly the
> > > thing we cannot afford in a sendfile() operation. sendfile() is
> > > lightweight, the setup times of kiovecs are not.
> > >
> > Right.  However, kiobufs can be kept around for as long as you want
> > and can be reused easily, and even if allocating and freeing them is
> > more work than you want, populating an existing kiobuf is _very_
> > cheap.
> 
> we do have SLAB [which essentially caches structures, on a per-CPU basis]
> which i did take into account, but still, initializing a 600+ byte kiovec
> is probably more work than the rest of sending a packet! I mean i'd love
> to eliminate the 200+ bytes skb initialization as well, it shows up.

Do the math again: for transmitting a single page in a kiobuf only 64
bytes need to be initialized.  If map_array is moved to the end of the
structure, that's all contiguous data and is a single cacheline.

What you're completely ignoring is that sendpages is lacking a huge amount
of functionality that is *needed*.  I can't implement clean async io on
top of sendpages -- it'll require keeping 1 task around per outstanding
io, which is exactly the bottleneck we're trying to work around.

> The fact that we're using single-page interfaces doesn't preclude us from
> having nicely clustered requests, this is what IO-plugging is about!

It does waste a significant amount of CPU cycles trying to reassemble io
requests and is not deterministic.  Unplugging the io queue is a real pain
with async io.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:25               ` Stephen Frost
@ 2001-01-09 15:40                 ` Ingo Molnar
  2001-01-09 15:48                   ` Stephen Frost
  2001-01-10  1:14                   ` Dave Zarzycki
  0 siblings, 2 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 15:40 UTC (permalink / raw)
  To: Stephen Frost
  Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev,
	linux-kernel


On Tue, 9 Jan 2001, Stephen Frost wrote:

> 	Now, the interesting bit here is that the processes can grow to be
> pretty large (200M+, up as high as 500M, higher if we let it ;) ) and what
> happens with MOSIX is that entire processes get sent over the wire to
> other machines for work.  MOSIX will also attempt to rebalance the load on
> all of the machines in the cluster and whatnot so it can often be moving
> processes back and forth.

then you'll love the zerocopy patch :-) Just use sendfile() or specify
MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
DMA-and-checksumming on cards that support it.
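
for reference, the sendfile() half of that looks roughly like this sketch
(error handling abbreviated; MSG_NOCOPY itself is specific to the zerocopy
patch and not shown here):

#include <sys/types.h>
#include <sys/sendfile.h>

int stream_file(int sock, int fd, off_t size)
{
        off_t off = 0;

        while (off < size) {
                /* sendfile() advances off by the number of bytes sent */
                ssize_t n = sendfile(sock, fd, &off, size - off);

                if (n <= 0)
                        return -1;
        }
        return 0;
}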

the discussion with Stephen is about various device-to-device schemes.
(which i don't think Mosix wants to use. Mosix wants to use memory-to-device
zero-copy, right?)

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:40                 ` Ingo Molnar
@ 2001-01-09 15:48                   ` Stephen Frost
  2001-01-10  1:14                   ` Dave Zarzycki
  1 sibling, 0 replies; 119+ messages in thread
From: Stephen Frost @ 2001-01-09 15:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Rik van Riel, David S. Miller, hch, netdev,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> On Tue, 9 Jan 2001, Stephen Frost wrote:
> 
> > 	Now, the interesting bit here is that the processes can grow to be
> > pretty large (200M+, up as high as 500M, higher if we let it ;) ) and what
> > happens with MOSIX is that entire processes get sent over the wire to
> > other machines for work.  MOSIX will also attempt to rebalance the load on
> > all of the machines in the cluster and whatnot so it can often be moving
> > processes back and forth.
> 
> then you'll love the zerocopy patch :-) Just use sendfile() or specify
> MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
> DMA-and-checksumming on cards that support it.

	Excellent, this patch certainly sounds interesting, which is why
I've been following this discussion.  Once the MOSIX patch for 2.4 comes
out, I think I'm going to tinker with this and see if I can get MOSIX to
use these methods.

> the discussion with Stephen is about various device-to-device schemes.
> (which i don't think Mosix wants to use. Mosix wants to use memory-to-device
> zero-copy, right?)

	Yes, very much so, actually, now that I think about it.  A lot of
memory->device and device->memory work is going on.  I was mainly replying
to the idea of clustering since that's what MOSIX is all about.


		Stephen

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:27                     ` Stephen C. Tweedie
@ 2001-01-09 16:16                       ` Ingo Molnar
  2001-01-09 16:37                         ` Alan Cox
  2001-01-09 18:10                         ` Stephen C. Tweedie
  0 siblings, 2 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 16:16 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Christoph Hellwig, David S. Miller, riel, netdev, linux-kernel


On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:

> > we do have SLAB [which essentially caches structures, on a per-CPU basis]
> > which i did take into account, but still, initializing a 600+ byte kiovec
> > is probably more work than the rest of sending a packet! I mean i'd love
> > to eliminate the 200+ bytes skb initialization as well, it shows up.
>
> Reusing a kiobuf for a request involves setting up the length, offset
> and maybe errno fields, and writing the struct page *'s into the
> maplist[].  Nothing more.

i'm talking about kiovecs not kiobufs (because those are equivalent to a
fragmented packet - every packet fragment can be anywhere). Initializing a
kiovec involves touching a dozen cachelines. Keeping structures compressed
is very important.

i don't know. I don't think it's necessarily bad for a subsystem to have its
own 'native structure' for how it manages data.

> We've already got measurements showing how insane this is.  Raw IO
> requests, plus internal pagebuf contiguous requests from XFS, have to
> get broken down into page-sized chunks by the current ll_rw_block()
> API, only to get reassembled by the make_request code.  It's
> *enormous* overhead, and the kiobuf-based disk IO code demonstrates
> this clearly.

i do believe that you are wrong here. We did have a multi-page API between
sendfile and the TCP layer initially, and it made *absolutely no
performance difference*. But it was more complex, and harder to fix. And
we had to keep intelligent buffering/clustering/merging in any case,
because some native Linux interfaces such as write() and read() have byte
granularity.

so unless there is some fundamental difference between the two approaches,
i don't buy this argument. I'm not necessarily saying that your measurements
are wrong; i'm saying that the performance analysis is wrong.

> We have already shown that the IO-plugging API sucks, I'm afraid.

it might not be important to others, but we do hold one particular
SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full
fileset of ~9 GB. It generates insane block-IO load, and we do beat other
OSs that have multipage support, including SGI. (and no, it's not due to
kernel-space acceleration alone this time - it's mostly due to very good
block-IO performance.) We use Jens Axboe's IO-batching fixes that
dramatically improve the block scheduler's performance under high load.

> > > and even in networking the 1.5K packet limit kills us in some cases
> > > and we need an interface capable of generating jumbograms.
> >
> > which cases?
>
> Gig Ethernet, [...]

we handle gigabit ethernet with 1.5K zero-copy packets just fine. One
thing people forget is IRQ throttling: when switching from 1500 byte
packets to 9000 byte packets, the number of interrupts drops by a
factor of 6. Now if the tuning of a driver is not changed accordingly,
1500 byte MTU can show dramatically lower performance than 9000 byte MTU.
But if tuned properly, i see little difference between 1500 byte and 9000
byte MTU. (when using a good protocol such as TCP.)

> > nothing prevents the introduction of specialized interfaces - if they feel
> > like they can get enough traction.
>
> So you mean we'll introduce two separate APIs for general zero-copy,
> just to get around the problems in the single-page-based one?

no. But i think that none of the mainstream protocols or APIs mandate a
multi-page interface - i do think that the performance problems mentioned
were mis-analyzed. I'd call the multi-page API thing an urban legend.
Nobody in their right mind can claim that a series of function calls shows
any difference in *block IO* performance, compared to a multi-page API
(which has an additional vector-setup cost). Only functional differences
can explain any measured performance difference - and for those
merging/clustering bugs, multipage support is only a workaround.

> > I was talking about the normal Linux IO
> > APIs, read()/write()/sendfile(), which are byte granularity and invoke an
> almost mandatory buffering/clustering mechanism in every kernel subsystem
> > they deal with.
>
> Only tcp and ll_rw_block.  ll_rw_block has already been fixed in the
> SGI patches, and gets _much_ better performance as a result. [...]

as mentioned above, i think this is not due to going multipage.

> The presence of terrible performance in the old ll_rw_block code is
> NOT a good excuse for perpetuating that model.

i'd like to measure this performance problem (because i'd like to
double-check it) - what measurement method was used?

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 16:16                       ` Ingo Molnar
@ 2001-01-09 16:37                         ` Alan Cox
  2001-01-09 16:48                           ` Ingo Molnar
  2001-01-09 19:20                           ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 J Sloan
  2001-01-09 18:10                         ` Stephen C. Tweedie
  1 sibling, 2 replies; 119+ messages in thread
From: Alan Cox @ 2001-01-09 16:37 UTC (permalink / raw)
  To: mingo
  Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel,
	netdev, linux-kernel

> > We have already shown that the IO-plugging API sucks, I'm afraid.
> 
> it might not be important to others, but we do hold one particular
> SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full

And its real world value is exactly the same as the Mindcraft NT values. Don't
forget that.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:38                     ` Benjamin C.R. LaHaise
@ 2001-01-09 16:40                       ` Ingo Molnar
  2001-01-09 17:30                         ` Benjamin C.R. LaHaise
  2001-01-09 17:53                       ` Christoph Hellwig
  1 sibling, 1 reply; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 16:40 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise
  Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel,
	netdev, linux-kernel


On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote:

> Do the math again: for transmitting a single page in a kiobuf only 64
> bytes need to be initialized.  If map_array is moved to the end of
> the structure, that's all contiguous data and is a single cacheline.

but you are comparing apples to oranges: an iobuf to a fragment-array. A
fragment-array is equivalent to an array of iobufs. In typical (eg. HTTP)
output we have mixed sendfile() and sendmsg() based output, so we have an
array of (page, offset, size) memory-areas, not a (initial_offset, page[])
array like kiobufs. The closest thing would be an array of kiobufs (where
each kiobuf would use a single page only).

this is why i meant that *right now* kiobufs are not suited for networking,
at least the way we do it. Maybe if kiobufs had the same kind of internal
structure as sk_frag (ie. array of (page,offset,size) triples, not array
of pages), that would work out better.

> What you're completely ignoring is that sendpages is lacking a huge
> amount of functionality that is *needed*. I can't implement clean
> async io on top of sendpages -- it'll require keeping 1 task around
> per outstanding io, which is exactly the bottleneck we're trying to
> work around.

Please take a look at the next release of TUX. Probably the last missing piece
was that i added O_NONBLOCK to generic_file_read() && sendfile(), so not
fully cached requests can be offloaded to IO threads.

Otherwise the current lowlevel filesystem infrastructure is not suited for
implementing "process-less async IO "- and kiovecs wont be able to help
that either. Unless we implement async, IRQ-driven bmap(), we'll always
need some sort of process context to set up IO.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 16:37                         ` Alan Cox
@ 2001-01-09 16:48                           ` Ingo Molnar
  2001-01-09 17:29                             ` Alan Cox
  2001-01-09 17:56                             ` Chris Evans
  2001-01-09 19:20                           ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 J Sloan
  1 sibling, 2 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 16:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel,
	netdev, linux-kernel


On Tue, 9 Jan 2001, Alan Cox wrote:

> > > We have already shown that the IO-plugging API sucks, I'm afraid.
> >
> > it might not be important to others, but we do hold one particular
> > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full
>
> And its real world value is exactly the same as the Mindcraft NT
> values. Don't forget that.

( what you have not quoted is the part that says that the fileset is 9GB.
This is one of the busiest and most complex block-IO Linux systems i've
ever seen, this is why i quoted it - the talk was about block-IO
performance, and Stephen said that our block IO sucks. It used to suck,
but in 2.4, with the right patch from Jens, it doesn't suck anymore. )

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 16:48                           ` Ingo Molnar
@ 2001-01-09 17:29                             ` Alan Cox
  2001-01-09 17:38                               ` Jens Axboe
  2001-01-09 17:56                             ` Chris Evans
  1 sibling, 1 reply; 119+ messages in thread
From: Alan Cox @ 2001-01-09 17:29 UTC (permalink / raw)
  To: mingo
  Cc: Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller,
	riel, netdev, linux-kernel

> ever seen, this is why i quoted it - the talk was about block-IO
> performance, and Stephen said that our block IO sucks. It used to suck,
> but in 2.4, with the right patch from Jens, it doesnt suck anymore. )

That's fine. Get me 128K-512K chunks nicely streaming into my raid controller
and I'll be a happy man.

I don't have a problem with the claim that it's not the per-page stuff and
plugging that breaks ll_rw_blk. If there is evidence contradicting the SGI
stuff, it's very interesting.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 16:40                       ` Ingo Molnar
@ 2001-01-09 17:30                         ` Benjamin C.R. LaHaise
  2001-01-09 18:12                           ` Stephen C. Tweedie
  2001-01-09 18:35                           ` Ingo Molnar
  0 siblings, 2 replies; 119+ messages in thread
From: Benjamin C.R. LaHaise @ 2001-01-09 17:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel,
	netdev, linux-kernel

On Tue, 9 Jan 2001, Ingo Molnar wrote:

> this is why i meant that *right now* kiobufs are not suited for networking,
> at least the way we do it. Maybe if kiobufs had the same kind of internal
> structure as sk_frag (ie. array of (page,offset,size) triples, not array
> of pages), that would work out better.

That I can agree with, and it would make my life easier since I really
only care about the completion of an entire io, not the individual
fragments of it.

> Please take a look at the next release of TUX. Probably the last missing piece
> was that i added O_NONBLOCK to generic_file_read() && sendfile(), so not
> fully cached requests can be offloaded to IO threads.
> 
> Otherwise the current lowlevel filesystem infrastructure is not suited for
> implementing "process-less async IO" - and kiovecs won't be able to help
> that either. Unless we implement async, IRQ-driven bmap(), we'll always
> need some sort of process context to set up IO.

I've already got fully async read and write working via a helper thread
for doing the bmaps when the page is not uptodate in the page cache.  The
primitives for async locking of pages and waiting on events exist, such that
converting ext2 to performing full async bmap should be trivial.  Note
that O_NONBLOCK is not good enough because you can't implement an
asynchronous O_SYNC write with it.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 17:29                             ` Alan Cox
@ 2001-01-09 17:38                               ` Jens Axboe
  2001-01-09 18:38                                 ` Ingo Molnar
  0 siblings, 1 reply; 119+ messages in thread
From: Jens Axboe @ 2001-01-09 17:38 UTC (permalink / raw)
  To: Alan Cox
  Cc: mingo, Stephen C. Tweedie, Christoph Hellwig, David S. Miller,
	riel, netdev, linux-kernel

On Tue, Jan 09 2001, Alan Cox wrote:
> > ever seen, this is why i quoted it - the talk was about block-IO
> > performance, and Stephen said that our block IO sucks. It used to suck,
> > but in 2.4, with the right patch from Jens, it doesn't suck anymore. )
> 
> That's fine. Get me 128K-512K chunks nicely streaming into my raid controller
> and I'll be a happy man.

No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:38                     ` Benjamin C.R. LaHaise
  2001-01-09 16:40                       ` Ingo Molnar
@ 2001-01-09 17:53                       ` Christoph Hellwig
  1 sibling, 0 replies; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-09 17:53 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, David S. Miller, riel, netdev,
	linux-kernel

On Tue, Jan 09, 2001 at 10:38:30AM -0500, Benjamin C.R. LaHaise wrote:
> What you're completely ignoring is that sendpages is lacking a huge amount
> of functionality that is *needed*.  I can't implement clean async io on
> top of sendpages -- it'll require keeping 1 task around per outstanding
> io, which is exactly the bottleneck we're trying to work around.

Yep.  That's why I proposed to use rw_kiovec.  Currently Alexey seems
to have his own hack for socket-only async IO with some COW semantics
for the userlevel buffers, but I would much prefer a generic version...

	Christoph

P.S. Any chance to find a new version of your aio-patch somewhere?
-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 16:48                           ` Ingo Molnar
  2001-01-09 17:29                             ` Alan Cox
@ 2001-01-09 17:56                             ` Chris Evans
  2001-01-09 18:41                               ` Ingo Molnar
  1 sibling, 1 reply; 119+ messages in thread
From: Chris Evans @ 2001-01-09 17:56 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel


On Tue, 9 Jan 2001, Ingo Molnar wrote:

> This is one of the busiest and most complex block-IO Linux systems i've
> ever seen, this is why i quoted it - the talk was about block-IO
> performance, and Stephen said that our block IO sucks. It used to suck,
> but in 2.4, with the right patch from Jens, it doesn't suck anymore. )

Is this "right patch from Jens" on the radar for 2.4 inclusion?

Cheers
Chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 16:16                       ` Ingo Molnar
  2001-01-09 16:37                         ` Alan Cox
@ 2001-01-09 18:10                         ` Stephen C. Tweedie
  1 sibling, 0 replies; 119+ messages in thread
From: Stephen C. Tweedie @ 2001-01-09 18:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel,
	netdev, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 05:16:40PM +0100, Ingo Molnar wrote:
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> 
> i'm talking about kiovecs not kiobufs (because those are equivalent to a
> fragmented packet - every packet fragment can be anywhere). Initializing a
> kiovec involves touching a dozen cachelines. Keeping structures compressed
> is very important.
> 
> i don't know. I don't think it's necessarily bad for a subsystem to have its
> own 'native structure' for how it manages data.

For the transmit case, unless the sender needs seriously fragmented
data, the kiovec is just a kiobuf*.

> i do believe that you are wrong here. We did have a multi-page API between
> sendfile and the TCP layer initially, and it made *absolutely no
> performance difference*.

That may be fine for tcp, but tcp explicitly maintains the state of
the caller and can stream things sequentially to a specific file
descriptor.

The block device layer, on the other hand, has to accept requests _in
any order_ and still reorder them to the optimal elevator order.  The
merging in ll_rw_block is _far_ more expensive than adding a request
to the end of a list.  It's not helped by the fact that each such
request has a buffer_head and a struct request associated with it, so
deconstructing the large IO into buffer_heads results in huge amounts
of data being allocated and deleted.

We could streamline this greatly if the block device layer kept
per-caller context in the way that tcp does, but the block device API
just doesn't work that way.

> > We have already shown that the IO-plugging API sucks, I'm afraid.
> 
> it might not be important to others, but we do hold one particular
> SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full
> fileset of ~9 GB. It generates insane block-IO load, and we do beat other
> OSs that have multipage support, including SGI. (and no, it's not due to
> kernel-space acceleration alone this time - it's mostly due to very good
> block-IO performance.) We use Jens Axboe's IO-batching fixes that
> dramatically improve the block scheduler's performance under high load.

Perhaps, but we have proven, significant reductions in CPU
utilisation from eliminating the per-buffer_head API to the block
layer.  Next time M$ gets close to our specweb records, maybe this is
the next place to look for those extra few % points!

> > Gig Ethernet, [...]
> 
> we handle gigabit ethernet with 1.5K zero-copy packets just fine. One
> thing people forget is IRQ throttling: when switching from 1500 byte
> packets to 9000 byte packets, the number of interrupts drops by a
> factor of 6. Now if the tuning of a driver is not changed accordingly,
> 1500 byte MTU can show dramatically lower performance than 9000 byte MTU.
> But if tuned properly, i see little difference between 1500 byte and 9000
> byte MTU. (when using a good protocol such as TCP.)

Maybe you see good throughput numbers, but I still bet the CPU
utilisation could be bettered significantly with jumbograms.

That's one of the problems with benchmarks: our CPU may be fast enough
that we can keep the IO subsystems streaming, and the benchmark will
not show up any OS bottlenecks, but we may still be consuming far too
much CPU time internally.  That's certainly the case with the block IO
measurements made on XFS: sure, ext2 can keep a fast disk loaded to
pretty much 100%, but at the cost of far more system CPU time than
XFS+pagebuf+kiobuf-IO takes on the same disk.

> > The presence of terrible performance in the old ll_rw_block code is
> > NOT a good excuse for perpetuating that model.
> 
> i'd like to measure this performance problem (because i'd like to
> double-check it) - what measurement method was used?

"time" will show it.  A 13MB/sec raw IO dd using 64K blocks uses
something between 5% and 15% of CPU time on the various systems I've
tested on (up to 30% on an old 486 with a 1540, but that's hardly
representative. :)  The kernel profile clearly shows the buffer
management as the biggest cost, with the SCSI code walking those
buffer heads a close second.

On my main scsi server test box, I get raw 32K reads taking about 7%
system time on the cpu, with make_request and __get_request_wait being
the biggest hogs.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 17:30                         ` Benjamin C.R. LaHaise
@ 2001-01-09 18:12                           ` Stephen C. Tweedie
  2001-01-09 18:35                           ` Ingo Molnar
  1 sibling, 0 replies; 119+ messages in thread
From: Stephen C. Tweedie @ 2001-01-09 18:12 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise
  Cc: Ingo Molnar, Stephen C. Tweedie, Christoph Hellwig,
	David S. Miller, riel, netdev, linux-kernel

Hi,

On Tue, Jan 09, 2001 at 12:30:39PM -0500, Benjamin C.R. LaHaise wrote:
> On Tue, 9 Jan 2001, Ingo Molnar wrote:
> 
> > this is why i meant that *right now* kiobufs are not suited for networking,
> > at least the way we do it. Maybe if kiobufs had the same kind of internal
> > structure as sk_frag (ie. array of (page,offset,size) triples, not array
> > of pages), that would work out better.
> 
> That I can agree with, and it would make my life easier since I really
> only care about the completion of an entire io, not the individual
> fragments of it.

Right, but this is why the kiobuf IO functions are supposed to accept
kiovecs (ie. counted vectors of kiobuf *s, just like ll_rw_block
receives buffer_heads).

The kiobuf is supposed to be a unit of memory, not of IO.  You can map
several different kiobufs from different sources and send them all
together to brw_kiovec() as a single IO.
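
As a sketch of that pattern (the brw_kiovec() signature here follows the
2.4 code; the helper and its arguments are otherwise hypothetical):

#include <linux/kdev_t.h>
#include <linux/iobuf.h>

/* Submit two independently-mapped kiobufs as one IO. */
static int submit_pair(struct kiobuf *a, struct kiobuf *b, kdev_t dev,
                       unsigned long *blocks, int blocksize)
{
        struct kiobuf *vec[2] = { a, b };

        return brw_kiovec(READ, 2, vec, dev, blocks, blocksize);
}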

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 11:05             ` Ingo Molnar
@ 2001-01-09 18:27               ` Christoph Hellwig
  2001-01-09 19:19                 ` Ingo Molnar
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-09 18:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Rik van Riel, David S. Miller, netdev, linux-kernel

On Tue, Jan 09, 2001 at 12:05:59PM +0100, Ingo Molnar wrote:
> 
> On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> 
> > > 2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses
> > > vectors of struct page *, offset, size entities),
> 
> > Yep. That is why I was so worried about the writepages file op.
> 
> i believe you misunderstand. kiovecs (in their current form) are simply
> too bloated for networking purposes.

Stop.  I NEVER said you should use them internally.
My concern is to use a file operation with a kiobuf ** as the main argument
instead of page *.  With a little more bloat it allows you to do the same
you do now.  But it also offers a real advantage:  you don't have to call
into the network stack for every single page, and this fits easily into Ben's
AIO stuff, so your stuff is very well integrated into the (future) async IO
framework.  (The latter was my main concern.)

You pay 116 bytes and a few cycles for a _lot_ more abstraction and
integration.  Exactly such a design principle (design vs speed) is the reason
why UNIX has survived so long.


> Due to its nature and nonpersistency,
> networking is very lightweight and memory-footprint-sensitive code (as
> opposed to eg. block IO code), right now an 'struct skb_shared_info'
> [which is roughly equivalent to a kiovec] is 12+4*6 == 36 bytes, which
> includes support for 6 distinct fragments (each fragment can be on any
> page, any offset, any size). A *single* kiobuf (which is roughly
> equivalent to an skb fragment) is 52+16*4 == 116 bytes. 6 of these would
> be 696 bytes, for a single TCP packet (!!!). This is simply not something
> to be used for lightweight zero-copy networking.

This doesn't matter, because rw_kiovec can easily take only one kiobuf,
and you don't really need the different fragments there.

> so it's easy to say 'use kiovecs', but right now it's simply not
> practical. kiobufs are a loaded concept, and i'm not sure whether it's
> desirable at all to mix networking zero-copy concepts with
> block-IO/filesystem zero-copy concepts.

I didn't want to suggest that - I'm too clueless concerning networking to
even consider an internal design for network zero-copy IO.
I'm just talking about the VFS interface to the rest of the kernel.

> we talked (and are talking) to Stephen about this problem, but it's a
> clealy 2.5 kernel issue. Merging to a finalized zero-copy framework will
> be easy. (The overwhelming percentage of zero-copy code is in the
> networking code itself and is insensitive to any kiovec issues.)

Agreed.

> > It's rather hackish (only write, looks useful only for networking)
> > instead of the proposed rw_kiovec fop.
> 
> i'm not sure what you are trying to say. You mean we should remove
> sendfile() as well? It's only write, looks useful mostly for networking. A
> substantial percentage of kernel code is useful only for networking :-)

No.  But it looks like a recvmsg syscall wouldn't be too bad either ...

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 17:30                         ` Benjamin C.R. LaHaise
  2001-01-09 18:12                           ` Stephen C. Tweedie
@ 2001-01-09 18:35                           ` Ingo Molnar
  1 sibling, 0 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 18:35 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise
  Cc: Stephen C. Tweedie, Christoph Hellwig, David S. Miller, riel,
	netdev, linux-kernel


On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote:

> I've already got fully async read and write working via a helper thread
                                                      ^^^^^^^^^^^^^^^^^^^
> for doing the bmaps when the page is not uptodate in the page cache.
  ^^^^^^^^^^^^^^^^^^^

that's what TUX 2.0 does. (it does async reads at the moment.)

> The primitives for async locking of pages and waiting on events exist, such
> that converting ext2 to performing full async bmap should be trivial.

well - if you think it's trivial (ie. no process context, no helper thread
will be needed), more power to you. How are you going to ensure that the
issuing process does not block during the bmap()? [without extensive
lowlevel-FS changes that is.]

> Note that O_NONBLOCK is not good enough because you can't implement an
> asynchronous O_SYNC write with it.

(i'm using it for reads only.)

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 17:38                               ` Jens Axboe
@ 2001-01-09 18:38                                 ` Ingo Molnar
  2001-01-09 19:54                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Alan Cox, Stephen C. Tweedie, Christoph Hellwig, David S. Miller,
	riel, netdev, linux-kernel


On Tue, 9 Jan 2001, Jens Axboe wrote:

> > > ever seen, this is why i quoted it - the talk was about block-IO
> > > performance, and Stephen said that our block IO sucks. It used to suck,
> > > but in 2.4, with the right patch from Jens, it doesn't suck anymore. )
> >
> > That's fine. Get me 128K-512K chunks nicely streaming into my raid controller
> > and I'll be a happy man.
>
> No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID.

i cannot agree more - Jens' patch did wonders to IO performance here. It
fixes a long-standing bug in the Linux block-IO-scheduler that caused very
suboptimal requests to be issued to lowlevel drivers once the request
queue gets full. I think this patch is a clear candidate for 2.4.x
inclusion.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 17:56                             ` Chris Evans
@ 2001-01-09 18:41                               ` Ingo Molnar
  2001-01-09 22:58                                 ` [patch]: ac4 blk (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) Jens Axboe
  0 siblings, 1 reply; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 18:41 UTC (permalink / raw)
  To: Chris Evans; +Cc: linux-kernel


On Tue, 9 Jan 2001, Chris Evans wrote:

> > but in 2.4, with the right patch from Jens, it doesn't suck anymore. )
>
> Is this "right patch from Jens" on the radar for 2.4 inclusion?

i do hope so!

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 11:28             ` Christoph Hellwig
  2001-01-09 12:04               ` Ingo Molnar
@ 2001-01-09 19:14               ` Linus Torvalds
  2001-01-09 20:07                 ` Ingo Molnar
  2001-01-12  1:42                 ` Stephen C. Tweedie
  1 sibling, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2001-01-09 19:14 UTC (permalink / raw)
  To: linux-kernel

In article <20010109122810.A3115@caldera.de>,
Christoph Hellwig  <hch@caldera.de> wrote:
>
>You get that multiple page call with kiobufs for free...

No, you don't.

kiobufs are crap. Face it. They do NOT allow proper multi-page scatter
gather, regardless of what the kiobuf PR department has said.

I've complained about it before, and nobody listened. David's zero-copy
network code had the same bug. I complained about it to David, and David
took about a day to understand my arguments, and fixed it.

It's more likely that the zero-copy network code will be used in real
life than kiobufs will ever be.  The kiobufs are damn ugly by
comparison, and the fact that the kiobuf people don't even seem to
realize the problems makes me just more convinced that it's not worth
even arguing about.

What is the problem with kiobuf's? Simple: they have an "offset" and a
"length", and an array of pages.  What that completely and utterly
misses is that if you have an array of pages, you should have an array
of "offset" and "length" too.  As it is, kiobuf's cannot be used for
things like readv() and writev(). 
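
To make the point concrete, compare the two shapes (illustrative
declarations only, not the actual kernel ones):

struct page;                    /* opaque here */

/* A kiobuf describes ONE contiguous byte range: a single offset into
 * the first page, a single length, pages assumed back-to-back. */
struct kiobuf_shape {
        int             offset;
        int             length;
        struct page     **maplist;
};

/* readv()/writev()-style scatter-gather needs an independent offset
 * and length for every element: an array of tuples. */
struct frag_shape {
        struct page     *page;
        unsigned int    offset;
        unsigned int    length;
};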

Yes, to work around this limitation, there's the notion of "kiovec", an
array of kiobuf's.  Never mind the fact that if kiobuf's had been
properly designed in the first place, you wouldn't need kiovec's at all. 
And kiovec's are too damn heavy to use for something like the networking
zero-copy, with all the double indirection etc. 

I told David that he can fix the network zero-copy code two ways: either
he makes it _truly_ scatter-gather (an array of not just pages, but of
proper page-offset-length tuples), or he makes it just a single area and
lets the low-level TCP/whatever code build up multiple segments
internally.  Either of which are good designs.

It so happens that none of the users actually wanted multi-page
scatter-gather, and the only thing that really wanted to do the sg was
the networking layer when it created a single packet out of multiple
areas, so the zero-copy stuff uses the simpler non-array interface. 

And kiobufs can rot in hell for their design mistakes.  Maybe somebody
will listen some day and fix them up, and in the meantime they can look
at the networking code for an example of how to do it. 

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 18:27               ` Christoph Hellwig
@ 2001-01-09 19:19                 ` Ingo Molnar
  0 siblings, 0 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 19:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Rik van Riel, David S. Miller, netdev, linux-kernel


On Tue, 9 Jan 2001, Christoph Hellwig wrote:

> I didn't want to suggest that - I'm too clueless concerning networking
> to even consider an internal design for network zero-copy IO. I'm just
> talking about the VFS interface to the rest of the kernel.

(well, i think you just cannot be clueless about one and then demand
various things about the other...)

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 16:37                         ` Alan Cox
  2001-01-09 16:48                           ` Ingo Molnar
@ 2001-01-09 19:20                           ` J Sloan
  1 sibling, 0 replies; 119+ messages in thread
From: J Sloan @ 2001-01-09 19:20 UTC (permalink / raw)
  To: Alan Cox
  Cc: mingo, Stephen C. Tweedie, Christoph Hellwig, David S. Miller,
	riel, netdev, linux-kernel

Alan Cox wrote:

>
> > it might not be important to others, but we do hold one particular
> > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full
>
> And its real world value is exactly the same as the mindcraft NT values. Don't
> forget that.

In other words, devastating.

jjs

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 18:38                                 ` Ingo Molnar
@ 2001-01-09 19:54                                   ` Andrea Arcangeli
  2001-01-09 20:10                                     ` Ingo Molnar
                                                       ` (2 more replies)
  0 siblings, 3 replies; 119+ messages in thread
From: Andrea Arcangeli @ 2001-01-09 19:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Alan Cox, Stephen C. Tweedie, Christoph Hellwig,
	David S. Miller, riel, netdev, linux-kernel

On Tue, Jan 09, 2001 at 07:38:28PM +0100, Ingo Molnar wrote:
> 
> On Tue, 9 Jan 2001, Jens Axboe wrote:
> 
> > > > ever seen, this is why i quoted it - the talk was about block-IO
> > > > performance, and Stephen said that our block IO sucks. It used to suck,
> > > > but in 2.4, with the right patch from Jens, it doesn't suck anymore. )
> > >
> > > That's fine. Get me 128K-512K chunks nicely streaming into my raid controller
> > > and I'll be a happy man.
> >
> > No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID.
> 
> i cannot agree more - Jens' patch did wonders to IO performance here. It

BTW, I noticed what is left in blk-13B seems to be my work (Jens's fixes for
merging when the I/O queue is full have just been integrated in test1x). The
512K SCSI command, wake_up_nr, elevator fixes and cleanups and removal of the
bogus 64 max_segment limit in scsi.c that matters only with the IOMMU to allow
devices with sg_tablesize <64 to do SG with 64 segments were all thought out and
implemented by me. My last public patch with most of the blk-13B stuff in it
was here:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3

I submitted a later revision of the above blkdev-3 to Jens and he kept nicely
maintaining it in sync with 2.4.x-latest.

My blkdev tree is even more advanced but I didn't have time to update it with 2.4.0
and merge it with Jens yet (I just described to Jens what "more advanced"
means though, in practice it means something like a x2 speedup in tiotest seek
write numbers, streaming I/O doesn't change on highmem boxes but it doesn't
hurt lowmem boxes anymore). Current blk-13B isn't ok for integration yet
because it hurts with lowmem (try with mem=32m with your scsi array that gets
512K*512 requests in flight :) and it's not able to exploit the elevator as
well as my tree even on highmemory machines. So I'd wait until I merge the last
bits with Jens (I raised the QUEUE_NR_REQUESTS to 3000) before inclusion.

Confirm Jens?

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 19:14               ` Linus Torvalds
@ 2001-01-09 20:07                 ` Ingo Molnar
  2001-01-09 20:15                   ` Linus Torvalds
  2001-01-12  1:42                 ` Stephen C. Tweedie
  1 sibling, 1 reply; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 20:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel


On 9 Jan 2001, Linus Torvalds wrote:

> I told David that he can fix the network zero-copy code two ways: either
> he makes it _truly_ scatter-gather (an array of not just pages, but of
> proper page-offset-length tuples), or he makes it just a single area and
> lets the low-level TCP/whatever code build up multiple segments
> internally.  Either of which are good designs.

it's actually truly zero-copy internally, we use an array of
(page,offset,length) tuples, with proper per-page usage counting. We did
this for more than half a year. I believe the array-of-pages solution you refer
to went only from the pagecache layer into the highest level of TCP - then
it got converted into the internal representation. These tuples right now
do not have their own life, they are always associated with actual
outgoing packets (and in fact are allocated together with skb's and are at
the end of the header area).

the lowlevel networking drivers (and even the midlevel networking code) know
nothing about kiovecs or arrays of pages; they use the array-of-tuples
representation:

typedef struct skb_frag_struct skb_frag_t;

struct skb_frag_struct
{
        struct page *page;      /* page holding this fragment's data */
        __u16 page_offset;      /* start of the fragment within the page */
        __u16 size;             /* length of the fragment in bytes */
};

/* This data is invariant across clones and lives at
 * the end of the header data, ie. at skb->end.
 */
struct skb_shared_info {
        atomic_t        dataref;        /* users of this data block */
        unsigned int    nr_frags;       /* number of valid frags[] entries */
        struct sk_buff  *frag_list;     /* chain for fragmented datagrams */
        skb_frag_t      frags[MAX_SKB_FRAGS];
};

(the __u16 thing is more cache-footprint paranoia than real
necessity; it could be int as well.) So i do believe that the networking
code is properly designed in this respect, and this concept goes to the
highest level of the networking code.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 19:54                                   ` Andrea Arcangeli
@ 2001-01-09 20:10                                     ` Ingo Molnar
  2001-01-10  0:00                                       ` Andrea Arcangeli
  2001-01-09 20:12                                     ` Jens Axboe
  2001-01-17  5:16                                     ` Rik van Riel
  2 siblings, 1 reply; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 20:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Jens Axboe, Alan Cox, Stephen C. Tweedie, Christoph Hellwig,
	David S. Miller, riel, netdev, linux-kernel


On Tue, 9 Jan 2001, Andrea Arcangeli wrote:

> BTW, I noticed what is left in blk-13B seems to be my work (Jens's
> fixes for merging when the I/O queue is full are just been integrated
> in test1x).  [...]

it was Jens' [i think those were implemented by Jens entirely]
batch-freeing changes that made the most difference. (we did
profile it step by step.)

> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3

great! i'm happy that the block IO layer and IO scheduler now have
a real home :-) nice work.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 19:54                                   ` Andrea Arcangeli
  2001-01-09 20:10                                     ` Ingo Molnar
@ 2001-01-09 20:12                                     ` Jens Axboe
  2001-01-09 23:20                                       ` Andrea Arcangeli
  2001-01-17  5:16                                     ` Rik van Riel
  2 siblings, 1 reply; 119+ messages in thread
From: Jens Axboe @ 2001-01-09 20:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Alan Cox, Stephen C. Tweedie, Christoph Hellwig,
	David S. Miller, riel, netdev, linux-kernel

On Tue, Jan 09 2001, Andrea Arcangeli wrote:
> > > > That's fine. Get me 128K-512K chunks nicely streaming into my raid controller
> > > > and I'll be a happy man
> > >
> > > No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID.
> > 
> > i cannot agree more - Jens' patch did wonders to IO performance here. It
> 
> BTW, I noticed what is left in blk-13B seems to be my work (Jens's fixes for
> merging when the I/O queue is full have just been integrated in test1x). The
> 512K SCSI commands, wake_up_nr, the elevator fixes and cleanups, and the
> removal of the bogus 64 max_segment limit in scsi.c (which matters only with
> the IOMMU, to allow devices with sg_tablesize <64 to do SG with 64 segments)
> were all thought out and implemented by me. My last public patch with most of
> the blk-13B stuff in it was here:
> 
> 	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3
> 
> I submitted a later revision of the above blkdev-3 to Jens and he has nicely
> kept it in sync with 2.4.x-latest.

There are several parts that have been merged beyond recognition at this
point :-). The wake_up_nr was actually partially redone by Ingo; I suspect
he can fill in the gaps there. Then there are the general cleanups and cruft
removal done by you (the elevator->nr_segments stuff). The bogus 64 max
segments limit from SCSI was there before the merge too; I think I've
actually had that in my tree for ages!

The request free batching and pending queues were done by me, and Ingo
helped tweak it during the spec runs to find a sweet spot of how much to
batch etc.

The elevator received lots of massaging beyond blkdev-3. For one, there
is now only one complete queue scan for merge and insert of a request,
where we before did one for each of them. The merger also does correct
accounting and aging.

In addition there are a bunch of other small fixes in there; I'm too lazy
to list them all now :)

> My blkdev tree is even more advanced but I didn't have time to update it to 2.4.0
> and merge it with Jens' yet (I just described to Jens what "more advanced"
> means though, in practice it means something like a x2 speedup in tiotest seek

I haven't heard anything beyond the raised QUEUE_NR_REQUEST, so I'd like to
see what you have pending so we can merge :-). The tiotest seek increase was
mainly due to the elevator having 3000 requests to juggle and thus being able
to eliminate a lot of seeks, right?

> write numbers, streaming I/O doesn't change on highmem boxes but it doesn't
> hurt lowmem boxes anymore). Current blk-13B isn't ok for integration yet
> because it hurts with lowmem (try with mem=32m with your scsi array that gets
> 512K*512 requests in flight :) and it's not able to exploit the elevator as

I don't see any lowmem problems -- if under pressure, the queue should be
fired and thus it won't get as long as if you have lots of memory free.

> well as my tree even on highmemory machines. So I'd wait until I merge the last
> bits with Jens (I raised the QUEUE_NR_REQUESTS to 3000) before inclusion.

?? What do you mean by "exploit the elevator"?

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 20:07                 ` Ingo Molnar
@ 2001-01-09 20:15                   ` Linus Torvalds
  2001-01-09 20:36                     ` Christoph Hellwig
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2001-01-09 20:15 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel



On Tue, 9 Jan 2001, Ingo Molnar wrote:
> 
>				 So i do believe that the networking
> code is properly designed in this respect, and this concept goes to the
> highest level of the networking code.

Absolutely. This is why I have no conceptual problems with the networking
layer changes, and why I am in violent disagreement with people who think
the networking layer should have used the (much inferior, in my opinion)
kiobuf/kiovec approach.

For people who worry about code re-use and argue for kiobuf/kiovec on
those grounds, I can only say that the code re-use should go the other
way. It should be "the bad code should re-use code from the good code". It
should NOT be "the new code should re-use code from the old code".

			Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 20:15                   ` Linus Torvalds
@ 2001-01-09 20:36                     ` Christoph Hellwig
  2001-01-09 20:55                       ` Linus Torvalds
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-09 20:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: migo, linux-kernel

In article <Pine.LNX.4.10.10101091212520.2331-100000@penguin.transmeta.com> you wrote:


> On Tue, 9 Jan 2001, Ingo Molnar wrote:
>> 
>>				 So i do believe that the networking
>> code is properly designed in this respect, and this concept goes to the
>> highest level of the networking code.

> Absolutely. This is why I have no conceptual problems with the networking
> layer changes, and why I am in violent disagreement with people who think
> the networking layer should have used the (much inferior, in my opinion)
> kiobuf/kiovec approach.

At least I (who started this thread) haven't said they should use kiobufs
internally.  I said: use kiovecs in the interface, because that interface
is a little more general and allows integration with other parts (namely
Ben's aio work) nicely.

Also the tuple argument you gave earlier isn't right in this specific case:

when doing sendfile from pagecache to an fs, you have a bunch of pages,
an offset into the first, and a length that makes the data end before the
last page's end.
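
In kiobuf terms that is a single descriptor for the whole run. Just to
illustrate with the fields from <linux/iobuf.h> (this is not code from
any patch):

	struct kiobuf iobuf;

	iobuf.nr_pages = n;			/* the pagecache pages	*/
	iobuf.offset   = pos & (PAGE_SIZE-1);	/* into the first page	*/
	iobuf.length   = count;			/* may end before the	*/
						/* last page's end	*/
	/* iobuf.maplist[0..n-1] hold the struct page pointers */

i.e. one offset and one length for the whole array of pages.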

> For people who worry about code re-use and argue for kiobuf/kiovec on
> those grounds, I can only say that the code re-use should go the other
> way. It should be "the bad code should re-use code from the good code". It
> should NOT be "the new code should re-use code from the old code".

It's not really about reuse, but about compatibility with other interfaces...

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 20:36                     ` Christoph Hellwig
@ 2001-01-09 20:55                       ` Linus Torvalds
  2001-01-09 21:12                         ` Christoph Hellwig
  2001-01-09 23:06                         ` Benjamin C.R. LaHaise
  0 siblings, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2001-01-09 20:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: migo, linux-kernel



On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> 
> Also the tuple argument you gave earlier isn't right in this specific case:
> 
> when doing sendfile from pagecache to an fs, you have a bunch of pages,
> an offset into the first, and a length that makes the data end before the
> last page's end.

No.

Look at sendfile(). You do NOT have a "bunch" of pages.

Sendfile() is very much a page-at-a-time thing, and expects the actual IO
layers to do their own scatter-gather.

So sendfile() doesn't want any array at all: it only wants a single
page-offset-length tuple interface.

The _lower-level_ stuff (ie TCP and the drivers) want the "array of
tuples", and again, they do NOT want an array of pages, because if
somebody does two sendfile() calls that fit in one packet, it really needs
an array of tuples.

In short, the kiobuf interface is _always_ the wrong one.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 20:55                       ` Linus Torvalds
@ 2001-01-09 21:12                         ` Christoph Hellwig
  2001-01-09 21:26                           ` Linus Torvalds
  2001-01-09 23:06                         ` Benjamin C.R. LaHaise
  1 sibling, 1 reply; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-09 21:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christoph Hellwig, migo, linux-kernel

On Tue, Jan 09, 2001 at 12:55:51PM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> > 
> > Also the tuple argument you gave earlier isn't right in this specific case:
> > 
> > when doing sendfile from pagecache to an fs, you have a bunch of pages,
> > an offset into the first, and a length that makes the data end before the
> > last page's end.
> 
> No.
> 
> Look at sendfile(). You do NOT have a "bunch" of pages.
> 
> Sendfile() is very much a page-at-a-time thing, and expects the actual IO
> layers to do it's own scatter-gather. 
> 
> So sendfile() doesn't want any array at all: it only wants a single
> page-offset-length tuple interface.

The current implementation does.
But others are possible.  I could post one in a few days to show that it is
possible.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 12:04               ` Ingo Molnar
  2001-01-09 14:25                 ` Stephen C. Tweedie
@ 2001-01-09 21:13                 ` David S. Miller
  1 sibling, 0 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-09 21:13 UTC (permalink / raw)
  To: sct; +Cc: mingo, hch, riel, netdev, linux-kernel, sct

   Date: Tue, 9 Jan 2001 14:25:42 +0000
   From: "Stephen C. Tweedie" <sct@redhat.com>

   Perhaps tcp can merge internal 4K requests, but if you're doing udp
   jumbograms (or STP or VIA), you do need an interface which can give
   the networking stack more than one page at once.

All network protocols can use the current interface and get the result
you are after; see MSG_MORE.  TCP isn't "special" in this regard.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 14:40             ` Ingo Molnar
                                 ` (2 preceding siblings ...)
  2001-01-09 15:25               ` Stephen Frost
@ 2001-01-09 21:18               ` David S. Miller
  3 siblings, 0 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-09 21:18 UTC (permalink / raw)
  To: sct; +Cc: mingo, sct, riel, hch, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 15:17:25 +0000
   From: "Stephen C. Tweedie" <sct@redhat.com>

   Jes has also got hard numbers for the performance advantages of
   jumbograms on some of the networks he's been using, and you ain't
   going to get udp jumbograms through a page-by-page API, ever.

Again, see MSG_MORE in the patches.  It is possible, and our UDP
implementation could support it easily.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 13:42   ` David S. Miller
@ 2001-01-09 21:19     ` David S. Miller
  0 siblings, 0 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-09 21:19 UTC (permalink / raw)
  To: trond.myklebust; +Cc: linux-kernel, netdev

   Date: Tue, 9 Jan 2001 16:27:49 +0100 (CET)
   From: Trond Myklebust <trond.myklebust@fys.uio.no>

   OK, but can you eventually generalize it to non-stream protocols
   (i.e. UDP)?

Sure, this is what MSG_MORE is meant to accommodate.  UDP could support
it just fine.
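
Something like this (a sketch, assuming UDP grows the same MSG_MORE
handling that TCP has in the patches):

	send(fd, part1, len1, MSG_MORE);  /* queued, no packet yet */
	send(fd, part2, len2, MSG_MORE);  /* still queued */
	send(fd, part3, len3, 0);         /* datagram goes out now */

One jumbogram on the wire, built from three writes.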

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 21:12                         ` Christoph Hellwig
@ 2001-01-09 21:26                           ` Linus Torvalds
  2001-01-10  7:42                             ` Christoph Hellwig
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2001-01-09 21:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: migo, linux-kernel



On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> > 
> > Look at sendfile(). You do NOT have a "bunch" of pages.
> > 
> > Sendfile() is very much a page-at-a-time thing, and expects the actual IO
> > layers to do their own scatter-gather.
> > 
> > So sendfile() doesn't want any array at all: it only wants a single
> > page-offset-length tuple interface.
> 
> The current implementations does.
> But others are possible.  I could post one in a few days to show that it is
> possible.

Why do you bother arguing, when I've shown you that even if sendfile()
_did_ do multiple pages, it STILL wouldn't make kiobufs the right
interface? You just snipped out the part of my email which states that
the networking layer would still need to do better scatter-gather than
kiobufs can give it for multiple sendfile() invocations.

Let me iterate:

 - the layers like TCP _need_ to do scatter-gather anyway: you absolutely
   want to be able to send out just one packet even if the data comes from
   two different sources (for example, one source might be the http
   header, while the other source is the actual file contents. This is
   definitely not a made-up example; it is THE example of something like
   this, and happens with just about all protocols that have a notion of
   a header, which is pretty much 100% of them).

 - because TCP needs to do scatter-gather anyway across calls, there is no
   real reason for sendfile() to do it. And sendfile() doing it would
   _not_ obviate the need for it in the networking layer - it would only
   add complexity for absolutely no performance gain.

So neither sendfile() _nor_ the networking layer wants kiobufs. Never have,
never will. The "half-way scatter-gather" support they give ends up either
being too much baggage, or too little. It's never the right fit.

kiovec adds support for true scatter-gather, but with a horribly bad
interface, and much too much overhead - and absolutely NO advantages over
the _proper_ array of <page, offset, length> tuples, which is much simpler
than the complex two-level arrays that you get with kiovec+kiobuf.

End of story.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:17               ` Stephen C. Tweedie
  2001-01-09 15:37                 ` Ingo Molnar
@ 2001-01-09 22:25                 ` Linus Torvalds
  2001-01-10 15:21                   ` Stephen C. Tweedie
  1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2001-01-09 22:25 UTC (permalink / raw)
  To: linux-kernel

In article <20010109151725.D9321@redhat.com>,
Stephen C. Tweedie <sct@redhat.com> wrote:
>
>Jes has also got hard numbers for the performance advantages of
>jumbograms on some of the networks he's been using, and you ain't
>going to get udp jumbograms through a page-by-page API, ever.

Wrong.

The only thing you need is a nagle-type thing that coalesces requests.
In the case of UDP, that coalescing obviously has to be explicitly
controlled, as the "standard" UDP behaviour is to send out just one
packet per write.

But this is a problem for TCP too: you want to tell TCP to _not_ send
out a short packet even if there are none in-flight, if you know you
want to send more.  So you want to have some way to anti-nagle for TCP
anyway. 

Also, if you look at the problem of "writev()", you'll notice that you
have many of the same issues: what you really want is to _always_
coalesce, and only send out when explicitly asked for (and then that
explicit ask would be on by default at the end of write() and at the
very end of the last segment in "writev()").

It so happens that this logic already exists; it's called MSG_MORE or
something similar (I'm too lazy to check the actual patches).

And it's there exactly because it is stupid to make the upper layers
have to gather everything into one packet if the lower layers need that
logic for other reasons anyway. Which they obviously do.

So what you can do is to just do multiple writes, and set the MSG_MORE
flag.  This works with sendfile(), but more importantly it is also an
uncommonly good interface to user mode.  With this, you can actually
implement things like "writev()" _properly_ from user-space, and we
could get rid of the special socket writev() magic if we wanted to. 

So if you have a header, you just send out that header separately (with
the MSG_MORE flag), and then do a "sendfile()" or whatever to send out
the data. 

This is much more flexible than writev(), and a lot easier to use.  It's
also a hell of a lot more flexible than the ugly sendfile() interfaces
that HP-UX and the BSD people have - I'm ashamed of how little taste the
BSD group in general has had in interface design.  Ugh.  Tacking on a
mixture of writev() and sendfile() in the same system call.  Tacky. 
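
For concreteness, the pattern looks like this from user space (just a
sketch - send_response() and its arguments are made up here, and it
assumes the MSG_MORE flag from the patches):

#include <sys/socket.h>
#include <sys/sendfile.h>

static int send_response(int sock, int filefd,
                         const char *hdr, size_t hdrlen, off_t filelen)
{
	off_t off = 0;

	/* header: tell TCP more data is coming, don't flush yet */
	if (send(sock, hdr, hdrlen, MSG_MORE) < 0)
		return -1;

	/* body: streamed straight out of the page cache */
	if (sendfile(sock, filefd, &off, filelen) < 0)
		return -1;

	return 0;
}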

			Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch]: ac4 blk (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
  2001-01-09 18:41                               ` Ingo Molnar
@ 2001-01-09 22:58                                 ` Jens Axboe
  0 siblings, 0 replies; 119+ messages in thread
From: Jens Axboe @ 2001-01-09 22:58 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Chris Evans, linux-kernel

On Tue, Jan 09 2001, Ingo Molnar wrote:
> 
> > > but in 2.4, with the right patch from Jens, it doesnt suck anymore. )
> >
> > Is this "right patch from Jens" on the radar for 2.4 inclusion?
> 
> i do hope so!

Here's a version against 2.4.0-ac4; blk-13B did not apply cleanly due to
the moving of the i2o files and the S/390 dasd changes:

*.kernel.org/pub/linux/kernel/people/axboe/patches/2.4.0-ac4/blk-13C.bz2

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 20:55                       ` Linus Torvalds
  2001-01-09 21:12                         ` Christoph Hellwig
@ 2001-01-09 23:06                         ` Benjamin C.R. LaHaise
  2001-01-09 23:54                           ` Linus Torvalds
  1 sibling, 1 reply; 119+ messages in thread
From: Benjamin C.R. LaHaise @ 2001-01-09 23:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christoph Hellwig, migo, linux-kernel

On Tue, 9 Jan 2001, Linus Torvalds wrote:

> The _lower-level_ stuff (ie TCP and the drivers) want the "array of
> tuples", and again, they do NOT want an array of pages, because if
> somebody does two sendfile() calls that fit in one packet, it really needs
> an array of tuples.

A kiobuf simply provides that tuple plus the completion callback.  Stick a
bunch of them together and you've got a kiovec.  I don't see the advantage
of moving to simpler primitives if they don't provide the needed
functionality.

> In short, the kiobuf interface is _always_ the wrong one.

Please tell me what you think the right interface is that provides a hook
on io completion and is asynchronous.

		-ben


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 20:12                                     ` Jens Axboe
@ 2001-01-09 23:20                                       ` Andrea Arcangeli
  2001-01-09 23:34                                         ` Jens Axboe
  0 siblings, 1 reply; 119+ messages in thread
From: Andrea Arcangeli @ 2001-01-09 23:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Alan Cox, Stephen C. Tweedie, Christoph Hellwig,
	David S. Miller, riel, netdev, linux-kernel

On Tue, Jan 09, 2001 at 09:12:04PM +0100, Jens Axboe wrote:
> I haven't heard anything beyond the raised QUEUE_NR_REQUEST, so I'd like to
> see what you have pending so we can merge :-). The tiotest seek increase was
> mainly due to the elevator having 3000 requests to juggle and thus being able
> to eliminate a lot of seeks right?

Raising QUEUE_NR_REQUEST is possible because of the rework of other parts of
ll_rw_block meant to fix the lowmem boxes.

> > write numbers, streaming I/O doesn't change on highmem boxes but it doesn't
> > hurt lowmem boxes anymore). Current blk-13B isn't ok for integration yet
> > because it hurts with lowmem (try with mem=32m with your scsi array that gets
> > 512K*512 requests in flight :) and it's not able to exploit the elevator as
> 
> I don't see any lowmem problems -- if under pressure, the queue should be
> fired and thus it won't get as long as if you have lots of memory free.`

A write(2) shouldn't cause the allocator to wait for I/O completion. It's the
write that should block when it's only polluting the cache, or you'll hurt the
innocent rest of the system that isn't writing.

At least with my original implementation of the 512K large scsi command
support that you merged, before a write could block you first had to generate
at least 128Mbyte of _locked_ memory, all queued in the I/O request list
waiting for the driver to process the requests (only the locked part, without
considering the dirty part of memory).

Since you raised the number of requests per queue from 256 to 512 with your
patch, you may have to generate 256Mbyte of locked memory before a write can
block.

This is great on the 8G boxes that run specweb, but it isn't that great on a
32Mbyte box that happens to be connected to a decent SCSI adapter.

I say "may" because I didn't checked closely if you introduced any kind of
logic to avoid this. It seems not though because such a logic needs to touch at
least blkdev_release_request and that's what I developed in my tree and then I
could raise the number of I/O request in the queue up to 10000 if I wanted
without any problem, the max-I/O in flight was controlled properly. (this
allowed me to optimize away not 256 or in your case 512 seeks but 10000 seeks)
This is what I meant with exploiting the elevator. No panic, there's no buffer
overflow there ;)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 23:20                                       ` Andrea Arcangeli
@ 2001-01-09 23:34                                         ` Jens Axboe
  2001-01-09 23:52                                           ` Andrea Arcangeli
  0 siblings, 1 reply; 119+ messages in thread
From: Jens Axboe @ 2001-01-09 23:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Alan Cox, Stephen C. Tweedie, Christoph Hellwig,
	David S. Miller, riel, netdev, linux-kernel

On Wed, Jan 10 2001, Andrea Arcangeli wrote:
> On Tue, Jan 09, 2001 at 09:12:04PM +0100, Jens Axboe wrote:
> > I haven't heard anything beyond the raised QUEUE_NR_REQUEST, so I'd like to
> > see what you have pending so we can merge :-). The tiotest seek increase was
> > mainly due to the elevator having 3000 requests to juggle and thus being able
> > to eliminate a lot of seeks, right?
> 
> Raising QUEUE_NR_REQUEST is possible because of the rework of other parts of
> ll_rw_block meant to fix the lowmem boxes.

Ah, I see. It would be nice to base QUEUE_NR_REQUEST on something other
than a static number. For example, 3000 per queue translates into 281Kb
of request slots per queue. On a typical system with a floppy, hard drive,
and CD-ROM it's getting close to 1Mb of RAM used for this alone. On a
32Mb box this is unacceptable.

I previously had blk_init_queue_nr(q, nr_free_slots) to, e.g., not use that
many free slots on, say, the floppy, where they don't really make much sense
anyway.

> > I don't see any lowmem problems -- if under pressure, the queue should be
> fired and thus it won't get as long as if you have lots of memory free.
> 
> A write(2) shouldn't cause the allocator to wait I/O completion. It's the write
> that should block when it's only polluting the cache or you'll hurt the
> innocent rest of the system that isn't writing.
> 
> At least with my original implementation of the 512K large scsi command
> support that you merged, before a write could block you first had to generate
> at least 128Mbyte of memory _locked_ all queued in the I/O request list waiting
> the driver to process the requests (only locked, without considering
> the dirty part of memory).
> 
> Since you raised from 256 requests per queue to 512 with your patch you
> may have to generate 256Mbyte of locked memory before a write can block.
> 
> This is great on the 8G boxes that runs specweb but this isn't that great on a
> 32Mbyte box connected incidentally to a decent SCSI adapter.

Yes, I see your point. However, memory shortage will fire the queue in due
time; it won't make the WRITE block, though. In this case it would be
bdflush blocking on the WRITEs, which seems exactly like what we don't want?

> I say "may" because I didn't checked closely if you introduced any kind of
> logic to avoid this. It seems not though because such a logic needs to touch at
> least blkdev_release_request and that's what I developed in my tree and then I
> could raise the number of I/O request in the queue up to 10000 if I wanted
> without any problem, the max-I/O in flight was controlled properly. (this
> allowed me to optimize away not 256 or in your case 512 seeks but 10000 seeks)
> This is what I meant with exploiting the elevator. No panic, there's no buffer
> overflow there ;)

So you imposed an MB limit on how much I/O would be outstanding in
blkdev_release_request? Wouldn't it make more sense to move this to
get_request time, since with the blkdev_release_request approach you won't
catch lots of outstanding locked buffers before you start releasing one of
them, at which point it would be too late (it might recover, but still).
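
Something like this is what I'm thinking of (the field names are made
up, it's just to sketch the accounting):

static inline void account_request(request_queue_t *q, struct request *rq)
{
	/* charge the queue when a request is handed out */
	q->nr_sectors_in_flight += rq->nr_sectors;
}

static inline int queue_congested(request_queue_t *q)
{
	/*
	 * checked at get_request time: block new writers when too much
	 * I/O is already locked in the queue, instead of noticing only
	 * when blkdev_release_request starts freeing requests again.
	 */
	return q->nr_sectors_in_flight > q->max_sectors_in_flight;
}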


-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 23:34                                         ` Jens Axboe
@ 2001-01-09 23:52                                           ` Andrea Arcangeli
  0 siblings, 0 replies; 119+ messages in thread
From: Andrea Arcangeli @ 2001-01-09 23:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Alan Cox, Stephen C. Tweedie, Christoph Hellwig,
	David S. Miller, riel, netdev, linux-kernel

On Wed, Jan 10, 2001 at 12:34:35AM +0100, Jens Axboe wrote:
> Ah I see. It would be nice to base the QUEUE_NR_REQUEST on something else
> than a static number. For example, 3000 per queue translates into 281Kb
> of request slots per queue. On a typical system with a floppy, hard drive,
> and CD-ROM it's getting close to 1Mb of RAM used for this alone. On a
> 32Mb box this is unaccebtable.

Yes, of course. In fact 3000 was just the number I chose when doing the
benchmarks on a 128M box. Things need to be autotuned and that's not yet
implemented; I meant 3000 to show how far such a number can grow. Right now
if you use 3000 you will need to lock 1.5G of RAM (more than the normal
zone!) before you can block with the 512K scsi commands.  This was just to
show that the rest of the blkdev layer was obviously restructured.  On an
8G box 10000 requests would probably be a good number.

> Yes I see your point. However memory shortage will fire the queue in due
> time, it won't make the WRITE block however. In this case it would be

That's the performance problem I'm talking about on the lowmem boxes. In fact
this problem will happen in 2.4.x too, just less pronounced than with the
512K scsi commands and your raising of the number of requests from 256 to 512.

> bdflush blocking on the WRITE's, which seem exactly what we don't want?

In 2.4.0 Linus fixed wakeup_bdflush not to wait for bdflush anymore, as I
suggested; now it's the task context that submits the requests directly to the
I/O queue, so it's the task that must block, not bdflush. And the task will
block correctly _if_ we unplug at a sane time in ll_rw_block.

> So you imposed a MB limit on how much I/O would be outstanding in
> blkdev_release_request? Wouldn't it make more sense to move this to at

No, absolutely not - not in blkdev_release_request. The changes there
are because you need to somehow do some accounting at I/O completion.

> get_request time, since with the blkdev_release_request approach you won't

Yes, only ll_rw_block unplugs, not blkdev_release_request - obviously,
since the latter runs from irqs.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 23:06                         ` Benjamin C.R. LaHaise
@ 2001-01-09 23:54                           ` Linus Torvalds
  2001-01-10  7:51                             ` Gerd Knorr
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2001-01-09 23:54 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise; +Cc: Christoph Hellwig, migo, linux-kernel



On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote:

> On Tue, 9 Jan 2001, Linus Torvalds wrote:
> 
> > The _lower-level_ stuff (ie TCP and the drivers) want the "array of
> > tuples", and again, they do NOT want an array of pages, because if
> > somebody does two sendfile() calls that fit in one packet, it really needs
> > an array of tuples.
> 
> A kiobuf simply provides that tuple plus the completion callback.  Stick a
> bunch of them together and you've got a kiovec.  I don't see the advantage
> of moving to simpler primatives if they don't provide needed
> functionality.

Ehh.

Let's re-state your argument:

 "You could have used the existing, complex and cumbersome primitives that
  had the wrong semantics. I don't see the advantage of pointing out the
  fact that those primitives are badly designed for the problem at hand 
  and moving to simpler and better designed primitives that fit the
  problem well"

Would you agree that that is the essense of what you said? And if not,
then why not?

> Please tell me what you think the right interface is that provides a hook
> on io completion and is asynchronous.

Suggested fix to kiovecs: get rid of them. Immediately. Replace them with
kiobufs that can handle scatter-gather pages. kiobufs have 90% of that
support already.

Never EVER have a "struct page **" interface. It is never the valid thing
to do. You should have

	struct fragment {
		struct page *page;
		__u16 offset, length;
	};

and then have "struct fragment **" inside the kiobufs instead. Rename
"nr_pages" as "nr_fragments", and get rid of the global offset/length, as
they don't make any sense. Voila - your kiobuf is suddenly a lot more
flexible.

Finally, don't embed the static KIO_STATIC_PAGES array in the kiobuf. The
caller knows when it makes sense, and when it doesn't. Don't embed that
knowledge in fundamental data structures.
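
The end result would look something like this (a sketch, not a patch -
and it keeps the completion hook you asked about):

	struct kiobuf {
		int		nr_fragments;	/* was nr_pages */
		struct fragment	**fraglist;	/* was struct page **maplist */

		int		errno;		/* completion status */
		void		(*end_io)(struct kiobuf *);
	};

No global offset/length: each fragment carries its own.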

In the meantime, I'm more than happy to make sure that the networking
infrastructure is sane. Which implies that the networking infrastructure
does NOT use kiovecs.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 20:10                                     ` Ingo Molnar
@ 2001-01-10  0:00                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 119+ messages in thread
From: Andrea Arcangeli @ 2001-01-10  0:00 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jens Axboe, linux-kernel

On Tue, Jan 09, 2001 at 09:10:24PM +0100, Ingo Molnar wrote:
> 
> On Tue, 9 Jan 2001, Andrea Arcangeli wrote:
> 
> > BTW, I noticed what is left in blk-13B seems to be my work (Jens's
> > fixes for merging when the I/O queue is full have just been integrated
> > in test1x).  [...]
> 
> it was Jens' [i think those were implemented by Jens entirely]
> batch-freeing changes that made the most difference. (we did

Confirmed, the batch-freeing was Jens's work.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:40                 ` Ingo Molnar
  2001-01-09 15:48                   ` Stephen Frost
@ 2001-01-10  1:14                   ` Dave Zarzycki
  2001-01-10  1:14                     ` David S. Miller
  2001-01-10  1:19                     ` Ingo Molnar
  1 sibling, 2 replies; 119+ messages in thread
From: Dave Zarzycki @ 2001-01-10  1:14 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

On Tue, 9 Jan 2001, Ingo Molnar wrote:

> then you'll love the zerocopy patch :-) Just use sendfile() or specify
> MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
> DMA-and-checksumming on cards that support it.

I'm confused.

In user space, how do you know when it's safe to reuse the buffer that was
handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg() with that
flag block until the buffer isn't needed by the kernel any more? If it
does block, doesn't that defeat the use of non-blocking I/O?

davez

-- 
Dave Zarzycki
http://thor.sbay.org/~dave/


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  1:14                   ` Dave Zarzycki
@ 2001-01-10  1:14                     ` David S. Miller
  2001-01-10  2:18                       ` Dave Zarzycki
  2001-01-10  1:19                     ` Ingo Molnar
  1 sibling, 1 reply; 119+ messages in thread
From: David S. Miller @ 2001-01-10  1:14 UTC (permalink / raw)
  To: dave; +Cc: mingo, linux-kernel

   Date: 	Tue, 9 Jan 2001 17:14:33 -0800 (PST)
   From: Dave Zarzycki <dave@zarzycki.org>

   On Tue, 9 Jan 2001, Ingo Molnar wrote:

   > then you'll love the zerocopy patch :-) Just use sendfile() or specify
   > MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
   > DMA-and-checksumming on cards that support it.

   I'm confused.

   In user space, how do you know when it's safe to reuse the buffer that was
   handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg() with that
   flag block until the buffer isn't needed by the kernel any more? If it
   does block, doesn't that defeat the use of non-blocking I/O?

Ignore Ingo's comments about the MSG_NOCOPY flag; I've not included
those parts in the zerocopy patches as they are very controversial
and require some VM layer support.

Basically, it pins the userspace pages, so if you write to them before
the data is fully sent and the networking buffer freed, they get
copied with a COW fault.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  1:14                   ` Dave Zarzycki
  2001-01-10  1:14                     ` David S. Miller
@ 2001-01-10  1:19                     ` Ingo Molnar
  1 sibling, 0 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-10  1:19 UTC (permalink / raw)
  To: Dave Zarzycki; +Cc: linux-kernel


On Tue, 9 Jan 2001, Dave Zarzycki wrote:

> In user space, how do you know when it's safe to reuse the buffer that
> was handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg()
> with that flag block until the buffer isn't needed by the kernel any
> more? If it does block, doesn't that defeat the use of non-blocking
> I/O?

sendmsg() marks those pages COW and copies the original page into a new
one for further usage. (the old page is used until the packet is
released.) So for maximum performance user-space should not reuse such
buffers immediately.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  1:14                     ` David S. Miller
@ 2001-01-10  2:18                       ` Dave Zarzycki
  0 siblings, 0 replies; 119+ messages in thread
From: Dave Zarzycki @ 2001-01-10  2:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: mingo, linux-kernel

On Tue, 9 Jan 2001, David S. Miller wrote:

> Ignore Ingo's comments about the MSG_NOCOPY flag, I've not included
> those parts in the zerocopy patches as they are very controversial
> and require some VM layer support.

Okay, I talked to some kernel engineers where I work and they were (I
think) very justifiably skeptical of zero-copy work with respect to
read/write style APIs.

> Basically, it pins the userspace pages, so if you write to them before
> the data is fully sent and the networking buffer freed, they get
> copied with a COW fault.

Yum... Assuming a gigabit ethernet link is saturated with the
sendmsg(MSG_NOCOPY) API, what is CPU utilization like for a given clock
speed and processor make? Is it any different from the sendfile() case?

davez

-- 
Dave Zarzycki
http://thor.sbay.org/~dave/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
  2001-01-09 10:23         ` Ingo Molnar
                             ` (2 preceding siblings ...)
  2001-01-09 14:18           ` Stephen C. Tweedie
@ 2001-01-10  2:56           ` dean gaudet
  2001-01-10  2:58             ` David S. Miller
  2001-01-10  3:05             ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, Alan Cox
  3 siblings, 2 replies; 119+ messages in thread
From: dean gaudet @ 2001-01-10  2:56 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Rik van Riel, David S. Miller, hch, netdev, linux-kernel

On Tue, 9 Jan 2001, Ingo Molnar wrote:

> On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> > Having proper kiobuf support would make it possible to, for example,
> > do zerocopy network->disk data transfers and lots of other things.
>
> i used to think that this is useful, but these days it isn't.

this seems to be in the general theme of "network receive is boring",
which i mostly agree with... except recently i've been thinking about an
application where it may not be so boring, but i haven't researched all
the details yet.

the application is storage over IP -- SAN using IP (i.e. gigabit ethernet)
technologies instead of fiberchannel technologies.  several companies are
doing it or planning to do it (for example EMC, 3ware).

i'm taking a wild guess that SCSI over FC is arranged conveniently to
allow a scatter request to read packets off the FC NIC such that the
headers go one way and the data lands neatly into the page cache (i.e.
fixed length headers).  i've never investigated the actual protocols
though so maybe the solution used was to just push a lot of the detail
down into the controllers.

a quick look at the iSCSI specification
<http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-02.txt>, and the
FCIP spec
<http://www.ietf.org/internet-drafts/draft-ietf-ips-fcovertcpip-01.txt>
show that both use TCP/IP.  TCP/IP has variable length headers (or am i on
crack?), which totally complicates the receive path.

the iSCSI requirements document seems to imply they're happy with pushing
this extra processing down to a special storage NIC.  that kind of sucks
-- one of the benefits of storage over IP would be the ability to
redundantly connect a box to storage and IP with only two NICs (instead of
4 -- 2 IP and 2 FC).

is NFS receive single copy today?

anyone tried doing packet demultiplexing by grabbing headers on one pass
and scattering the data on a second pass?

i'm hoping i'm missing something.  anyone else looked around at this stuff
yet?

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
  2001-01-10  2:56           ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) dean gaudet
@ 2001-01-10  2:58             ` David S. Miller
  2001-01-10  3:18               ` dean gaudet
  2001-01-10  3:05             ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, Alan Cox
  1 sibling, 1 reply; 119+ messages in thread
From: David S. Miller @ 2001-01-10  2:58 UTC (permalink / raw)
  To: dean-list-linux-kernel; +Cc: mingo, riel, hch, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 18:56:33 -0800 (PST)
   From: dean gaudet <dean-list-linux-kernel@arctic.org>

   is NFS receive single copy today?

With the zerocopy patches, NFS client receive is "single cpu copy" if
that's what you mean.

Later,
David S. Miller
davem@redhat.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch,
  2001-01-10  2:56           ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) dean gaudet
  2001-01-10  2:58             ` David S. Miller
@ 2001-01-10  3:05             ` Alan Cox
  1 sibling, 0 replies; 119+ messages in thread
From: Alan Cox @ 2001-01-10  3:05 UTC (permalink / raw)
  To: dean gaudet
  Cc: Ingo Molnar, Rik van Riel, David S. Miller, hch, netdev, linux-kernel

> fixed length headers).  i've never investigated the actual protocols
> though so maybe the solution used was to just push a lot of the detail
> down into the controllers.

The stuff I have access to (MPT fusion) pushes the FC handling down onto the
board. Basically you talk scsi and IP to it (See drivers/message/fusion in
-ac)

> <http://www.ietf.org/internet-drafts/draft-ietf-ips-fcovertcpip-01.txt>
> show that both use TCP/IP.  TCP/IP has variable length headers (or am i on
> crack?), which totally complicates the receive path.

TCP has variable length headers. It also prevents you from re-ordering
commands in the stream, which would be beneficial. I've not checked whether
the draft uses multiple TCP streams, but then you have scaling questions.

Alan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
  2001-01-10  3:18               ` dean gaudet
@ 2001-01-10  3:09                 ` David S. Miller
  0 siblings, 0 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-10  3:09 UTC (permalink / raw)
  To: dean-list-linux-kernel; +Cc: mingo, riel, hch, netdev, linux-kernel

   Date: Tue, 9 Jan 2001 19:18:53 -0800 (PST)
   From: dean gaudet <dean-list-linux-kernel@arctic.org>

   - NIC DMAs packet to memory
   - CPU reads headers from memory, figures out it's NFS
   - CPU copies data bytes from packet image in memory to pagecache

Yes, this is precisely what happens in the NFS client with
the zerocopy patches applied.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)
  2001-01-10  2:58             ` David S. Miller
@ 2001-01-10  3:18               ` dean gaudet
  2001-01-10  3:09                 ` David S. Miller
  0 siblings, 1 reply; 119+ messages in thread
From: dean gaudet @ 2001-01-10  3:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: mingo, riel, hch, netdev, linux-kernel

On Tue, 9 Jan 2001, David S. Miller wrote:

>    Date: Tue, 9 Jan 2001 18:56:33 -0800 (PST)
>    From: dean gaudet <dean-list-linux-kernel@arctic.org>
>
>    is NFS receive single copy today?
>
> With the zerocopy patches, NFS client receive is "single cpu copy" if
> that's what you mean.

yeah sorry, i meant:

- NIC DMAs packet to memory
- CPU reads headers from memory, figures out it's NFS
- CPU copies data bytes from packet image in memory to pagecache

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 21:26                           ` Linus Torvalds
@ 2001-01-10  7:42                             ` Christoph Hellwig
  2001-01-10  8:05                               ` Linus Torvalds
  2001-01-17 14:05                               ` Rik van Riel
  0 siblings, 2 replies; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-10  7:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christoph Hellwig, migo, linux-kernel

On Tue, Jan 09, 2001 at 01:26:44PM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> > > 
> > > Look at sendfile(). You do NOT have a "bunch" of pages.
> > > 
> > > Sendfile() is very much a page-at-a-time thing, and expects the actual IO
> > > layers to do their own scatter-gather.
> > > 
> > > So sendfile() doesn't want any array at all: it only wants a single
> > > page-offset-length tuple interface.
> > 
> > The current implementation does.
> > But others are possible.  I could post one in a few days to show that it is
> > possible.
> 
> Why do you bother arguing, when I've shown you that even if sendfile()
> _did_ do multiple pages, it STILL wouldn't make kibuf's the right
> interface. You just snipped out that part of my email, which states that
> the networking layer would still need to do better scatter-gather than
> kiobuf's can give it for multiple send-file invocations.

Simple.  Because I stated before that I DON'T even want the networking
to use kiobufs in lower layers.  My whole argument is to pass a kiovec
into the fileop instead of a page, because it makes sense for other
drivers to use multiple pages, and doesn't hurt networking besides
the cost of one kiobuf (116k) and the processor cycles for creating
and destroying it once per sys_sendfile.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 23:54                           ` Linus Torvalds
@ 2001-01-10  7:51                             ` Gerd Knorr
  0 siblings, 0 replies; 119+ messages in thread
From: Gerd Knorr @ 2001-01-10  7:51 UTC (permalink / raw)
  To: linux-kernel

> > Please tell me what you think the right interface is that provides a hook
> > on io completion and is asynchronous.
> 
> Suggested fix to kiovec's: get rid of them. Immediately. Replace them with
> kiobuf's that can handle scatter-gather pages. kiobuf's have 90% of that
> support already.
> 
> Never EVER have a "struct page **" interface. It is never the valid thing
> to do.

Hmm, /me is quite happy with it.  It's fine for *big* chunks of memory like
video frames:  I just need a large number of pages, length and offset.  If
someone wants to have a look: a rewritten bttv version which uses kiobufs
is available at http://www.strusel007.de/linux/bttv/bttv-0.8.8.tar.gz

It does _not_ use kiovecs though (to be exact: it uses kiovecs with just
one single kiobuf in there).

> You should have
> 
> 	struct fragment {
> 		struct page *page;
> 		__u16 offset, length;
> 	}

What happens with big memory blocks?  Do all pages but the first and last
get offset=0 and length=PAGE_SIZE then?

  Gerd

-- 
Get back there in front of the computer NOW. Christmas can wait.
	-- Linus "the Grinch" Torvalds,  24 Dec 2000 on linux-kernel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  7:42                             ` Christoph Hellwig
@ 2001-01-10  8:05                               ` Linus Torvalds
  2001-01-10  8:33                                 ` Christoph Hellwig
  2001-01-10  8:37                                 ` Andrew Morton
  2001-01-17 14:05                               ` Rik van Riel
  1 sibling, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2001-01-10  8:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: migo, linux-kernel



On Wed, 10 Jan 2001, Christoph Hellwig wrote:
> 
> Simple.  Because I stated before that I DON'T even want the networking
> to use kiobufs in lower layers.  My whole argument is to pass a kiovec
> into the fileop instead of a page, because it makes sense for other
> drivers to use multiple pages, and doesn't hurt networking besides
> the cost of one kiobuf (116k) and the processor cycles for creating
> and destroying it once per sys_sendfile.

Fair enough.

My whole argument against that is that I think kiovec's are incredibly
ugly, and the less I see of them in critical regions, the happier I am.

And that, I have to admit, is really mostly a matter of "taste". 

De gustibus non disputandum.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  8:05                               ` Linus Torvalds
@ 2001-01-10  8:33                                 ` Christoph Hellwig
  2001-01-10  8:37                                 ` Andrew Morton
  1 sibling, 0 replies; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-10  8:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: migo, linux-kernel

On Wed, Jan 10, 2001 at 12:05:01AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 10 Jan 2001, Christoph Hellwig wrote:
> > 
> > Simple.  Because I stated before that I DON'T even want the networking
> > to use kiobufs in lower layers.  My whole argument is to pass a kiovec
> > into the fileop instead of a page, because it makes sense for other
> > drivers to use multiple pages, and doesn't hurt networking besides
> > the cost of one kiobuf (116k) and the processor cycles for creating
> > and destroying it once per sys_sendfile.
> 
> Fair enough.
> 
> My whole argument against that is that I think kiovec's are incredibly
> ugly, and the less I see of them in critical regions, the happier I am.
> 
> And that, I have to admit, is really mostly a matter of "taste". 

Ok.

This is a statement that makes all the current kiobuf efforts look
rather less interesting than before.

IMHO it is time to find a generic interface for IO that is acceptable
to you and widely usable.

As you stated before, that seems to be something with page, offset,
length tuples.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  8:05                               ` Linus Torvalds
  2001-01-10  8:33                                 ` Christoph Hellwig
@ 2001-01-10  8:37                                 ` Andrew Morton
  2001-01-10 23:32                                   ` Linus Torvalds
  1 sibling, 1 reply; 119+ messages in thread
From: Andrew Morton @ 2001-01-10  8:37 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds wrote:
> 
> De gustibus non disputandum.

http://cogprints.soton.ac.uk/documents/disk0/00/00/07/57/

	"ingestion of the afterbirth during delivery"

eh?


http://www.degustibus.co.uk/

	"Award winning artisan breadmakers."

Ah.  That'll be it.

-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 15:27     ` Trond Myklebust
@ 2001-01-10  9:21       ` Trond Myklebust
  0 siblings, 0 replies; 119+ messages in thread
From: Trond Myklebust @ 2001-01-10  9:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, netdev

>>>>> " " == David S Miller <davem@redhat.com> writes:

     >    Date: Tue, 9 Jan 2001 16:27:49 +0100 (CET) From: Trond
     >    Myklebust <trond.myklebust@fys.uio.no>

     >    OK, but can you eventually generalize it to non-stream
     >    protocols (i.e. UDP)?

     > Sure, this is what MSG_MORE is meant to accomodate.  UDP could
     > support it just fine.

Great! I've been waiting for something like this. In particular the
knfsd TCP server code can get very buffer-intensive without it since
you need to pre-allocate 1 set of buffers per TCP connection (else you
get DOS due to buffer saturation when doing wait+retry for blocked
sockets).

If it all gets into the kernel, I'll do the work of adapting the NFS
+ sunrpc stuff.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 22:25                 ` Linus Torvalds
@ 2001-01-10 15:21                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 119+ messages in thread
From: Stephen C. Tweedie @ 2001-01-10 15:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Stephen Tweedie

Hi,

On Tue, Jan 09, 2001 at 02:25:43PM -0800, Linus Torvalds wrote:
> In article <20010109151725.D9321@redhat.com>,
> Stephen C. Tweedie <sct@redhat.com> wrote:
> >
> >Jes has also got hard numbers for the performance advantages of
> >jumbograms on some of the networks he's been using, and you ain't
> >going to get udp jumbograms through a page-by-page API, ever.
> 
> The only thing you need is a nagle-type thing that coalesces requests.

Is this robust enough to build a useful user-level API on top of?

What happens if we have a threaded application in which more than one
process may be sending udp sendmsg()s to the file descriptor?  If we
end up decomposing each datagram into multiple page-sized chunks, then
you can imagine them arriving at the fd stream in interleaved order.

You can fix that by adding extra locking, but that just indicates that
the original API wasn't sufficient to communicate the precise intent
of the application in the first place.
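
(A sketch of the serialisation I mean, assuming each datagram is
decomposed into page[] sized chunks:)

	pthread_mutex_lock(&fd_lock);	/* without this, two threads'  */
	for (i = 0; i < npages; i++)	/* chunks interleave on the fd */
		send(fd, page[i], PAGE_SIZE,
		     i < npages - 1 ? MSG_MORE : 0);
	pthread_mutex_unlock(&fd_lock);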

Things look worse from the point of view of ll_rw_block, which lacks
any concept of (a) a file descriptor, or (b) a non-reorderable stream
of atomic requests.  ll_rw_block coalesces in any order it chooses, so
its coalescing function is a _lot_ more complex than hooking the next
page onto a linked list.  

Once the queue size grows non-trivial, adding a new request can become
quite expensive (even with only one item on the request queue at once,
make_request is still by far the biggest cost on a kernel profile
running raw IO).  If you've got a 32-page IO to send, sending it in
chunks means either merging 32 times into that queue when you could
have just done it once, or holding off all merging until you're told
to unplug: but with multiple clients, you just encounter the lack of
caller context again, and each client can unplug the other before its
time.

I realise these are apples and oranges to some extent, because
ll_rw_block doesn't accept a file descriptor: the place where we _do_
use file descriptors, block_write(), could be doing some of this if
the requests were coming from an application.

However, that doesn't address the fact that we have got raw devices
and filesystems such as XFS already generating large multi-page block
IO requests and having to cram them down the thin pipe which is
ll_rw_block, and the MSG_MORE flag doesn't seem capable of extending
to ll_rw_block sufficiently well.

I guess it comes down to this: what problem are we trying to fix?  If
it's strictly limited to sendfile/writev and related calls, then
you've convinced me that page-by-page MSG_MORE can work if you add a
bit of locking, but that locking is by itself nasty.  

Think about O_DIRECT to a database file.  We get a write() call,
locate the physical pages through unspecified magic, and fire off a
series of page or partial-page writes to the O_DIRECT fd.  If we are
coalescing these via MSG_MORE, then we have to keep the fd locked for
write until we've processed the whole IO (including any page faults
that result).  The filesystem --- which is what understands the
concept of a file descriptor --- can merge these together into another
request, but we'd just have to split that request into chunks again to
send them to ll_rw_block.

We may also have things like software raid layers in the write path.
That's the motivation for having an object capable of describing
multi-page IOs --- it lets us pass the desired IO chunks down through
the filesystem, virtual block devices and physical block devices,
without any context being required and without having to
decompose/merge at each layer.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  8:37                                 ` Andrew Morton
@ 2001-01-10 23:32                                   ` Linus Torvalds
  2001-01-19 15:55                                     ` Andrew Scott
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2001-01-10 23:32 UTC (permalink / raw)
  To: linux-kernel

In article <3A5C1F64.99C611F2@uow.edu.au>,
Andrew Morton  <andrewm@uow.edu.au> wrote:
>Linus Torvalds wrote:
>> 
>> De gustibus non disputandum.
>
>http://cogprints.soton.ac.uk/documents/disk0/00/00/07/57/
>
>	"ingestion of the afterbirth during delivery"
>
>eh?
>
>
>http://www.degustibus.co.uk/
>
>	"Award winning artisan breadmakers."
>
>Ah.  That'll be it.

Latin 101. Literally "about taste no argument".

I suspect that it _should_ be "De gustibus non disputandum est", but
it's been too many years. That adds the required verb ("is") to make it
a full sentence. 

In English: "There is no arguing taste".

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 19:14               ` Linus Torvalds
  2001-01-09 20:07                 ` Ingo Molnar
@ 2001-01-12  1:42                 ` Stephen C. Tweedie
  1 sibling, 0 replies; 119+ messages in thread
From: Stephen C. Tweedie @ 2001-01-12  1:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Stephen Tweedie

Hi,

On Tue, Jan 09, 2001 at 11:14:54AM -0800, Linus Torvalds wrote:
> In article <20010109122810.A3115@caldera.de>,
> 
> kiobufs are crap. Face it. They do NOT allow proper multi-page scatter
> gather, regardless of what the kiobuf PR department has said.

It's not surprising, since they were designed to solve a totally
different problem.

Kiobufs were always intended to represent logical buffers --- a virtual
address range from some process, or a region of a cached file.  The
purpose behind them was, if you remember, to allow something like
map_user_kiobuf() to produce a list of physical pages from the user VA
range.

This works exactly as intended.  The raw IO device driver may build a
kiobuf to represent a user VA range, and the XFS filesystem may build
one for its pagebuf abstraction to represent a range within a file in
the page cache.  The lower level IO routines just don't care where the
buffers came from.
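
In sketch form, using the current helpers (error handling trimmed;
va/len stand for the user-space range):

	struct kiobuf *iobuf;

	alloc_kiovec(1, &iobuf);		/* one logical buffer */
	map_user_kiobuf(READ, iobuf, va, len);	/* pin the user pages */

	/* iobuf->maplist[] now holds the physical pages: hand the
	 * kiobuf to the low-level code, which doesn't care where the
	 * memory came from */

	unmap_kiobuf(iobuf);
	free_kiovec(1, &iobuf);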

There are still problems here --- the encoding of block addresses in
the list, dealing with a stack of completion events if you push these
buffers down through various layers of logical block device such as
raid/lvm, carving requests up and merging them if you get requests
which span a raid or LVM stripe, for example.  Kiobufs don't solve
those, but neither do skfrags, and neither does the MSG_MORE concept.

If you want a scatter-gather list capable of taking individual
buffer_heads and merging them, then sure, kiobufs won't do the trick
as they stand now: they were never intended to.  The whole point of
kiobufs was to encapsulate one single buffer in the higher layers, and
to allow lower layers to work on that buffer without caring where the
memory came from.  

But adding the sub-page sg lists is a simple extension.  I've got a
number of raw IO fixes pending, and we've just traced the source of
the last problem that was holding it up, so if you want I'll add the
per-page offset/length with those. 

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 19:54                                   ` Andrea Arcangeli
  2001-01-09 20:10                                     ` Ingo Molnar
  2001-01-09 20:12                                     ` Jens Axboe
@ 2001-01-17  5:16                                     ` Rik van Riel
  2 siblings, 0 replies; 119+ messages in thread
From: Rik van Riel @ 2001-01-17  5:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Jens Axboe, Alan Cox, Stephen C. Tweedie,
	Christoph Hellwig, David S. Miller, netdev, linux-kernel

On Tue, 9 Jan 2001, Andrea Arcangeli wrote:

> BTW, I noticed what is left in blk-13B seems to be my work

Yeah yeah, we'll buy you beer at the next conference... ;)

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  7:42                             ` Christoph Hellwig
  2001-01-10  8:05                               ` Linus Torvalds
@ 2001-01-17 14:05                               ` Rik van Riel
  2001-01-18  0:53                                 ` Christoph Hellwig
  1 sibling, 1 reply; 119+ messages in thread
From: Rik van Riel @ 2001-01-17 14:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Linus Torvalds, mingo, linux-kernel

On Wed, 10 Jan 2001, Christoph Hellwig wrote:

> Simple.  Because I stated before that I DON'T even want the
> networking to use kiobufs in lower layers.  My whole argument is
> to pass a kiovec into the fileop instead of a page, because it
> makes sense for other drivers to use multiple pages,

Now wouldn't it be great if we had one type of data
structure that would work for both the network layer
and the block layer (and v4l, ...)  ?

If we constantly need to convert between zerocopy
metadata types, I'm sure we'll lose most of the performance
gain we started this whole idea for in the first place.

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-17 14:05                               ` Rik van Riel
@ 2001-01-18  0:53                                 ` Christoph Hellwig
  2001-01-18  1:13                                   ` Linus Torvalds
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-18  0:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Christoph Hellwig, Linus Torvalds, mingo, linux-kernel

On Thu, Jan 18, 2001 at 01:05:43AM +1100, Rik van Riel wrote:
> On Wed, 10 Jan 2001, Christoph Hellwig wrote:
> 
> > Simple.  Because I stated before that I DON'T even want the
> > networking to use kiobufs in lower layers.  My whole argument is
> > to pass a kiovec into the fileop instead of a page, because it
> > makes sense for other drivers to use multiple pages,
> 
> Now wouldn't it be great if we had one type of data
> structure that would work for both the network layer
> and the block layer (and v4l, ...)  ?

Sure it would be nice, and IIRC that was what the kiobuf stuff was
designed for.  But it looks like it doesn't do well for the networking
(and maybe other) guys.

That means we have to find something that might be worth paying a little
overhead for in all layers, but that on the other hand is usable everywhere.

So after the last flame^H^H^H^H^Hthread I've come up with the
following structures:

/*
 * a simple page,offset,length tuple like Linus wants it
 */
struct kiobuf2 {
	struct page *   page;   /* The page itself               */
	u_int16_t       offset; /* Offset to start of valid data */
	u_int16_t       length; /* Number of valid bytes of data */
};

/*
 * A container for the tuples - it is actually pretty similar to the old
 * kiobuf, but on the other hand allows SG
 */
struct kiovec2 {
	int             nbufs;          /* Kiobufs actually referenced */
	int             array_len;      /* Space in the allocated lists */

	struct kiobuf * bufs;

	unsigned int    locked : 1;     /* If set, pages have been locked */

	/* Always embed enough struct pages for 64k of IO */
	struct kiobuf * buf_array[KIO_STATIC_PAGES];	 

	/* Private data */
	void *          private;
	
	/* Dynamic state for IO completion: */
	atomic_t        io_count;       /* IOs still in progress */
	int             errno;

	/* Status of completed IO */
	void (*end_io)	(struct kiovec *); /* Completion callback */
	wait_queue_head_t wait_queue;
};


We don't need the page-length/offset in the usual block-io path, but on
the other hand, if we get a common interface for it...

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-18  0:53                                 ` Christoph Hellwig
@ 2001-01-18  1:13                                   ` Linus Torvalds
  2001-01-18 17:50                                     ` Christoph Hellwig
  2001-01-18 21:12                                     ` Albert D. Cahalan
  0 siblings, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2001-01-18  1:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Rik van Riel, mingo, linux-kernel



On Thu, 18 Jan 2001, Christoph Hellwig wrote:
> 
> /*
>  * a simple page,offset,length tuple like Linus wants it
>  */
> struct kiobuf2 {
> 	struct page *   page;   /* The page itself               */
> 	u_int16_t       offset; /* Offset to start of valid data */
> 	u_int16_t       length; /* Number of valid bytes of data */
> };

Please use "u16". Or "__u16" if you want to export it to user space.

> struct kiovec2 {
> 	int             nbufs;          /* Kiobufs actually referenced */
> 	int             array_len;      /* Space in the allocated lists */
> 	struct kiobuf * bufs;

Any reason for array_len?

Why not just 

	int nbufs,
	struct kiobuf *bufs;


Remember: simplicity is a virtue. 

Simplicity is also what makes it usable for people who do NOT want to have
huge overhead.

> 	unsigned int    locked : 1;     /* If set, pages have been locked */

Remove this. I don't think it's valid to lock the pages. Who wants to use
this anyway?

> 	/* Always embed enough struct pages for 64k of IO */
> 	struct kiobuf * buf_array[KIO_STATIC_PAGES];	 

Kill kill kill kill. 

If somebody wants to embed a kiovec into their own data structure, THEY
can decide to add their own buffers etc. A fundamental data structure
should _never_ make assumptions like this.

> 	/* Private data */
> 	void *          private;
> 	
> 	/* Dynamic state for IO completion: */
> 	atomic_t        io_count;       /* IOs still in progress */

What is io_count used for?

> 	int             errno;
> 
> 	/* Status of completed IO */
> 	void (*end_io)	(struct kiovec *); /* Completion callback */
> 	wait_queue_head_t wait_queue;

I suspect all of the above ("private", "end_io" etc) should be at a higher
layer. Not everybody will necessarily need them.

Remember: if this is to be well designed, we want to have the data
structures to pass down to low-level drivers etc, that may not want or
need a lot of high-level stuff. You should not pass down more than the
driver really needs.

In the end, the only thing you _know_ a driver will need (assuming that it
wants these kinds of buffers) is just

	int nbufs;
	struct biobuf *bufs;

That's kind of the minimal set. That should be one level of abstraction in
its own right. 

Never over-design. Never think "Hmm, maybe somebody would find this
useful". Start from what you know people _have_ to have, and try to make
that set smaller. When you can make it no smaller, you've reached one
point. That's a good point to start from - use that for some real
implementation.

Once you've gotten that far, you can see how well you can embed the lower
layers into higher layers. That does _not_ mean that the lower layers
should know about the high-level data structures. Try to avoid pushing
down abstractions too far. Maybe you'll want to push down the error code.
But maybe not. And you should NOT link the callback with the vector of
IO's: you may find (in fact, I bet you _will_ find), that the lowest level
will want a callback to call up to when it is ready, and that layer may
want _another_ callback to call up to higher levels.

Imagine, for example, the network driver telling the IP layer that "ok,
packet sent". That's _NOT_ the same callback as the TCP layer telling the
upper layers that the packet data has been sent and successfully
acknowledged, and that the data structures can be free'd now. They are at
two completely different levels of abstraction, and one level needing
something doesn't mean that the other level should necessarily even care.
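
In sketch form (hypothetical structures, purely to illustrate the
layering):

	struct ll_io {				/* what the driver sees  */
		int nbufs;
		struct kiobuf *bufs;
		void (*ll_done)(struct ll_io *);/* driver -> IP layer    */
	};

	struct hl_io {				/* what the caller sees  */
		struct ll_io ll;
		void (*hl_done)(struct hl_io *);/* TCP -> upper layers   */
	};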

Don't imagine that everybody wants the same data structure, and that that
data structure should thus be very generic. Genericity kills good ideas.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-18  1:13                                   ` Linus Torvalds
@ 2001-01-18 17:50                                     ` Christoph Hellwig
  2001-01-18 18:04                                       ` Linus Torvalds
  2001-01-18 21:12                                     ` Albert D. Cahalan
  1 sibling, 1 reply; 119+ messages in thread
From: Christoph Hellwig @ 2001-01-18 17:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, mingo, linux-kernel, kiobuf-io-devel

On Wed, Jan 17, 2001 at 05:13:31PM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 18 Jan 2001, Christoph Hellwig wrote:
> > 
> > /*
> >  * a simple page,offset,length tuple like Linus wants it
> >  */
> > struct kiobuf2 {
> > 	struct page *   page;   /* The page itself               */
> > 	u_int16_t       offset; /* Offset to start of valid data */
> > 	u_int16_t       length; /* Number of valid bytes of data */
> > };
> 
> Please use "u16". Or "__u16" if you want to export it to user space.

Ok.


> 
> > struct kiovec2 {
> > 	int             nbufs;          /* Kiobufs actually referenced */
> > 	int             array_len;      /* Space in the allocated lists */
> > 	struct kiobuf * bufs;
> 
> Any reason for array_len?

It's useful for the expand function - but with kiobufs as a secondary data
structure it may no longer be necessary.

> Why not just 
> 
> 	int nbufs,
> 	struct kiobuf *bufs;
> 
> 
> Remember: simplicity is a virtue. 
> 
> Simplicity is also what makes it usable for people who do NOT want to have
> huge overhead.
> 
> > 	unsigned int    locked : 1;     /* If set, pages have been locked */
> 
> Remove this. I don't think it's valid to lock the pages. Who wants to use
> this anyway?

E.g. in the block IO paths the pages have to be locked.
It's also used by free_kiovec to see whether to do unlock_kiovec first.

> 
> > 	/* Always embed enough struct pages for 64k of IO */
> > 	struct kiobuf * buf_array[KIO_STATIC_PAGES];	 
> 
> Kill kill kill kill. 
> 
> If somebody wants to embed a kiovec into their own data structure, THEY
> can decide to add their own buffers etc. A fundamental data structure
> should _never_ make assumptions like this.

Ok.

> 
> > 	/* Private data */
> > 	void *          private;
> > 	
> > 	/* Dynamic state for IO completion: */
> > 	atomic_t        io_count;       /* IOs still in progress */
> 
> What is io_count used for?

In the current buffer_head based IO-scheme it is used to determine whether
all bh requests are finished.  It's obsolete once we pass kiobufs to the
low-level drivers.

> 
> > 	int             errno;
> > 
> > 	/* Status of completed IO */
> > 	void (*end_io)	(struct kiovec *); /* Completion callback */
> > 	wait_queue_head_t wait_queue;
> 
> I suspect all of the above ("private", "end_io" etc) should be at a higher
> layer. Not everybody will necessarily need them.
> 
> Remember: if this is to be well designed, we want to have the data
> structures to pass down to low-level drivers etc, that may not want or
> need a lot of high-level stuff. You should not pass down more than the
> driver really needs.
> 
> In the end, the only thing you _know_ a driver will need (assuming that it
> wants these kinds of buffers) is just
> 
> 	int nbufs;
> 	struct biobuf *bufs;
> 
> That's kind of the minimal set. That should be one level of abstraction in
> its own right. 

Ok. Then we need an additional more or less generic object that is used for
passing in a rw_kiovec file operation (and we really want that for many kinds
of IO). It should mostly be used for communicating with the high-level driver.

/*
 * the name is just plain stupid, but that shouldn't matter
 */
struct vfs_kiovec {
	struct kiovec *	iov;

	/* private data, mostly for the callback */
	void * private;

	/* completion callback */
 	void (*end_io)	(struct vfs_kiovec *);
 	wait_queue_head_t wait_queue;
};

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-18 17:50                                     ` Christoph Hellwig
@ 2001-01-18 18:04                                       ` Linus Torvalds
  0 siblings, 0 replies; 119+ messages in thread
From: Linus Torvalds @ 2001-01-18 18:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Rik van Riel, mingo, linux-kernel, kiobuf-io-devel



On Thu, 18 Jan 2001, Christoph Hellwig wrote:
> > 
> > Remove this. I don't think it's valid to lock the pages. Who wants to use
> > this anyway?
> 
> E.g. in the block IO pathes the pages have to be locked.
> It's also used by free_kiovec to see wether to do unlock_kiovec before.

This is all MUCH higher level functionality, and probably bogus anyway.

> > That's kind of the minimal set. That should be one level of abstraction in
> > its own right. 
> 
> Ok. Then we need an additional more or less generic object that is used for
> passing in a rw_kiovec file operation (and we really want that for many kinds
> of IO). It should mostly be used for communicating with the high-level driver.

That's fine.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-18  1:13                                   ` Linus Torvalds
  2001-01-18 17:50                                     ` Christoph Hellwig
@ 2001-01-18 21:12                                     ` Albert D. Cahalan
  2001-01-19  1:52                                       ` 2.4.1-pre8 video/ohci1394 compile problem ebi4
  2001-01-19  6:55                                       ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Linus Torvalds
  1 sibling, 2 replies; 119+ messages in thread
From: Albert D. Cahalan @ 2001-01-18 21:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christoph Hellwig, Rik van Riel, mingo, linux-kernel

>> struct kiovec2 {
>> 	int             nbufs;          /* Kiobufs actually referenced */
>> 	int             array_len;      /* Space in the allocated lists */
>> 	struct kiobuf * bufs;
>
> Any reason for array_len?
>
> Why not just 
> 
> 	int nbufs,
> 	struct kiobuf *bufs;
>
> Remember: simplicity is a virtue. 
>
> Simplicity is also what makes it usable for people who do NOT want to have
> huge overhead.
>
>> 	unsigned int    locked : 1;     /* If set, pages have been locked */
>
> Remove this. I don't think it's valid to lock the pages. Who wants to use
> this anyway?
>
>> 	/* Always embed enough struct pages for 64k of IO */
>> 	struct kiobuf * buf_array[KIO_STATIC_PAGES];	 
>
> Kill kill kill kill. 
>
> If somebody wants to embed a kiovec into their own data structure, THEY
> can decide to add their own buffers etc. A fundamental data structure
> should _never_ make assumptions like this.

What about getting rid of both that and the pointer, and just
hanging that data on the end as a variable length array?

struct kiovec2{
  int nbufs;
  /* ... */
  struct kiobuf bufs[0];
};
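
Allocation then collapses to a single kmalloc (a sketch; n is
however many kiobufs the caller wants):

  struct kiovec2 *kv = kmalloc(sizeof(*kv) + n * sizeof(struct kiobuf),
                               GFP_KERNEL);
  if (kv)
      kv->nbufs = n;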
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* 2.4.1-pre8 video/ohci1394 compile problem
  2001-01-18 21:12                                     ` Albert D. Cahalan
@ 2001-01-19  1:52                                       ` ebi4
  2001-01-19  6:55                                       ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Linus Torvalds
  1 sibling, 0 replies; 119+ messages in thread
From: ebi4 @ 2001-01-19  1:52 UTC (permalink / raw)
  To: linux-kernel

video1394.o(.data+0x0): multiple definition of `ohci_csr_rom'
ohci1394.o(.data+0x0): first defined here
make[3]: *** [ieee1394drv.o] Error 1

Compilation fails here.

::::: Gene Imes			     http://www.ozob.net :::::

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-18 21:12                                     ` Albert D. Cahalan
  2001-01-19  1:52                                       ` 2.4.1-pre8 video/ohci1394 compile problem ebi4
@ 2001-01-19  6:55                                       ` Linus Torvalds
  1 sibling, 0 replies; 119+ messages in thread
From: Linus Torvalds @ 2001-01-19  6:55 UTC (permalink / raw)
  To: linux-kernel

In article <200101182112.f0ILCmZ113705@saturn.cs.uml.edu>,
Albert D. Cahalan <acahalan@cs.uml.edu> wrote:
>
>What about getting rid of both that and the pointer, and just
>hanging that data on the end as a variable length array?
>
>struct kiovec2{
>  int nbufs;
>  /* ... */
>  struct kiobuf bufs[0];
>};

If the struct ends up having lots of other fields, yes.

On the other hand, if one basic form of kiobuf's ends up being really
just the array and the number of elements, there are reasons not to do
this. One is that you can "peel" off parts of the buffer, and split it
up if (for example) your driver has some limitation to the number of
scatter-gather requests it can make. For example, you may have code that
looks roughly like

	.. int nr, struct kiobuf *buf ..

	while (nr > MAX_SEGMENTS) {
		lower_level(MAX_SEGMENTS, buf);
		nr -= MAX_SEGMENTS;
		buf += MAX_SEGMENTS;
	}
	lower_level(nr, buf);

which is rather awkward to do if you tie "nr" and the array too closely
together. 

(Of course, the driver could just split them up - take it from the
structure and pass them down in the separated manner. I don't know which
level the separation is worth doing at, but I have this feeling that if
the structure ends up being _only_ the nbufs and bufs, they should not
be tied together.)

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10 23:32                                   ` Linus Torvalds
@ 2001-01-19 15:55                                     ` Andrew Scott
  0 siblings, 0 replies; 119+ messages in thread
From: Andrew Scott @ 2001-01-19 15:55 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

On 10 Jan 2001, at 15:32, Linus Torvalds wrote:

> Latin 101. Literally "about taste no argument".

Or "about taste no argument there is" if you add the 'est', which 
still makes sense in English, in a twisted (convoluted as opposed to
'bad' or 'sick') way.

Q.E.D.

> I suspect that it _should_ be "De gustibus non disputandum est", but
> it's been too many years. That adds the required verb ("is") to make it
> a full sentence. 
> 
> In English: "There is no arguing taste".
> 
> 		Linus


------------------Mailed via Pegasus 3.12c & Mercury 1.48---------------
A.J.Scott@casdn.neu.edu                    Fax (617)373-2942
Andrew Scott                               Tel (617)373-5278   _
Northeastern University--138 Meserve Hall                     / \   /
College of Arts & Sciences-Deans Office                      / \ \ /
Boston, Ma. 02115                                           /   \_/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  8:41 Manfred Spraul
  2001-01-10  8:31 ` David S. Miller
  2001-01-10 11:25 ` Ingo Molnar
@ 2001-01-13 15:43 ` yodaiken
  2 siblings, 0 replies; 119+ messages in thread
From: yodaiken @ 2001-01-13 15:43 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: mingo, linux-kernel


FWIW: POSIX mq_send does not promise that the buffer is safe to reuse, it
only promises that the message is queued. Interesting interface.



On Wed, Jan 10, 2001 at 09:41:24AM +0100, Manfred Spraul wrote:
> > > In user space, how do you know when its safe to reuse the buffer that 
> > > was handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg() 
> > > with that flag block until the buffer isn't needed by the kernel any 
> > > more? If it does block, doesn't that defeat the use of non-blocking 
> > > I/O? 
> > 
> > sendmsg() marks those pages COW and copies the original page into a new 
> > one for further usage. (the old page is used until the packet is 
> > released.) So for maximum performance user-space should not reuse such 
> > buffers immediately. 
> >
> That means sendmsg() changes the page tables? I measured
> smp_call_function on my Dual Pentium 350, and it took around 1950 cpu
> ticks.
> I'm sure that for an 8 way server the total lost time on all cpus (multi
> threaded server) is larger than the time required to copy the complete
> page.
> (I've attached my patch, just run "insmod dummy p_shift=0")
> 
> 
> --
> 	Manfred
> --- 2.4/drivers/net/dummy.c	Mon Dec  4 02:45:22 2000
> +++ build-2.4/drivers/net/dummy.c	Wed Jan 10 09:15:20 2001
> @@ -95,9 +95,168 @@
>  
>  static struct net_device dev_dummy;
>  
> +/* ************************************* */
> +int p_shift = -1;
> +MODULE_PARM     (p_shift, "1i");
> +MODULE_PARM_DESC(p_shift, "Shift for the profile buffer");
> +
> +int p_size = 0;
> +MODULE_PARM     (p_size, "1i");
> +MODULE_PARM_DESC(p_size, "size");
> +
> +
> +#define STAT_TABLELEN		16384
> +static unsigned long totals[STAT_TABLELEN];
> +static unsigned int overflows;
> +
> +static unsigned long long stime;
> +static void start_measure(void)
> +{
> +	 __asm__ __volatile__ (
> +		".align 64\n\t"
> +	 	"pushal\n\t"
> +		"cpuid\n\t"
> +		"popal\n\t"
> +		"rdtsc\n\t"
> +		"movl %%eax,(%0)\n\t"
> +		"movl %%edx,4(%0)\n\t"
> +		: /* no output */
> +		: "c"(&stime)
> +		: "eax", "edx", "memory" );
> +}
> +
> +static void end_measure(void)
> +{
> +static unsigned long long etime;
> +	__asm__ __volatile__ (
> +		"pushal\n\t"
> +		"cpuid\n\t"
> +		"popal\n\t"
> +		"rdtsc\n\t"
> +		"movl %%eax,(%0)\n\t"
> +		"movl %%edx,4(%0)\n\t"
> +		: /* no output */
> +		: "c"(&etime)
> +		: "eax", "edx", "memory" );
> +	{
> +		unsigned long time = (unsigned long)(etime-stime);
> +		time >>= p_shift;
> +		if(time < STAT_TABLELEN) {
> +			totals[time]++;
> +		} else {
> +			overflows++;
> +		}
> +	}
> +}
> +
> +static void clean_buf(void)
> +{
> +	memset(totals,0,sizeof(totals));
> +	overflows = 0;
> +}
> +
> +static void print_line(unsigned long* array)
> +{
> +	int i;
> +	for(i=0;i<32;i++) {
> +		if((i%32)==16)
> +			printk(":");
> +		printk("%lx ",array[i]); 
> +	}
> +}
> +
> +static void print_buf(char* caption)
> +{
> +	int i, other = 0;
> +	printk("Results - %s - shift %d",
> +		caption, p_shift);
> +
> +	for(i=0;i<STAT_TABLELEN;i+=32) {
> +		int j;
> +		int local = 0;
> +		for(j=0;j<32;j++)
> +			local += totals[i+j];
> +
> +		if(local) {
> +			printk("\n%3x: ",i);
> +			print_line(&totals[i]);
> +			other += local;
> +		}
> +	}
> +	printk("\nOverflows: %d.\n",
> +		overflows);
> +	printk("Sum: %ld\n",other+overflows);
> +}
> +
> +static void return_immediately(void* dummy)
> +{
> +	return;
> +}
> +
> +static void just_one_page(void* dummy)
> +{
> +	__flush_tlb_one(0x12345678);
> +	return;
> +}
> +
> +
>  static int __init dummy_init_module(void)
>  {
>  	int err;
> +
> +	if(p_shift != -1) {
> +		int i;
> +		void* p;
> +		kmem_cache_t* cachep;
> +		/* empty test measurement: */
> +		printk("******** kernel cpu benchmark started **********\n");
> +		clean_buf();
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		schedule_timeout(200);
> +		for(i=0;i<100;i++) {
> +			start_measure();
> +			return_immediately(NULL);
> +			return_immediately(NULL);
> +			return_immediately(NULL);
> +			return_immediately(NULL);
> +			end_measure();
> +		}
> +		print_buf("zero");
> +		clean_buf();
> +
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		schedule_timeout(200);
> +		for(i=0;i<100;i++) {
> +			start_measure();
> +			return_immediately(NULL);
> +			return_immediately(NULL);
> +			smp_call_function(return_immediately,NULL,
> +						1, 1);
> +			return_immediately(NULL);
> +			return_immediately(NULL);
> +			end_measure();
> +		}
> +		print_buf("empty smp_call_function()");
> +		clean_buf();
> +
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		schedule_timeout(200);
> +		for(i=0;i<100;i++) {
> +			start_measure();
> +			return_immediately(NULL);
> +			return_immediately(NULL);
> +			smp_call_function(just_one_page,NULL,
> +						1, 1);
> +			just_one_page(NULL);
> +			return_immediately(NULL);
> +			return_immediately(NULL);
> +			end_measure();
> +		}
> +		print_buf("flush_one_page()");
> +		clean_buf();	
> +
> +		return -EINVAL;
> +	}
>  
>  	dev_dummy.init = dummy_init;
>  	SET_MODULE_OWNER(&dev_dummy);
> 


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10 12:07     ` Ingo Molnar
@ 2001-01-10 16:18       ` Jamie Lokier
  0 siblings, 0 replies; 119+ messages in thread
From: Jamie Lokier @ 2001-01-10 16:18 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Manfred Spraul, linux-kernel

Ingo Molnar wrote:
> > > well, this is a performance problem if you are using threads. For normal
> > > processes there is no need for a SMP cross-call, there TLB flushes are
> > > local only.
> > >
> > But that would be ugly as hell:
> > so apache 2.0 would become slower with MSG_NOCOPY, whereas samba 2.2
> > would become faster.
> 
> there *is* a cost of having a shared VM - and this is i suspect
> unavoidable.

Is it possible to avoid the SMP cross-call in the case that the other
threads have neither accessed nor dirtied the page in question?

One way to implement this is to share VMs but not the page tables, or to
share parts of the page tables that don't contain writable pages.

Just a sudden inspired thought...  I don't know if it is possible or
worthwhile.

enjoy,
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10 12:03   ` Manfred Spraul
@ 2001-01-10 12:07     ` Ingo Molnar
  2001-01-10 16:18       ` Jamie Lokier
  0 siblings, 1 reply; 119+ messages in thread
From: Ingo Molnar @ 2001-01-10 12:07 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel


On Wed, 10 Jan 2001, Manfred Spraul wrote:

> > well, this is a performance problem if you are using threads. For normal
> > processes there is no need for a SMP cross-call, there TLB flushes are
> > local only.
> >
> But that would be ugly as hell:
> so apache 2.0 would become slower with MSG_NOCOPY, whereas samba 2.2
> would become faster.

there *is* a cost of having a shared VM - and this is i suspect
unavoidable.

> Is it possible to move the responsibility for maintaining the copy to
> the caller?

this needs a completion event i believe.

> e.g. use msg_control, and then the caller can request either that a
> signal is sent when that data is transferred, or that a variable is set
> to 0.

i believe a signal-based thing would be the right (and scalable) solution
- the signal handler could free() the buffer.

this makes sense even in the VM-assisted MSG_NOCOPY case, since one wants
to do garbage collection of these in-flight buffers anyway. (not for
correctness but for performance reasons - free()-ing and immediately
reusing such a buffer would generate a COW.)
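
a hypothetical sketch of the user-space side (no such ABI exists, the
signal delivery is pure assumption - and note free() is not
async-signal-safe, so the handler should only flag the buffer):

	static volatile sig_atomic_t buf_busy = 1;

	static void tx_done(int sig)
	{
		buf_busy = 0;	/* main loop may now free()/reuse it */
	}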

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10 11:25 ` Ingo Molnar
@ 2001-01-10 12:03   ` Manfred Spraul
  2001-01-10 12:07     ` Ingo Molnar
  0 siblings, 1 reply; 119+ messages in thread
From: Manfred Spraul @ 2001-01-10 12:03 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel

Ingo Molnar wrote:
> 
> On Wed, 10 Jan 2001, Manfred Spraul wrote:
> 
> > That means sendmsg() changes the page tables? I measured
> > smp_call_function on my Dual Pentium 350, and it took around 1950 cpu
> > ticks.
> 
> well, this is a performance problem if you are using threads. For normal
> processes there is no need for a SMP cross-call, there TLB flushes are
> local only.
> 
But that would be ugly as hell:
so apache 2.0 would become slower with MSG_NOCOPY, whereas samba 2.2
would become faster.

Is it possible to move the responsibility for maintaining the copy to the
caller?

e.g. use msg_control, and then the caller can request either that a
signal is sent when that data is transferred, or that a variable is set
to 0.

--
	Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  8:41 Manfred Spraul
  2001-01-10  8:31 ` David S. Miller
@ 2001-01-10 11:25 ` Ingo Molnar
  2001-01-10 12:03   ` Manfred Spraul
  2001-01-13 15:43 ` yodaiken
  2 siblings, 1 reply; 119+ messages in thread
From: Ingo Molnar @ 2001-01-10 11:25 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel


On Wed, 10 Jan 2001, Manfred Spraul wrote:

> That means sendmsg() changes the page tables? I measured
> smp_call_function on my Dual Pentium 350, and it took around 1950 cpu
> ticks.

well, this is a performance problem if you are using threads. For normal
processes there is no need for a SMP cross-call, there TLB flushes are
local only.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
@ 2001-01-10  8:41 Manfred Spraul
  2001-01-10  8:31 ` David S. Miller
                   ` (2 more replies)
  0 siblings, 3 replies; 119+ messages in thread
From: Manfred Spraul @ 2001-01-10  8:41 UTC (permalink / raw)
  To: mingo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 908 bytes --]

> > In user space, how do you know when its safe to reuse the buffer that 
> > was handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg() 
> > with that flag block until the buffer isn't needed by the kernel any 
> > more? If it does block, doesn't that defeat the use of non-blocking 
> > I/O? 
> 
> sendmsg() marks those pages COW and copies the original page into a new 
> one for further usage. (the old page is used until the packet is 
> released.) So for maximum performance user-space should not reuse such 
> buffers immediately. 
>
That means sendmsg() changes the page tables? I measured
smp_call_function on my Dual Pentium 350, and it took around 1950 cpu
ticks.
I'm sure that for an 8 way server the total lost time on all cpus (multi
threaded server) is larger than the time required to copy the complete
page.
(I've attached my patch, just run "insmod dummy p_shift=0")


--
	Manfred

[-- Attachment #2: patch-newperf --]
[-- Type: text/plain, Size: 3500 bytes --]

--- 2.4/drivers/net/dummy.c	Mon Dec  4 02:45:22 2000
+++ build-2.4/drivers/net/dummy.c	Wed Jan 10 09:15:20 2001
@@ -95,9 +95,168 @@
 
 static struct net_device dev_dummy;
 
+/* ************************************* */
+int p_shift = -1;
+MODULE_PARM     (p_shift, "1i");
+MODULE_PARM_DESC(p_shift, "Shift for the profile buffer");
+
+int p_size = 0;
+MODULE_PARM     (p_size, "1i");
+MODULE_PARM_DESC(p_size, "size");
+
+
+#define STAT_TABLELEN		16384
+static unsigned long totals[STAT_TABLELEN];
+static unsigned int overflows;
+
+static unsigned long long stime;
+static void start_measure(void)
+{
+	 __asm__ __volatile__ (
+		".align 64\n\t"
+	 	"pushal\n\t"
+		"cpuid\n\t"
+		"popal\n\t"
+		"rdtsc\n\t"
+		"movl %%eax,(%0)\n\t"
+		"movl %%edx,4(%0)\n\t"
+		: /* no output */
+		: "c"(&stime)
+		: "eax", "edx", "memory" );
+}
+
+static void end_measure(void)
+{
+static unsigned long long etime;
+	__asm__ __volatile__ (
+		"pushal\n\t"
+		"cpuid\n\t"
+		"popal\n\t"
+		"rdtsc\n\t"
+		"movl %%eax,(%0)\n\t"
+		"movl %%edx,4(%0)\n\t"
+		: /* no output */
+		: "c"(&etime)
+		: "eax", "edx", "memory" );
+	{
+		unsigned long time = (unsigned long)(etime-stime);
+		time >>= p_shift;
+		if(time < STAT_TABLELEN) {
+			totals[time]++;
+		} else {
+			overflows++;
+		}
+	}
+}
+
+static void clean_buf(void)
+{
+	memset(totals,0,sizeof(totals));
+	overflows = 0;
+}
+
+static void print_line(unsigned long* array)
+{
+	int i;
+	for(i=0;i<32;i++) {
+		if((i%32)==16)
+			printk(":");
+		printk("%lx ",array[i]); 
+	}
+}
+
+static void print_buf(char* caption)
+{
+	int i, other = 0;
+	printk("Results - %s - shift %d",
+		caption, p_shift);
+
+	for(i=0;i<STAT_TABLELEN;i+=32) {
+		int j;
+		int local = 0;
+		for(j=0;j<32;j++)
+			local += totals[i+j];
+
+		if(local) {
+			printk("\n%3x: ",i);
+			print_line(&totals[i]);
+			other += local;
+		}
+	}
+	printk("\nOverflows: %d.\n",
+		overflows);
+	printk("Sum: %ld\n",other+overflows);
+}
+
+static void return_immediately(void* dummy)
+{
+	return;
+}
+
+static void just_one_page(void* dummy)
+{
+	__flush_tlb_one(0x12345678);
+	return;
+}
+
+
 static int __init dummy_init_module(void)
 {
 	int err;
+
+	if(p_shift != -1) {
+		int i;
+		void* p;
+		kmem_cache_t* cachep;
+		/* empty test measurement: */
+		printk("******** kernel cpu benchmark started **********\n");
+		clean_buf();
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(200);
+		for(i=0;i<100;i++) {
+			start_measure();
+			return_immediately(NULL);
+			return_immediately(NULL);
+			return_immediately(NULL);
+			return_immediately(NULL);
+			end_measure();
+		}
+		print_buf("zero");
+		clean_buf();
+
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(200);
+		for(i=0;i<100;i++) {
+			start_measure();
+			return_immediately(NULL);
+			return_immediately(NULL);
+			smp_call_function(return_immediately,NULL,
+						1, 1);
+			return_immediately(NULL);
+			return_immediately(NULL);
+			end_measure();
+		}
+		print_buf("empty smp_call_function()");
+		clean_buf();
+
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(200);
+		for(i=0;i<100;i++) {
+			start_measure();
+			return_immediately(NULL);
+			return_immediately(NULL);
+			smp_call_function(just_one_page,NULL,
+						1, 1);
+			just_one_page(NULL);
+			return_immediately(NULL);
+			return_immediately(NULL);
+			end_measure();
+		}
+		print_buf("flush_one_page()");
+		clean_buf();	
+
+		return -EINVAL;
+	}
 
 	dev_dummy.init = dummy_init;
 	SET_MODULE_OWNER(&dev_dummy);


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-10  8:41 Manfred Spraul
@ 2001-01-10  8:31 ` David S. Miller
  2001-01-10 11:25 ` Ingo Molnar
  2001-01-13 15:43 ` yodaiken
  2 siblings, 0 replies; 119+ messages in thread
From: David S. Miller @ 2001-01-10  8:31 UTC (permalink / raw)
  To: manfred; +Cc: mingo, linux-kernel

   Date: 	Wed, 10 Jan 2001 09:41:24 +0100
   From: Manfred Spraul <manfred@colorfullife.com>

   That means sendmsg() changes the page tables?

Not in the zerocopy patch I am proposing and asking people to test.  I
stated in another email that MSG_NOCOPY was considered experimental
and thus left out of my patches.

   I measured smp_call_function on my Dual Pentium 350, and it took
   around 1950 cpu ticks.

And this is one of several reasons why the MSG_NOCOPY facility is
considered experimental.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 22:59         ` Ingo Molnar
  2001-01-09 23:11           ` Dan Hollis
@ 2001-01-10  3:24           ` Chris Wedgwood
  1 sibling, 0 replies; 119+ messages in thread
From: Chris Wedgwood @ 2001-01-10  3:24 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Dan Hollis, David S. Miller, stephenl, linux-kernel

On Tue, Jan 09, 2001 at 11:59:13PM +0100, Ingo Molnar wrote:

    it's a bad name in that case. We dont 'send any file' if we in
    fact are receiving a data stream from a socket and writing it
    into a file :-)

so rename it -- user-space can retain the old name for compatibility
and add a new, more sensible name


  --cw
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 22:59         ` Ingo Molnar
@ 2001-01-09 23:11           ` Dan Hollis
  2001-01-10  3:24           ` Chris Wedgwood
  1 sibling, 0 replies; 119+ messages in thread
From: Dan Hollis @ 2001-01-09 23:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: David S. Miller, stephenl, linux-kernel

On Tue, 9 Jan 2001, Ingo Molnar wrote:
> On Tue, 9 Jan 2001, Dan Hollis wrote:
> > > This is not what senfile() does, it sends (to a network socket) a
> > > file (from the page cache), nothing more.
> > Ok in any case, it would be nice to have a generic sendfile() which works
> > on any fd's - socket or otherwise.
> it's a bad name in that case. We dont 'send any file' if we in fact are
> receiving a data stream from a socket and writing it into a file :-)

So we should have different system calls just so one can handle socket
and one can handle disk fd? :P

Ok so now we'll have a special-case sendfile() for each different kind
of fd.

To connect socket-socket we can call it electrician() and to connect
pipe-pipe we can call it plumber() [1].

:P :b :P :b

-Dan

[1] Yes, Alex Belits, I know i've now stolen your joke...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 22:58       ` Dan Hollis
@ 2001-01-09 22:59         ` Ingo Molnar
  2001-01-09 23:11           ` Dan Hollis
  2001-01-10  3:24           ` Chris Wedgwood
  0 siblings, 2 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 22:59 UTC (permalink / raw)
  To: Dan Hollis; +Cc: David S. Miller, stephenl, linux-kernel


On Tue, 9 Jan 2001, Dan Hollis wrote:

> > This is not what sendfile() does, it sends (to a network socket) a
> > file (from the page cache), nothing more.
>
> Ok in any case, it would be nice to have a generic sendfile() which works
> on any fd's - socket or otherwise.

it's a bad name in that case. We dont 'send any file' if we in fact are
receiving a data stream from a socket and writing it into a file :-)

(i think Pavel raised this issue before.)

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 22:03     ` David S. Miller
@ 2001-01-09 22:58       ` Dan Hollis
  2001-01-09 22:59         ` Ingo Molnar
  0 siblings, 1 reply; 119+ messages in thread
From: Dan Hollis @ 2001-01-09 22:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: mingo, stephenl, linux-kernel

On Tue, 9 Jan 2001, David S. Miller wrote:
>    Just extend sendfile to allow any fd to any fd. sendfile already
>    does file->socket and file->file. It only needs to be extended to
>    do socket->file.
> This is not what sendfile() does, it sends (to a network socket) a
> file (from the page cache), nothing more.

Ok in any case, it would be nice to have a generic sendfile() which works
on any fd's - socket or otherwise.

What sort of sendfile() behaviour is defined with select()? Can it be
asynchronous?

-Dan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 19:14   ` Dan Hollis
@ 2001-01-09 22:03     ` David S. Miller
  2001-01-09 22:58       ` Dan Hollis
  0 siblings, 1 reply; 119+ messages in thread
From: David S. Miller @ 2001-01-09 22:03 UTC (permalink / raw)
  To: goemon; +Cc: mingo, stephenl, linux-kernel

   Date: 	Tue, 9 Jan 2001 11:14:05 -0800 (PST)
   From: Dan Hollis <goemon@anime.net>

   Just extend sendfile to allow any fd to any fd. sendfile already
   does file->socket and file->file. It only needs to be extended to
   do socket->file.

This is not what sendfile() does, it sends (to a network socket) a
file (from the page cache), nothing more.

Later,
David S. Miller
davem@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 13:47   ` Andrew Morton
@ 2001-01-09 19:15     ` Dan Hollis
  0 siblings, 0 replies; 119+ messages in thread
From: Dan Hollis @ 2001-01-09 19:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, Stephen Landamore, linux-kernel

On Wed, 10 Jan 2001, Andrew Morton wrote:
> y'know our pals have patented it?
> http://www.delphion.com/details?pn=US05845280__

Bad faith patent? Actionable, treble damages?

-Dan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 13:24 ` Ingo Molnar
  2001-01-09 13:47   ` Andrew Morton
@ 2001-01-09 19:14   ` Dan Hollis
  2001-01-09 22:03     ` David S. Miller
  1 sibling, 1 reply; 119+ messages in thread
From: Dan Hollis @ 2001-01-09 19:14 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Stephen Landamore, linux-kernel

On Tue, 9 Jan 2001, Ingo Molnar wrote:
> :-) I think sendfile() should also have its logical extensions:
> receivefile(). I dont know how the HPUX implementation works, but in
> Linux, right now it's only possible to sendfile() from a file to a socket.
> The logical extension of this is to allow socket->file IO and file->file,
> socket->socket IO as well. (the latter one could be interesting for things
> like web proxies.)

Just extend sendfile to allow any fd to any fd. sendfile already does
file->socket and file->file. It only needs to be extended to do
socket->file.

-Dan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
@ 2001-01-09 17:46 Manfred Spraul
  0 siblings, 0 replies; 119+ messages in thread
From: Manfred Spraul @ 2001-01-09 17:46 UTC (permalink / raw)
  To: sct, mingo; +Cc: linux-kernel

sct wrote:
> We've already got measurements showing how insane this is. Raw IO 
> requests, plus internal pagebuf contiguous requests from XFS, have to 
> get broken down into page-sized chunks by the current ll_rw_block() 
> API, only to get reassembled by the make_request code. It's 
> *enormous* overhead, and the kiobuf-based disk IO code demonstrates 
> this clearly. 

Stephen, I see one big difference between ll_rw_block and the proposed
tcp_sendpage():
You must allocate and initialize a complete buffer head for each page
you want to read, and then you pass the array of buffer heads to
ll_rw_block with one function call.
I'm certain the overhead is the allocation/initialization/freeing of the
buffer heads, not the function call.

AFAICS the proposed tcp_sendpage interface is the other way around:
you need one function call for each page, but no memory
allocation/setup. The memory is allocated internally by the tcp_sendpage
implementation, and it merges requests when possible, thus for a 9000
byte jumbo packet you'd need 3 function calls to tcp_sendpage(MSG_MORE),
but only one skb is allocated and set up.
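
I.e. roughly this (a sketch - the exact signature is my assumption
from the patch):

	/* pg[0..2] back the 9000 byte datagram */
	tcp_sendpage(sock, pg[0], 0, PAGE_SIZE, MSG_MORE);
	tcp_sendpage(sock, pg[1], 0, PAGE_SIZE, MSG_MORE);
	tcp_sendpage(sock, pg[2], 0, 9000 - 2 * PAGE_SIZE, 0); /* flush */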

Ingo is that correct?

--
	Manfred

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 13:24 ` Ingo Molnar
@ 2001-01-09 13:47   ` Andrew Morton
  2001-01-09 19:15     ` Dan Hollis
  2001-01-09 19:14   ` Dan Hollis
  1 sibling, 1 reply; 119+ messages in thread
From: Andrew Morton @ 2001-01-09 13:47 UTC (permalink / raw)
  To: mingo; +Cc: Stephen Landamore, linux-kernel

Ingo Molnar wrote:
> 
> On Tue, 9 Jan 2001, Stephen Landamore wrote:
> 
> > >> Sure.  But sendfile is not one of the fundamental UNIX operations...
> 
> > > Neither were e.g. kernel-based semaphores. So what? Unix wasn't
> 
> > Ehh, that's not correct. HP-UX was the first to implement sendfile().
> 
> I don't think we disagree. What I was referring to was the 'original' Unix
> idea, the 30-year-old one, which did not include sendfile() :-) We never
> claimed that sendfile() first came up in Linux [that would be a blatant
> lie] - and the Linux API itself was indeed influenced by existing
> sendfile()/copyfile() interfaces. (At the time Linus implemented
> sendfile(), several similar interfaces already existed.)
> 

Y'know our pals have patented it?

http://www.delphion.com/details?pn=US05845280__
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
  2001-01-09 13:08 Stephen Landamore
@ 2001-01-09 13:24 ` Ingo Molnar
  2001-01-09 13:47   ` Andrew Morton
  2001-01-09 19:14   ` Dan Hollis
  0 siblings, 2 replies; 119+ messages in thread
From: Ingo Molnar @ 2001-01-09 13:24 UTC (permalink / raw)
  To: Stephen Landamore; +Cc: linux-kernel


On Tue, 9 Jan 2001, Stephen Landamore wrote:

> >> Sure.  But sendfile is not one of the fundamental UNIX operations...

> > Neither were e.g. kernel-based semaphores. So what? Unix wasn't

> Ehh, that's not correct. HP-UX was the first to implement sendfile().

I don't think we disagree. What I was referring to was the 'original' Unix
idea, the 30-year-old one, which did not include sendfile() :-) We never
claimed that sendfile() first came up in Linux [that would be a blatant
lie] - and the Linux API itself was indeed influenced by existing
sendfile()/copyfile() interfaces. (At the time Linus implemented
sendfile(), several similar interfaces already existed.)

> For the record, sendfile() exists because we (Zeus) asked HP for it.

good move :-) [honestly.]

> (So of course we agree that sendfile is important!)

:-) I think sendfile() should also have its logical extensions:
receivefile(). I don't know how the HP-UX implementation works, but in
Linux, right now it's only possible to sendfile() from a file to a socket.
The logical extension of this is to allow socket->file IO and file->file,
socket->socket IO as well. (the latter one could be interesting for things
like web proxies.)
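
A hypothetical prototype, mirroring sendfile(2) -- purely illustrative,
no such syscall exists:

    /* drain a socket into a file with no interim user-space buffer */
    ssize_t receivefile(int out_fd, int in_sock, off_t *offset,
                        size_t count);

    /* e.g. an upload server or proxy might then do: */
    off_t off = 0;
    while (bytes_left > 0) {
            ssize_t n = receivefile(file_fd, client_sock, &off,
                                    bytes_left);
            if (n <= 0)
                    break;
            bytes_left -= n;
    }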

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
@ 2001-01-09 13:08 Stephen Landamore
  2001-01-09 13:24 ` Ingo Molnar
  0 siblings, 1 reply; 119+ messages in thread
From: Stephen Landamore @ 2001-01-09 13:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo

Ingo Molnar wrote:
> On Tue, 9 Jan 2001, Christoph Hellwig wrote:
>
>> Sure.  But sendfile is not one of the fundamental UNIX operations...
>
> Neither were e.g. kernel-based semaphores. So what? Unix wasn't
> perfect and isn't perfect - but it was a (very) good starting
> point. If you are arguing against the existence or importance of
> sendfile() you should re-think: sendfile() is a unique (and
> important) interface because it enables moving information between
> files (streams) without involving any interim user-space memory
> buffer. No original Unix API did this AFAIK, so we obviously had to
> add it. It's an important Linux API category.

Ehh, that's not correct. HP-UX was the first to implement sendfile();
Linux (and various commercial Unices) then copied the idea...

For the record, sendfile() exists because we (Zeus) asked HP for
it. (So of course we agree that sendfile is important!)
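
(To make the quoted point concrete: a classic server copy loop moves
every byte through a user-space buffer,

    char buf[65536];                    /* interim user-space buffer */
    ssize_t n;
    while ((n = read(file_fd, buf, sizeof buf)) > 0)
            write(sock_fd, buf, n);     /* second copy of the data   */

whereas a single sendfile(sock_fd, file_fd, NULL, count) keeps the data
entirely inside the kernel. Sketch only; short writes and error
handling omitted.)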

Regards,
Stephen

--
Stephen Landamore, <slandamore@zeus.com>              Zeus Technology
Tel: +44 1223 525000                      Universally Serving the Net
Fax: +44 1223 525100                              http://www.zeus.com
Zeus Technology, Zeus House, Cowley Road, Cambridge, CB4 0ZT, ENGLAND

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 119+ messages in thread

end of thread, other threads:[~2001-01-19 15:56 UTC | newest]

Thread overview: 119+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-01-08 21:56 [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Jes Sorensen
2001-01-08  1:24 ` David S. Miller
2001-01-08 10:39   ` Christoph Hellwig
2001-01-08 10:34     ` David S. Miller
2001-01-08 18:05       ` Rik van Riel
2001-01-08 21:07         ` David S. Miller
2001-01-09 10:23         ` Ingo Molnar
2001-01-09 10:31           ` David S. Miller
2001-01-09 11:28             ` Christoph Hellwig
2001-01-09 12:04               ` Ingo Molnar
2001-01-09 14:25                 ` Stephen C. Tweedie
2001-01-09 14:33                   ` Alan Cox
2001-01-09 15:00                   ` Ingo Molnar
2001-01-09 15:27                     ` Stephen C. Tweedie
2001-01-09 16:16                       ` Ingo Molnar
2001-01-09 16:37                         ` Alan Cox
2001-01-09 16:48                           ` Ingo Molnar
2001-01-09 17:29                             ` Alan Cox
2001-01-09 17:38                               ` Jens Axboe
2001-01-09 18:38                                 ` Ingo Molnar
2001-01-09 19:54                                   ` Andrea Arcangeli
2001-01-09 20:10                                     ` Ingo Molnar
2001-01-10  0:00                                       ` Andrea Arcangeli
2001-01-09 20:12                                     ` Jens Axboe
2001-01-09 23:20                                       ` Andrea Arcangeli
2001-01-09 23:34                                         ` Jens Axboe
2001-01-09 23:52                                           ` Andrea Arcangeli
2001-01-17  5:16                                     ` Rik van Riel
2001-01-09 17:56                             ` Chris Evans
2001-01-09 18:41                               ` Ingo Molnar
2001-01-09 22:58                                 ` [patch]: ac4 blk (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) Jens Axboe
2001-01-09 19:20                           ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 J Sloan
2001-01-09 18:10                         ` Stephen C. Tweedie
2001-01-09 15:38                     ` Benjamin C.R. LaHaise
2001-01-09 16:40                       ` Ingo Molnar
2001-01-09 17:30                         ` Benjamin C.R. LaHaise
2001-01-09 18:12                           ` Stephen C. Tweedie
2001-01-09 18:35                           ` Ingo Molnar
2001-01-09 17:53                       ` Christoph Hellwig
2001-01-09 21:13                 ` David S. Miller
2001-01-09 19:14               ` Linus Torvalds
2001-01-09 20:07                 ` Ingo Molnar
2001-01-09 20:15                   ` Linus Torvalds
2001-01-09 20:36                     ` Christoph Hellwig
2001-01-09 20:55                       ` Linus Torvalds
2001-01-09 21:12                         ` Christoph Hellwig
2001-01-09 21:26                           ` Linus Torvalds
2001-01-10  7:42                             ` Christoph Hellwig
2001-01-10  8:05                               ` Linus Torvalds
2001-01-10  8:33                                 ` Christoph Hellwig
2001-01-10  8:37                                 ` Andrew Morton
2001-01-10 23:32                                   ` Linus Torvalds
2001-01-19 15:55                                     ` Andrew Scott
2001-01-17 14:05                               ` Rik van Riel
2001-01-18  0:53                                 ` Christoph Hellwig
2001-01-18  1:13                                   ` Linus Torvalds
2001-01-18 17:50                                     ` Christoph Hellwig
2001-01-18 18:04                                       ` Linus Torvalds
2001-01-18 21:12                                     ` Albert D. Cahalan
2001-01-19  1:52                                       ` 2.4.1-pre8 video/ohci1394 compile problem ebi4
2001-01-19  6:55                                       ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Linus Torvalds
2001-01-09 23:06                         ` Benjamin C.R. LaHaise
2001-01-09 23:54                           ` Linus Torvalds
2001-01-10  7:51                             ` Gerd Knorr
2001-01-12  1:42                 ` Stephen C. Tweedie
2001-01-09 11:42             ` David S. Miller
2001-01-09 10:31           ` Christoph Hellwig
2001-01-09 11:05             ` Ingo Molnar
2001-01-09 18:27               ` Christoph Hellwig
2001-01-09 19:19                 ` Ingo Molnar
2001-01-09 14:18           ` Stephen C. Tweedie
2001-01-09 14:40             ` Ingo Molnar
2001-01-09 14:51               ` Alan Cox
2001-01-09 15:17               ` Stephen C. Tweedie
2001-01-09 15:37                 ` Ingo Molnar
2001-01-09 22:25                 ` Linus Torvalds
2001-01-10 15:21                   ` Stephen C. Tweedie
2001-01-09 15:25               ` Stephen Frost
2001-01-09 15:40                 ` Ingo Molnar
2001-01-09 15:48                   ` Stephen Frost
2001-01-10  1:14                   ` Dave Zarzycki
2001-01-10  1:14                     ` David S. Miller
2001-01-10  2:18                       ` Dave Zarzycki
2001-01-10  1:19                     ` Ingo Molnar
2001-01-09 21:18               ` David S. Miller
2001-01-10  2:56           ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) dean gaudet
2001-01-10  2:58             ` David S. Miller
2001-01-10  3:18               ` dean gaudet
2001-01-10  3:09                 ` David S. Miller
2001-01-10  3:05             ` storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, Alan Cox
2001-01-08 21:48   ` [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 David S. Miller
2001-01-08 22:32     ` Jes Sorensen
2001-01-08 22:37       ` David S. Miller
2001-01-08 22:43       ` Stephen Frost
2001-01-08 22:36     ` David S. Miller
2001-01-09 12:12       ` Ingo Molnar
2001-01-09 13:42   ` David S. Miller
2001-01-09 21:19     ` David S. Miller
2001-01-09 13:52   ` Trond Myklebust
2001-01-09 15:27     ` Trond Myklebust
2001-01-10  9:21       ` Trond Myklebust
2001-01-09 13:08 Stephen Landamore
2001-01-09 13:24 ` Ingo Molnar
2001-01-09 13:47   ` Andrew Morton
2001-01-09 19:15     ` Dan Hollis
2001-01-09 19:14   ` Dan Hollis
2001-01-09 22:03     ` David S. Miller
2001-01-09 22:58       ` Dan Hollis
2001-01-09 22:59         ` Ingo Molnar
2001-01-09 23:11           ` Dan Hollis
2001-01-10  3:24           ` Chris Wedgwood
2001-01-09 17:46 Manfred Spraul
2001-01-10  8:41 Manfred Spraul
2001-01-10  8:31 ` David S. Miller
2001-01-10 11:25 ` Ingo Molnar
2001-01-10 12:03   ` Manfred Spraul
2001-01-10 12:07     ` Ingo Molnar
2001-01-10 16:18       ` Jamie Lokier
2001-01-13 15:43 ` yodaiken
