linux-kernel.vger.kernel.org archive mirror
* [Question] Explanation of zero-copy networking
@ 2001-05-07 13:43 Alexander Eichhorn
  2001-05-07 13:56 ` Alan Cox
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Eichhorn @ 2001-05-07 13:43 UTC (permalink / raw)
  To: linux-kernel

Hi all,

we are currently developing (as part of my dissertation)
a research platform for studying some new ideas in
constructing transport systems that support applications
with realtime requirements (e.g. multimedia) and new
networking technologies. The test platform consists of
typical multimedia elements, such as sources, filters,
sinks and transport modules, which can be distributed
across a set of computers.

To follow the principle of sparing resource usage - which
we consider fundamental for multimedia systems - we are
looking for new (already implemented or planned) mechanisms to
avoid copying the data streams where possible (device I/O,
especially network I/O, and user-to-user IPC).

That's why I'd like to ask if one of the net-core developers
could give us a (more or less - depending on what you've
documented so far) detailed description of the newly 
implemented zero-copy mechanisms in the network-stack. 
We are interested in how to use it (changed network-API?) 
and also in the internal architecture. 

We have already had a look at the kernel mailing list
archives and some search engines, but all we found
were fragments of the puzzle. Before digging into
the source code, we are trying this way to get an overall description.


Our second question: Are there any plans yet for constructing
a general copy-avoidance infrastructure (something like UVM in
NetBSD) and new IPC mechanisms on top of it?


Thanks in advance.

Alexander Eichhorn


-- 
Alexander Eichhorn
Technical University of Ilmenau
Computer Science And Automation Faculty
Distributed Systems and Operating Systems Department
Phone +49 3677 69 4557, Fax  +49 3677 69 4541


* Re: [Question] Explanation of zero-copy networking
  2001-05-07 13:43 [Question] Explanation of zero-copy networking Alexander Eichhorn
@ 2001-05-07 13:56 ` Alan Cox
  2001-05-07 16:12   ` Richard B. Johnson
  2001-05-07 18:21   ` dean gaudet
  0 siblings, 2 replies; 18+ messages in thread
From: Alan Cox @ 2001-05-07 13:56 UTC (permalink / raw)
  To: alexander.eichhorn; +Cc: linux-kernel

> documented so far) detailed description of the newly 
> implemented zero-copy mechanisms in the network-stack. 
> We are interested in how to use it (changed network-API?) 
> and also in the internal architecture. 

It is built around sendfile. Trying to do zero copy on pages with user-space
mappings gets so horribly unpretty that it is better to build the API from the
physical side of things.
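
A minimal sketch of that sendfile() path, assuming an already connected
TCP socket and trimming most error handling:

/* Sketch: push a whole file to a connected TCP socket via sendfile(2),
 * letting the kernel hand page-cache pages to the NIC without a
 * user-space copy. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int send_whole_file(int sock, const char *path)
{
        struct stat st;
        off_t off = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }
        while (off < st.st_size) {
                ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
                if (n <= 0)
                        break;          /* error or peer went away */
        }
        close(fd);
        return off == st.st_size ? 0 : -1;
}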

> Our second question: Are there any plans yet for constructing
> a general copy-avoidance infrastructure (something like UVM in
> NetBSD) and new IPC mechanisms on top of it?

Andrea Arcangeli has O_DIRECT file I/O for the ext2 file system. Several
patches for kiovec-based single-copy pipes have been posted too.
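
A rough sketch of the user-space side of O_DIRECT, assuming the flag is
visible in your headers; the 4096-byte alignment is an assumption for
illustration, the real requirement depends on the filesystem and device:

/* Sketch: read a file with O_DIRECT, bypassing the page cache.  The
 * alignment value of 4096 is an assumption for illustration only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static int read_direct(const char *path, size_t len)
{
        void *buf;
        ssize_t n;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
                return -1;
        /* buffer and length must both respect the alignment rules */
        if (posix_memalign(&buf, 4096, len)) {
                close(fd);
                return -1;
        }
        n = read(fd, buf, len);
        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
}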




* Re: [Question] Explanation of zero-copy networking
  2001-05-07 13:56 ` Alan Cox
@ 2001-05-07 16:12   ` Richard B. Johnson
  2001-05-07 17:53     ` Francois Romieu
                       ` (3 more replies)
  2001-05-07 18:21   ` dean gaudet
  1 sibling, 4 replies; 18+ messages in thread
From: Richard B. Johnson @ 2001-05-07 16:12 UTC (permalink / raw)
  To: Alan Cox; +Cc: alexander.eichhorn, linux-kernel

On Mon, 7 May 2001, Alan Cox wrote:

> > documented so far) detailed description of the newly 
> > implemented zero-copy mechanisms in the network-stack. 
> > We are interested in how to use it (changed network-API?) 
> > and also in the internal architecture. 
> 
> It is built around sendfile. Trying to do zero copy on pages with user-space
> mappings gets so horribly unpretty that it is better to build the API from the
> physical side of things.
> 
> Our second question: Are there any plans yet for constructing
> a general copy-avoidance infrastructure (something like UVM in
> NetBSD) and new IPC mechanisms on top of it?
> 
> Andrea Arcangeli has O_DIRECT file I/O for the ext2 file system. Several
> patches for kiovec-based single-copy pipes have been posted too.
> 
> 

The Networking RFCs talk about "not copying data" as they
attempt to give pointers on improving network speed.

However, PCI to memory copying runs at about 300 megabytes per
second on modern PCs and memory to memory copying runs at over 1,000
megabytes per second. In the future, these speeds will increase.

I don't advise retrofitting network code to improve the speed of
older machines. Instead, time should be spent on improving the
robustness and capability of the networking code and on accommodating
the new breed of GHz network boards.

In case anybody is interested, networking remains a serial communications
element. As such, it functions as a low-pass filter. The speed of
a serial communications link is set primarily by the dominant pole
of the link's transfer function, which, in the frequency domain, is
information_rate * 2. With a 100 megabits/second link we have
200 MHz as the dominant pole. The 2 comes from Shannon: it takes
2 carrier events to determine whether anything has changed (to transfer
information).  Therefore, if we can detect changes 100 million times
per second, the information carrier must have been at least 200 MHz.
This is the dominant pole.

With a 300 megabyte/second transfer via PCI, the information carrier
must have been 300 * 8 * 2 = 4,800 MHz. This is 4,800/200 = 24 times
the frequency of the dominant pole of the network transfer function.
This is so far removed from the dominant pole of the system's transfer
function that even doubling the PCI speed (66 MHz vs. 33 MHz) will
have no measurable effect upon networking speed. With existing kernels,
you can perform network speed tests using "lo", removing the network
board from the speed test. You will note that the network speed, due
to software, is over 10 times faster (30 times on some machines) than
when the hardware I/O is used. This shows that the network code, alone,
cannot be improved very much to provide an improvement in throughput.
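
For anyone who wants to repeat that lo-versus-NIC comparison, here is a
crude, self-contained sketch of a one-way TCP throughput test (port,
buffer and transfer sizes are arbitrary, and it only measures how fast
write() can push data into the stack):

/* Crude one-way TCP throughput test, for comparing "lo" against a real
 * NIC.  Port, buffer size and transfer size are arbitrary choices.
 *
 *   tcpbw recv <port>             - drain incoming data
 *   tcpbw send <ip> <port> <MB>   - send <MB> megabytes, print MB/s
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
        static char buf[65536];
        struct sockaddr_in a;
        int s = socket(AF_INET, SOCK_STREAM, 0);

        memset(&a, 0, sizeof(a));
        a.sin_family = AF_INET;

        if (argc == 3 && !strcmp(argv[1], "recv")) {
                int one = 1, c;

                a.sin_addr.s_addr = htonl(INADDR_ANY);
                a.sin_port = htons(atoi(argv[2]));
                setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
                if (bind(s, (struct sockaddr *)&a, sizeof(a)) < 0)
                        return 1;
                listen(s, 1);
                c = accept(s, NULL, NULL);
                while (read(c, buf, sizeof(buf)) > 0)
                        ;                       /* just drain the data */
        } else if (argc == 5 && !strcmp(argv[1], "send")) {
                long long total = (long long)atoi(argv[4]) * 1024 * 1024;
                long long sent = 0;
                double t0;

                a.sin_addr.s_addr = inet_addr(argv[2]);
                a.sin_port = htons(atoi(argv[3]));
                if (connect(s, (struct sockaddr *)&a, sizeof(a)) < 0)
                        return 1;
                t0 = now();
                while (sent < total) {
                        ssize_t n = write(s, buf, sizeof(buf));
                        if (n <= 0)
                                return 1;
                        sent += n;
                }
                printf("%.1f MB/s\n",
                       sent / (1024.0 * 1024.0) / (now() - t0));
        }
        return 0;
}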

However, a new breed of GHz boards is now available. These boards
have a dominant pole of 1000 * 2 = 2000 MHz. This is roughly one-half
of the PCI bandwidth, and roughly the same as a 66 MHz bus.

This is where some work needs to be done.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.




* Re: [Question] Explanation of zero-copy networking
  2001-05-07 16:12   ` Richard B. Johnson
@ 2001-05-07 17:53     ` Francois Romieu
  2001-05-07 18:00       ` Blue Lang
  2001-05-07 18:25     ` dean gaudet
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Francois Romieu @ 2001-05-07 17:53 UTC (permalink / raw)
  To: alexander.eichhorn; +Cc: linux-kernel

Richard B. Johnson <root@chaos.analogic.com> ecrit :
[...]
> when the hardware I/O is used. This shows that the network code, alone,
> cannot be improved very much to provide an improvement in throughput.

It shows that cached code performs well with ~0us latency device/memory.

Networking is about latency and pps too. They both dramatically reduce
the (axe-)evaluated bandwidth.

-- 
Ueimor


* Re: [Question] Explanation of zero-copy networking
  2001-05-07 17:53     ` Francois Romieu
@ 2001-05-07 18:00       ` Blue Lang
  0 siblings, 0 replies; 18+ messages in thread
From: Blue Lang @ 2001-05-07 18:00 UTC (permalink / raw)
  To: Francois Romieu; +Cc: alexander.eichhorn, linux-kernel

On Mon, 7 May 2001, Francois Romieu wrote:

> It shows that cached code performs well with ~0us latency device/memory.
>
> Networking is about latency and pps too. They both dramatically reduce
> the (axe-)evaluated bandwidth.

I think his point is more along the lines of return on investment.  You
can tweak Linux to move from 9 MB/sec to 9.5 MB/sec on a 100 Mb link, or you
can spend those same developer cycles getting much larger returns out of
much sexier hardware.

Now, who's gonna supply us with those NICs? ;)

-- 
   Blue Lang                                    http://www.gator.net/~blue
   Unix Administrator                                     Veritas Software
   2315 McMullan Circle, Raleigh, North Carolina, USA         919 835 1540



* Re: [Question] Explanation of zero-copy networking
  2001-05-07 13:56 ` Alan Cox
  2001-05-07 16:12   ` Richard B. Johnson
@ 2001-05-07 18:21   ` dean gaudet
  2001-05-07 21:59     ` Alan Cox
  1 sibling, 1 reply; 18+ messages in thread
From: dean gaudet @ 2001-05-07 18:21 UTC (permalink / raw)
  To: Alan Cox; +Cc: alexander.eichhorn, linux-kernel

On Mon, 7 May 2001, Alan Cox wrote:

> > documented so far) detailed description of the newly
> > implemented zero-copy mechanisms in the network-stack.
> > We are interested in how to use it (changed network-API?)
> > and also in the internal architecture.
>
> It is built around sendfile. Trying to do zero copy on pages with user-space
> mappings gets so horribly unpretty that it is better to build the API from the
> physical side of things.

so there's still single copy for write() of a mmap()ed page?

since i'm naive about the high-end databases -- do they have a mechanism
to access zero-copy?  i suppose sendfile() on a raw device fd would
work... nice.

-dean



* Re: [Question] Explanation of zero-copy networking
  2001-05-07 16:12   ` Richard B. Johnson
  2001-05-07 17:53     ` Francois Romieu
@ 2001-05-07 18:25     ` dean gaudet
  2001-05-07 19:54       ` Richard B. Johnson
  2001-05-07 18:30     ` Pekka Pietikainen
  2001-05-08  7:18     ` Jamie Lokier
  3 siblings, 1 reply; 18+ messages in thread
From: dean gaudet @ 2001-05-07 18:25 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Alan Cox, alexander.eichhorn, linux-kernel

On Mon, 7 May 2001, Richard B. Johnson wrote:

> when the hardware I/O is used. This shows that the network code, alone,
> cannot be improved very much to provide an improvement in throughput.

doesn't your analysis assume that we've got nothing else interesting to do
while doing the network i/o?  for example, i may want to do something else
which needs the memory bandwidth i'd otherwise spend on a single-copy...

-dean



* Re: [Question] Explanation of zero-copy networking
  2001-05-07 16:12   ` Richard B. Johnson
  2001-05-07 17:53     ` Francois Romieu
  2001-05-07 18:25     ` dean gaudet
@ 2001-05-07 18:30     ` Pekka Pietikainen
  2001-05-07 19:00       ` Venkatesh Ramamurthy
  2001-05-08  7:18     ` Jamie Lokier
  3 siblings, 1 reply; 18+ messages in thread
From: Pekka Pietikainen @ 2001-05-07 18:30 UTC (permalink / raw)
  To: linux-kernel

On Mon, May 07, 2001 at 12:12:57PM -0400, Richard B. Johnson wrote:
> you can perform network speed tests using "lo", removing the network
> board from the speed test. You will note that the network speed, due
> to software, is over 10 times faster (30 times on some machines) than
> when the hardware I/O is used. This shows that the network code, alone,
> cannot be improved very much to provide an improvement in throughput.
I'd say more like a factor of 2.

Socket bandwidth using localhost: 141.63 MB/sec
Socket bandwidth using 192.168.9.3: 74.79 MB/sec

(with the boxes being able to do ~100 MB/s when the receiver CPU/memory
bandwidth isn't limiting things). So I have slow PIII/500-class machines
with fast networking. You could rerun the test with your favourite
multi-gigabit network and the latest 1.7 GHz PC and still see a similar
ratio. Being on the bleeding edge isn't easy, and waiting a few years
for faster hardware isn't a solution for everyone ;)

Zero-copy mostly helps with CPU use (it will let your heavily loaded
server handle a lot more connections), not so much with bandwidth.
The receiver will still run into problems with the copy it has to do,
unless you use some very evil tricks like header splitting plus MMU games
or run protocols designed to be accelerated in hardware.

Not that zero-copy is problem-free. If your bus starts corrupting random
bits, there's no real way of noticing it, since the NIC happily
creates a correct TCP checksum over the corrupt data.
It's not like hardware engineers can be expected to design hardware
that works according to spec :)

Then there's the interrupt problem, which someone will have to solve
before they start shipping 10 GigE NICs that use 1500-byte frames: about
850,000 interrupts/s without mitigation. Wheeee!!!!
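
The back-of-envelope arithmetic behind that figure, assuming one
interrupt per MTU-sized frame at line rate and ignoring framing overhead
(it lands in the same ballpark as the 850,000/s above):

/* Back-of-envelope: interrupt rate of a 10 Gbit/s NIC raising one
 * interrupt per 1500-byte frame, framing overhead ignored. */
#include <stdio.h>

int main(void)
{
        double link_bps = 10e9;             /* 10 Gbit/s line rate   */
        double frame_bits = 1500.0 * 8;     /* one MTU-sized frame   */

        printf("%.0f interrupts/s\n", link_bps / frame_bits);  /* ~833333 */
        return 0;
}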

-- 
Pekka Pietikainen


* Re: [Question] Explanation of zero-copy networking
  2001-05-07 18:30     ` Pekka Pietikainen
@ 2001-05-07 19:00       ` Venkatesh Ramamurthy
  0 siblings, 0 replies; 18+ messages in thread
From: Venkatesh Ramamurthy @ 2001-05-07 19:00 UTC (permalink / raw)
  To: Pekka Pietikainen, linux-kernel

> Then there's the interrupt problem, which someone will have to solve
> before they start shipping 10gigE NICs that use 1500-byte frames, 850000
> interrupts/s without mitigation. Wheeee!!!!

In these situations polling helps more than interrupt-driven I/O. When there
is heavy I/O (read: more interrupts per second), we should automatically switch
to polling mode; once the I/O drops, we can go back to interrupt-driven mode.
But how do we decide when to switch modes?
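
A compile-only sketch of one way such a hybrid could look; every name and
threshold below (struct nic, nic_irq_disable() and so on) is hypothetical
and only illustrates the mode switch, not any existing kernel interface:

/* Hypothetical hybrid interrupt/polling receive path.  None of these
 * names exist in the kernel; they sketch the idea of switching modes
 * based on the observed interrupt rate. */

struct nic {
        unsigned long irqs_this_tick;   /* interrupts seen since last tick */
        int polling;                    /* 1 = polling mode, 0 = irq mode  */
        int idle_polls;                 /* consecutive polls with no work  */
};

#define IRQ_HIGH_WATER  4000    /* per tick: above this, switch to polling */
#define RX_IDLE_POLLS   16      /* this many empty polls: back to irqs     */

void nic_irq_disable(struct nic *n);            /* hardware-specific stubs */
void nic_irq_enable(struct nic *n);
int  nic_rx_poll(struct nic *n, int budget);    /* returns packets handled */

/* Interrupt handler: count interrupts, escalate to polling when busy. */
void nic_interrupt(struct nic *n)
{
        if (++n->irqs_this_tick > IRQ_HIGH_WATER && !n->polling) {
                nic_irq_disable(n);
                n->polling = 1;
        }
        nic_rx_poll(n, 64);
}

/* Called from a periodic timer; drains the ring while in polling mode
 * and drops back to interrupts once the traffic dies down. */
void nic_poll_tick(struct nic *n)
{
        n->irqs_this_tick = 0;
        if (!n->polling)
                return;

        if (nic_rx_poll(n, 256) == 0) {
                if (++n->idle_polls >= RX_IDLE_POLLS) {
                        n->idle_polls = 0;
                        n->polling = 0;
                        nic_irq_enable(n);
                }
        } else {
                n->idle_polls = 0;
        }
}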

Just my 2 cents .....




* Re: [Question] Explanation of zero-copy networking
  2001-05-07 18:25     ` dean gaudet
@ 2001-05-07 19:54       ` Richard B. Johnson
  2001-05-07 20:23         ` dean gaudet
                           ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Richard B. Johnson @ 2001-05-07 19:54 UTC (permalink / raw)
  To: dean gaudet; +Cc: Alan Cox, alexander.eichhorn, linux-kernel

On Mon, 7 May 2001, dean gaudet wrote:

> On Mon, 7 May 2001, Richard B. Johnson wrote:
> 
> > when the hardware I/O is used. This shows that the network code, alone,
> > cannot be improved very much to provide an improvement in throughput.
> 
> doesn't your analysis assume that we've got nothing else interesting to do
> while doing the network i/o?  for example, i may want to do something else
> which needs the memory bandwidth i'd otherwise spend on a single-copy...
> 
> -dean

Yes and no. It is assumed by most everybody that a single CPU
cycle saved in doing something is automatically available for
doing something else. This is never the case unless you have
a completely polled OS environment that is not doing I/O. In
any OS that preempts using a timer, CPU activity (actual work
being done) bunches up around that timer interval. The same
is true for interrupts. This happens because, to a single
measurement task, the CPU seems slower as it keeps getting
taken away. So, we end up with a lot of CPU activity bunched
up around interrupts and timer-ticks, with not much happening
elsewhere.

In Unix, a system call does not produce a context switch
unless the task is required to sleep while waiting for I/O.
So the kernel is going to send a packet to another host on
behalf of the caller. It copies the data (taking a partial
checksum), assembles the packet, finishes the checksum, then
sends it. The CPU is given to somebody else while waiting
for the packet to get somewhere and be ACKed. But think
about a server where EVERY task is waiting for I/O to
complete! The CPU cycles that you saved by eliminating
a copy (or two) are now wasted spinning.

Let's say the first packet got sent more quickly because of
the latency saved by eliminating that copy. After that, you
are still waiting for I/O.

Reduced to the limit, look at using zero CPU cycles to send
and receive packets. Now, with a server loaded to its natural
ability, i.e. bandwidth-limited by the round-trip loop
bandwidth, you still have all the tasks waiting for I/O to complete.

Basically, "no copy" is an academic exercise. It makes the first
packet get sent more quickly, after which everything slows to
the natural bandwidth of the system.

If you used a server for multicast-only.  In other words,  you
just spewed out unidirectional data, you still slow to the rate
at which the media can take the data.  And CPUs can obtain or
generate these data a lot faster than 100-base can sink them.

When we get to media that can sink data as fast as we can generate
them (it), then we have to worry about memory copy speed. However,
these new devices are actually an IP subsystem.  They generate and
receive entire datagrams. To fully utilize these devices, the data-
gram generation and reception (the basis of all TCP/IP networking)
will have to be moved out of the kernel and into these boards. The
kernel code will only handle interfaces, connections, and rules.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.




* Re: [Question] Explanation of zero-copy networking
  2001-05-07 19:54       ` Richard B. Johnson
@ 2001-05-07 20:23         ` dean gaudet
  2001-05-08 11:09         ` Bjorn Wesen
  2001-05-08 17:30         ` Alexander Eichhorn
  2 siblings, 0 replies; 18+ messages in thread
From: dean gaudet @ 2001-05-07 20:23 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Alan Cox, alexander.eichhorn, linux-kernel

On Mon, 7 May 2001, Richard B. Johnson wrote:

> When we get to media that can sink data as fast as we can generate
> them (it), then we have to worry about memory copy speed. However,
> these new devices are actually an IP subsystem.  They generate and
> receive entire datagrams. To fully utilize these devices, the data-
> gram generation and reception (the basis of all TCP/IP networking)
> will have to be moved out of the kernel and into these boards. The
> kernel code will only handle interfaces, connections, and rules.

heh, and then these things will be expensive, so few will buy them and
they'll remain in older process technologies (like .21u) because there's
no economy of scale, while CPUs jump ahead to fewer and fewer microns
(.13u, .10u), and in a moore's law doubling or so someone will come up
with the bright idea to move everything back to the CPU again and use
mostly dumb i/o devices.  (or they'll use a bunch of general purpose
computers clustered behind inexpensive switches to achieve the same
thing at a fraction of the cost.)

we've never seen this happen before!  :)

-dean



* Re: [Question] Explanation of zero-copy networking
  2001-05-07 18:21   ` dean gaudet
@ 2001-05-07 21:59     ` Alan Cox
  2001-05-08 16:20       ` Jamie Lokier
  0 siblings, 1 reply; 18+ messages in thread
From: Alan Cox @ 2001-05-07 21:59 UTC (permalink / raw)
  To: dean gaudet; +Cc: Alan Cox, alexander.eichhorn, linux-kernel

> so there's still single copy for write() of a mmap()ed page?

An mmap page will go direct to disk. But mmap() isn't a good model for
streaming I/O.


* Re: [Question] Explanation of zero-copy networking
  2001-05-07 16:12   ` Richard B. Johnson
                       ` (2 preceding siblings ...)
  2001-05-07 18:30     ` Pekka Pietikainen
@ 2001-05-08  7:18     ` Jamie Lokier
  2001-05-09 15:13       ` Eric W. Biederman
  3 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2001-05-08  7:18 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Alan Cox, alexander.eichhorn, linux-kernel

Richard B. Johnson wrote:
> However, PCI to memory copying runs at about 300 megabytes per
> second on modern PCs and memory to memory copying runs at over 1,000
> megabytes per second. In the future, these speeds will increase.

That would be "big expensive modern PCs" then.  Our clusters of 700MHz
boxes are strictly limited to 132 megabytes per second over PCI...

-- Jamie


* Re: [Question] Explanation of zero-copy networking
  2001-05-07 19:54       ` Richard B. Johnson
  2001-05-07 20:23         ` dean gaudet
@ 2001-05-08 11:09         ` Bjorn Wesen
  2001-05-08 17:30         ` Alexander Eichhorn
  2 siblings, 0 replies; 18+ messages in thread
From: Bjorn Wesen @ 2001-05-08 11:09 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: linux-kernel

On Mon, 7 May 2001, Richard B. Johnson wrote:
> Basically, "no copy" is an academic exercise. It makes the first
> packet get sent more quickly, after which everything slows to
> the natural bandwidth of the system.
> 
> If you used a server for multicast-only.  In other words,  you
> just spewed out unidirectional data, you still slow to the rate
> at which the media can take the data.  And CPUs can obtain or
> generate these data a lot faster than 100-base can sink them.

This is an awfully PC-centric way of putting things. You assume that the
only people who use Linux are those with a 1 GHz CPU, 66 MHz PCI
boards and whatever. You simply cannot make that assumption anymore; the
diversity of Linux hardware these days is so broad that the sweet spot between
CPU cycles, memory bandwidth, etc. which controls the code optimization
fluctuates wildly.

A simple kernel profile of one of our embedded Linux systems, for example,
shows csum_partial_copy limiting the performance. For us, zero-copy
cannot be implemented anyway because we don't have a checksumming Ethernet
controller, but if we had one, we could perhaps improve performance by 50% by
skipping the copy. And there definitely are no 1 GHz embedded CPUs in the
same price range to choose instead, or Rambus memories, etc. Raw power
simply is not an option sometimes.

It's still true of course that it's not obvious that the cycles spent on
copying can be used for anything better in all cases.

However, the beauty of open-source is that there is no need to debate over
whether something should be done or not. If someone feels the need, it
will be coded and if it's good people will use it. In this case, if anyone
gets a 200% boost in performance, they probably won't listen to the
argument that "it's academic" afterwards :) And some others might go
twiddle their hardware and skip the zero-copy mechanism altogether.

-BW



* Re: [Question] Explanation of zero-copy networking
  2001-05-07 21:59     ` Alan Cox
@ 2001-05-08 16:20       ` Jamie Lokier
  0 siblings, 0 replies; 18+ messages in thread
From: Jamie Lokier @ 2001-05-08 16:20 UTC (permalink / raw)
  To: Alan Cox; +Cc: dean gaudet, alexander.eichhorn, linux-kernel

Alan Cox wrote:
> > so there's still single copy for write() of a mmap()ed page?
> 
> An mmap page will go direct to disk.

Looking at the 2.4.4 code, mmap() of a file followed by write() to a socket
will copy the data once.

I could be mistaken (I only glanced at the code quickly), but I base that
on the only call to ->sendpage being made through sendfile.

So yes, there's a single-copy overhead for mmap()+write().
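
To make the two paths concrete, here is a sketch of both, assuming sock
and fd are already set up; per the above, the first variant costs one copy
into the socket on 2.4, the second is the one that avoids it:

/* Two ways to push a file at a socket. */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>

int push_mmap_write(int sock, int fd, size_t len)
{
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

        if (p == MAP_FAILED)
                return -1;
        /* data copied once into the socket here; a real sender would
         * also loop on short writes */
        ssize_t n = write(sock, p, len);
        munmap(p, len);
        return n == (ssize_t)len ? 0 : -1;
}

int push_sendfile(int sock, int fd, size_t len)
{
        off_t off = 0;

        while (off < (off_t)len)        /* pages handed over, no copy */
                if (sendfile(sock, fd, &off, len - off) <= 0)
                        return -1;
        return 0;
}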

-- Jamie


* Re: [Question] Explanation of zero-copy networking
  2001-05-07 19:54       ` Richard B. Johnson
  2001-05-07 20:23         ` dean gaudet
  2001-05-08 11:09         ` Bjorn Wesen
@ 2001-05-08 17:30         ` Alexander Eichhorn
  2001-05-09  9:56           ` Reto Baettig
  2 siblings, 1 reply; 18+ messages in thread
From: Alexander Eichhorn @ 2001-05-08 17:30 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

First of all, thanks for the (unexpectedly large) discussion and the hints!

Second: sorry for the multimedia-centric viewpoint, but I think
it is an important task for future operating system development
(or better: for a real-world OS like Linux) to have sophisticated
support for a _large diversity_ of application requirements, and
realtime/multimedia apps have been treated as second-class citizens for too long.


"Richard B. Johnson" wrote:

> So the kernel is going to send a packet to another host on
> behalf of the caller. It copies the data (taking a partial
> checksum), assembles the packet, finishes the checksum, then
> sends it. The CPU is given to somebody else while waiting
> for the packet to get somewhere and be ACKed. But think
> about a server where EVERY task is waiting for I/O to
> complete! The CPU cycles that you saved by eliminating
> a copy (or two) are now wasted spinning.
>
> Basically, "no copy" is an academic exercise. It makes the first
> packet get sent more quickly, after which everything slows to
> the natural bandwidth of the system.

This is the semantics of a typical client/server request/reply
protocol as used in "traditional" applications. But it isn't
appropriate for the communication of realtime media streams,
because it breaks their strict timing constraints. Here we need
asynchronous communication (*non-blocking semantics*).

> 
> If you used a server for multicast-only.  In other words,  you
> just spewed out unidirectional data, you still slow to the rate
> at which the media can take the data.  And CPUs can obtain or
> generate these data a lot faster than 100-base can sink them.
> 
> When we get to media that can sink data as fast as we can generate
> them (it), then we have to worry about memory copy speed. However,
> these new devices are actually an IP subsystem.  They generate and
> receive entire datagrams. To fully utilize these devices, the data-
> gram generation and reception (the basis of all TCP/IP networking)
> will have to be moved out of the kernel and into these boards. The
> kernel code will only handle interfaces, connections, and rules.

Ohhhh, these are the arguments of people who would rather invest in
more resources than in clever algorithms. It's comparable
to the old war between the ATM folks and the IP/Ethernet folks:
concepts against "brute" resources.

1. You don't take into account that there are not only high-end PCs and
workstations with enormous CPU and memory resources! Devices for
"pervasive ubiquitous computing" (don't blame me for this buzzword),
for example, are mostly embedded systems with scarce resources, happy
to have enough CPU cycles for video codecs.

2. On the other hand, there are video-on-demand servers with (more than one)
high-speed NIC, large SANs or disk arrays for video storage with
gigabit/InfiniBand connections, <fill in your favorite toy>. Here the
problem is not only saturating the links (for economic reasons) but
also guaranteeing low delay and jitter for every connection.
I think we should extend the usability of Linux to this class of
servers too.

3. Have a look at the various papers on high-performance networking.
The gap between the growth in network bandwidth and the growth in CPU
and bus performance is increasing. Today the system buses are not
considered to be in the "window of scarcity" (today we have 100MBit 
Ethernets and 133++MB/s PCI). Tomorrow our operating system concepts 
have to cope with 1, 10, ?? Gigabit Ethernets, Infiniband ,
... who knows.

This means: scale CPU and memory-bus performance accordingly, or
use resource-sparing IPC mechanisms and implement computationally
complex algorithms (checksum calculation, encryption) in hardware.
Besides continuous-media applications, other applications that need
to move data chunks much larger than the CPU caches will
benefit from such an infrastructure too. (Both classes of systems
above will be affected.)

For those applications copy avoidance is so fundamental - or
copying is so expensive - because copying needs all three basic
system resources (CPU, memory and the bandwidth of local communication
facilities, i.e. buses) at the same time (synchronously)!

Many researchers have recognized this problem and developed techniques
to overcome the dusty OS concepts (U-Net, UVM, ...). Unfortunately
they need special hardware (NICs), partly have too much
overhead, or are not general enough. The one thing this shows us
is that there is still some work to be done.

Regards,

Alexander Eichhorn

-- 
Alexander Eichhorn
Technical University of Ilmenau
Computer Science And Automation Faculty
Distributed Systems and Operating Systems Department
Phone +49 3677 69 4557, Fax  +49 3677 69 4541


* Re: [Question] Explanation of zero-copy networking
  2001-05-08 17:30         ` Alexander Eichhorn
@ 2001-05-09  9:56           ` Reto Baettig
  0 siblings, 0 replies; 18+ messages in thread
From: Reto Baettig @ 2001-05-09  9:56 UTC (permalink / raw)
  To: alexander.eichhorn; +Cc: root, linux-kernel

> considered to be in the "window of scarcity" (today we have 100MBit
> Ethernets and 133++MB/s PCI). Tomorrow our operating system concepts
> have to cope with 1, 10, ?? Gigabit Ethernets, Infiniband ,
> ... who knows.

We had to write our own RPC mechanism because with the standard stacks
we had no chance of achieving our goals. We would have loved to use
TCP/IP, but it was not possible with Linux 2.2.
Today we achieve almost 200 MB/s over our RPC stack, and this with the
CPUs almost idle. With TCP/IP and Gig-E we only got up to 60-70 MB/s,
and then the system was completely busy and unresponsive (Linux 2.4 is
supposed to be better, but I doubt that we would get a CPU load this low
without zero-copy networking).

We would like to look at the zero-copy ideas in Linux 2.4 and try to
implement our RPC mechanism over zero-copy TCP (if something like this
exists). We have just started with this idea and don't know exactly where
to begin yet (we are looking for something like a de-facto zero-copy
standard for sockets)... Any ideas are welcome.
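
As far as I know, the closest thing to a de-facto interface on the send
side is sendfile() on a TCP socket, optionally wrapped in TCP_CORK so
that a small RPC header and the file-backed payload leave in full-sized
frames. A rough sketch, with an invented header layout purely for
illustration:

/* Hypothetical RPC reply: small header from user memory, bulk payload
 * sent from a file descriptor via sendfile().  TCP_CORK keeps header
 * and payload from going out as separate small segments.  The header
 * layout is made up. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

struct rpc_hdr {                /* illustrative wire header */
        unsigned int id;
        unsigned int len;
};

int rpc_send_reply(int sock, unsigned int id, int payload_fd, size_t len)
{
        struct rpc_hdr h;
        off_t off = 0;
        int on = 1, cork_off = 0;

        h.id = htonl(id);
        h.len = htonl((unsigned int)len);

        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
        if (write(sock, &h, sizeof(h)) != (ssize_t)sizeof(h))
                return -1;
        while (off < (off_t)len)
                if (sendfile(sock, payload_fd, &off, len - off) <= 0)
                        return -1;
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &cork_off, sizeof(cork_off));
        return 0;
}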

	Reto


* Re: [Question] Explanation of zero-copy networking
  2001-05-08  7:18     ` Jamie Lokier
@ 2001-05-09 15:13       ` Eric W. Biederman
  0 siblings, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2001-05-09 15:13 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Richard B. Johnson, Alan Cox, alexander.eichhorn, linux-kernel

Jamie Lokier <lk@tantalophile.demon.co.uk> writes:

> Richard B. Johnson wrote:
> > However, PCI to memory copying runs at about 300 megabytes per
> > second on modern PCs and memory to memory copying runs at over 1,000
> > megabytes per second. In the future, these speeds will increase.
> 
> That would be "big expensive modern PCs" then.  Our clusters of 700MHz
> boxes are strictly limited to 132 megabytes per second over PCI...

300 megabytes per second is definitely an odd number for a PCI bus.
But 132 megabytes per second is actually on the high side; the continuous
burst speeds are:
32bit 33Mhz: 33*1000*1000*32/(1024*1024*8) = 125.8 Megabytes/second
64bit 33Mhz: 33*1000*1000*64/(1024*1024*8) = 251.7 Megabytes/second
32bit 66Mhz: 66*1000*1000*32/(1024*1024*8) = 251.7 Megabytes/second
64bit 66Mhz: 66*1000*1000*64/(1024*1024*8) = 503.4 Megabytes/second

The possibility of getting continuous bursts is actually low; if
nothing else, you have an interrupt acknowledgement 100 times per
second.  But if you are pushing the bus, it should deliver close to
its burst potential.  ISA traffic doing subtractive decode can be
nasty, though, because you get 4 PCI cycles before you even get an
acknowledgement from the PCI/ISA bridge that there is something to
transfer to.

Eric

