linux-kernel.vger.kernel.org archive mirror
* TOE brain dump
@ 2003-08-02 17:04 Werner Almesberger
  2003-08-02 17:32 ` Nivedita Singhvi
  2003-08-02 20:57 ` Alan Cox
  0 siblings, 2 replies; 74+ messages in thread
From: Werner Almesberger @ 2003-08-02 17:04 UTC (permalink / raw)
  To: netdev, linux-kernel

At OLS, there was a bit of discussion on (true and false *) TOEs
(TCP Offload Engines). In the course of this discussion, I've
suggested what might be a novel approach, so in case this is a
good idea, I'd like to dump my thoughts on it, before someone
tries to patent my ideas. (Most likely, some of this has already
been done or tried elsewhere, but it can't hurt to try to err on
the safe side.)

(*) The InfiniBand people unfortunately also call their TCP/IP
    bypass "TOE" (for which they promptly get shouted down,
    every time they use that word). This is misleading, because
    there is no TCP that's getting offloaded, but TCP is simply
    never done. I would consider it to be more accurate to view
    this as a separate networking technology, with semantics
    different from TCP/IP, similar to ATM and AAL5.

While I'm not entirely convinced about the usefulness of TOE in
all the cases it's been suggested for, I can see value in certain
areas, e.g. when TCP per-packet overhead becomes an issue.

However, I consider the approach of putting a new or heavily
modified stack, which duplicates a considerable amount of the
functionality in the main kernel, on a separate piece of hardware
questionable at best. Some of the issues:

 - if this stack is closed source or generally hard to modify,
   security fixes will be slowed down

 - if this stack is closed source or generally hard to modify,
   TOE will not be available to projects modifying the stack,
   e.g. any of the research projects trying to make TCP work at
   gigabit speeds

 - this stack either needs to implement all administrative
   interfaces of the regular kernel, or such a system would have
   non-uniform configuration/monitoring across interfaces

 - in some cases, administrative interfaces will require a
   NIC/TOE-specific switch in the kernel (netlink helps here)

 - route changes on multi-homed hosts (or any similar kind of
   failover) are difficult if the state of TCP connections is
   tied to specific NICs (I've discussed some issues when
   "migrating" TCP connections in the documentation of tcpcp,
   http://www.almesberger.net/tcpcp/)

 - new kernel features will always lag behind on this kind of
   TOE, and different kernels will require different "firmware"

 - last but not least, keeping TOE firmware up to date with the
   TCP/IP stack in the mainstream kernel will require - for each
   such TOE device - a significant and continuous effort over a
   long period of time
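
To make the migration problem concrete, here is a minimal sketch
(in C, with a purely illustrative field set, not tcpcp's actual
format) of the per-connection state any such failover scheme
would have to extract from the NIC and re-instantiate elsewhere:

```c
#include <stdint.h>

/* Hypothetical snapshot of the TCP connection state a
 * "connection passing" scheme would have to carry when moving
 * a connection off a TOE NIC. The field set is illustrative
 * only; a real scheme needs more (timers, SACK state, etc.). */
struct tcp_conn_state {
    uint32_t saddr, daddr;      /* IPv4 endpoints */
    uint16_t sport, dport;      /* ports */
    uint32_t snd_nxt, snd_una;  /* next seq to send, oldest unacked */
    uint32_t rcv_nxt;           /* next seq expected */
    uint32_t snd_wnd, rcv_wnd;  /* send/receive windows */
    uint16_t mss;               /* negotiated MSS */
};
```

If any of this lives only inside closed NIC firmware, failover
to another interface means losing the connection.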

In short, I think such a solution is either a pain to use, or
unmaintainable, or - most likely - both.

So, how to do better? Easy: use the Source, Luke. Here's my
idea:

 - instead of putting a different stack on the TOE, a
   general-purpose processor (probably with some enhancements,
   and certainly with optimized data paths) is added to the NIC

 - that processor runs the same Linux kernel image as the host,
   acting like a NUMA system

 - a selectable part of TCP/IP is handled on the NIC, and the
   rest of the system runs on the host processor

 - instrumentation is added to the mainstream kernel to ensure
   that as little data as possible is shared between the main
   CPU and such peripheral CPUs. Note that such instrumentation
   would be generic, outlining possible boundaries, and not tied
   to a specific TOE design.

 - depending on hardware details (cache coherence, etc.), the
   instrumentation mentioned above may even be necessary for
   correctness. This would have the unfortunate effect of making
   the design very fragile with respect to changes in the
   mainstream kernel. (Performance loss in the case of imperfect
   instrumentation would be preferable.)

 - further instrumentation may be needed to let the kernel switch
   CPUs (i.e. host to NIC, and vice versa) at the right time

 - since the NIC would probably use a CPU design different from
   the host CPU, we'd need "fat" kernel binaries:

   - data structures are the same, i.e. word sizes, byte order,
     bit numbering, etc. are compatible, and alignments are
     chosen such that all CPUs involved are reasonably happy

   - kernels live in the same address space

   - function pointers become arrays, with one pointer per
     architecture. When comparing pointers, the first element is
     used.

 - if one should choose to also run parts of user space on the
   NIC, fat binaries would also be needed for this (along with
   other complications)

Benefits:

 - putting the CPU next to the NIC keeps data paths short, and
   allows for all kinds of optimizations (e.g. a pipelined
   memory architecture)

 - the design is fairly generic, and would apply equally to
   kernel areas other than TCP/IP

 - using the same kernel image eliminates most maintenance
   problems, and encourages experimenting with the stack

 - using the same kernel image (and compatible data structures)
   guarantees that administrative interfaces are uniform in the
   entire system

 - such a design would likely allow TCP state to be moved to a
   different NIC, if necessary

Possible problems that may kill this idea:

 - it may be too hard to achieve correctness

 - it may be too hard to switch CPUs properly

 - it may not be possible to express copy operations efficiently
   in such a context

 - there may be no way to avoid sharing of hardware-specific
   data structures, such as page tables, or to emulate their use

 - people may consider the instrumentation required for this,
   although fairly generic, too intrusive

 - all this instrumentation may eat too much performance

 - nobody may be interested in building hardware for this

 - nobody may be patient enough to pursue such long-termish
   development, with uncertain outcome

 - something I haven't thought of

I lack the resources (hardware, financial, and otherwise) to
actually do something with these ideas, so please feel free to
put them to some use.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

* RE: TOE brain dump
@ 2003-08-04 18:36 Perez-Gonzalez, Inaky
  2003-08-04 19:03 ` Alan Cox
  0 siblings, 1 reply; 74+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-08-04 18:36 UTC (permalink / raw)
  To: Larry McVoy, David Lang
  Cc: Erik Andersen, Werner Almesberger, Jeff Garzik, netdev,
	linux-kernel, Nivedita Singhvi


> From: Larry McVoy [mailto:lm@bitmover.com]
>
> > 2. router nodes that have access to main memory (PCI card running linux
> > acting as a router/firewall/VPN to offload the main CPU's)
> 
> I can get an entire machine, memory, disk, > Ghz CPU, case, power supply,
> cdrom, floppy, onboard enet extra net card for routing, for $250 or less,
> quantity 1, shipped to my door.
> 
> Why would I want to spend money on some silly offload card when I can get
> the whole PC for less than the card?

Because you want to stack 200 of those together in a huge
data center, interconnecting whatever you want to interconnect,
and you don't want your maintenance costs to go through the roof?

I see your point, though :)

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)

