linux-kernel.vger.kernel.org archive mirror
From: Glen Turner <glen.turner@aarnet.edu.au>
To: Jeff Garzik <jgarzik@pobox.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: TOE brain dump
Date: Mon, 04 Aug 2003 11:17:23 +0930
Message-ID: <3F2DBB2B.9050803@aarnet.edu.au>
In-Reply-To: <3F2CAE61.7070401@pobox.com>


> Really fast, really long pipes in practice don't exist for 99.9% of all 
> Internet users.

Writing from Australia, I think you're out by at least
one order of magnitude and probably two.  That is, I'd
expect about 10% of the net to be on long fast pipes.

Here every worthwhile fast pipe is a long fast pipe.  90% of
Australia's net traffic goes to the West Coast of the USA,
which is 14,000 km away.
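
To put rough numbers on "long fast pipe", here is a minimal sketch of the
bandwidth-delay product for a 100 Mbit/s host on a 14,000 km path.  The
propagation speed in fibre (about 200,000 km/s) is an assumption, and the
sketch ignores routing detours and queueing, so the real round-trip time
would be higher.  The 65,535 byte figure is the largest window TCP can
advertise without the window scale option.

```python
# Rough bandwidth-delay product for a long fast pipe.
# Assumption: signals travel at about 200,000 km/s in fibre, and the
# path is a straight 14,000 km run with no queueing delay.
FIBRE_KM_PER_S = 200_000.0
DISTANCE_KM = 14_000.0
LINK_BITS_PER_S = 100e6          # 100Base-TX host
UNSCALED_WINDOW_BYTES = 65_535   # 16-bit TCP window, no window scaling

# Minimum round-trip time: there and back at fibre speed.
rtt_s = 2 * DISTANCE_KM / FIBRE_KM_PER_S

# Bytes that must be in flight to keep the link full.
bdp_bytes = LINK_BITS_PER_S * rtt_s / 8

# Best-case throughput when only one unscaled window fits per RTT.
unscaled_ceiling_bps = UNSCALED_WINDOW_BYTES * 8 / rtt_s

print(f"minimum RTT:             {rtt_s * 1000:.0f} ms")
print(f"bandwidth-delay product: {bdp_bytes / 1e6:.2f} MB")
print(f"unscaled window ceiling: {unscaled_ceiling_bps / 1e6:.2f} Mbit/s")
```

Under these assumptions the sender needs well over a megabyte in flight,
while an unscaled window limits it to a few Mbit/s on a 100 Mbit/s link.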

Australia accounts for about 10% of current net traffic. About
30% of Australia's net traffic is from AARNet, typically
100Base-TX hosts.

So you're out by about an order of magnitude, just accounting
for one ISP in one small country.  I'll leave the calculations
for the academic networks of China to others.
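
The arithmetic behind that estimate, as a small sketch.  It uses only the
figures given above, and it assumes (as the argument does) that a share of
traffic can be compared directly with the quoted share of users.

```python
import math

# Quoted claim: 99.9% of users have no long fast pipe, so 0.1% do.
claimed_fraction = 0.001

# Figures from the text above.
australia_share_of_net = 0.10    # Australia's share of net traffic
aarnet_share_of_australia = 0.30 # AARNet's share of Australian traffic

# Share of all net traffic that is AARNet traffic on long fast pipes.
aarnet_fraction = australia_share_of_net * aarnet_share_of_australia

# How far out the quoted claim is, as a ratio and in orders of magnitude.
ratio = aarnet_fraction / claimed_fraction
orders_of_magnitude = math.log10(ratio)

print(f"AARNet alone: {aarnet_fraction:.1%} of net traffic")
print(f"ratio to the quoted 0.1%: {ratio:.0f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```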

> There is one interesting TOE solution, that I have yet to see created: 
> run Linux on an embedded processor, on the NIC.  This stripped-down 
> Linux kernel would perform all the header parsing, checksumming, etc. 
> into the NIC's local RAM.  The Linux OS driver interface becomes a 
> virtual interface with a large MTU, that communicates from host CPU to 
> NIC across the PCI bus using jumbo-ethernet-like data frames. Management 
> frames would control the ethernet interface on the other side of the PCI 
> bus "tunnel".

This assumes the offload processor is at least 100x faster at
processing the IP frames than the kernel.  There is silicon where
that is true (e.g., network processors), but good GCC support for
that silicon is unlikely, given that good GCC support even for
popular silicon is somewhat lacking.

Someone else wrote:
> It's been tried a number of times. Usually, real life sneaks
> in at one point or another, leaving behind a complex mess.
> When they've sorted out these problems, regular TCP has caught
> up with the great optimized transport protocols. At that point,
> they return to their niche, sometimes tail between legs and
> muttering curses, sometimes shaking their fist and boldly
> proclaiming how badly they'll rub TCP in the dirt in the next
> round. Maybe they shed off some of the complexity, and trade it
> for even more aggressive optimization, which puts them into
> their niche even more firmly. Eventually, they fade away. 

This ignores the push-back of platform support onto protocol
design.  The IETF iSCSI WG discussed using transport protocols
which allow out-of-order delivery of SCSI blocks, rather than
the head-of-queue blocking that happens using TCP, but it
was felt that iSCSI would never gain vendor support unless
it ran over TCP.
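
A toy illustration of the head-of-queue blocking mentioned above.  This is
not iSCSI or TCP code; the block numbering from 0, the arrival times and
both function names are made up for the example.  Block 1 is lost and only
arrives at t=9 after a retransmission.

```python
def in_order_delivery(arrivals):
    """Deliver blocks strictly in sequence, as a TCP byte stream would.

    arrivals is a list of (arrival_time, seq) pairs.
    Returns {seq: time the block is handed to the application}.
    """
    delivered = {}
    pending = set()
    next_seq = 0
    for time, seq in sorted(arrivals):
        pending.add(seq)
        # Release every block that is now contiguous with what was delivered.
        while next_seq in pending:
            pending.remove(next_seq)
            delivered[next_seq] = time
            next_seq += 1
    return delivered


def out_of_order_delivery(arrivals):
    """Hand each block to the application as soon as it arrives."""
    return {seq: time for time, seq in arrivals}


# Hypothetical trace: block 1 is retransmitted and arrives last.
arrivals = [(1, 0), (2, 2), (3, 3), (4, 4), (9, 1)]

print("in order:    ", in_order_delivery(arrivals))
print("out of order:", out_of_order_delivery(arrivals))
```

With in-order delivery, blocks 2 to 4 sit in the receive queue until
block 1 turns up; with out-of-order delivery only block 1 is late.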

> Another problem of TCP is that it has grown a bit too many
> knobs you need to turn before it works over your really fast
> really long pipe. (In one of the OLS after dinner speeches,
> this was quite appropriately called the "wizard gap".)

That's Matt Mathis's phrase.  The Web100 project
<http://www.web100.org/> has a set of patches to the kernel
which go a long way to reducing the wizard gap.  It would be
nice to see those patches eventually appear in the Linux
mainstream.

It's disturbing to see patches with a similar purpose (such
as those instrumenting UDP) being knocked back on grounds
of slowing the TCP/IP path, which is a wonderful example
of suboptimisation.

> That's why NFS turned off UDP checksums ;-) As soon as you put
> it on IP, it will crawl to distances you didn't imagine in your
> wildest dreams. It always does.

I'll note that Sun turned UDP checksumming back on.  Not
only is disk corruption forever, but Sun machines running
DNS servers were notorious for not checksumming DNS responses,
which had the nasty effect of poisoning DNS caches.

The NANOG mailing list (a list of US ISP network engineers)
cooperated in finding all of these and getting those Classic
SunOS kernels patched to activate checksumming.  We couldn't
do that nowadays; the net is just so much bigger.

Do the net a favour, don't stuff with UDP checksumming.
RFC 1122 (Host Requirements) states that checksumming
MUST be on by default and that hosts MAY allow checksumming
to be turned off per *program* (i.e., not across the entire
box).  That requirement is born of bitter experience with
Classic SunOS's "no checksumming across the entire box by
default".
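
For anyone wondering what is lost when the checksum is switched off, here
is a minimal sketch of the 16-bit one's complement checksum described in
RFC 1071, which UDP uses.  It is a simplification: a real UDP checksum
also covers a pseudo-header and the UDP header, and the payload below is
made up for the example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"              # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Hypothetical payload standing in for a DNS response.
payload = b"example DNS response"

# Simulate corruption in transit: flip one bit in the first byte.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

sent = internet_checksum(payload)
received = internet_checksum(corrupted)
print(f"checksum sent:     {sent:#06x}")
print(f"checksum computed: {received:#06x}")
print("corruption detected" if sent != received else "corruption missed")
```

A receiver that verifies the checksum drops the damaged datagram; with
checksumming off, the flipped bit goes straight to the application.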

-- 
  Glen Turner         Tel: (08) 8303 3936 or +61 8 8303 3936
  Network Engineer          Email: glen.turner@aarnet.edu.au
  Australian Academic & Research Network   www.aarnet.edu.au
-- 
  linux.conf.au 2004, Adelaide          lca2004.linux.org.au
  Main conference 14-17 January 2004   Miniconfs from 12 Jan

