linux-kernel.vger.kernel.org archive mirror
From: ebiederm@xmission.com (Eric W. Biederman)
To: Werner Almesberger <werner@almesberger.net>
Cc: Jeff Garzik <jgarzik@pobox.com>,
	Nivedita Singhvi <niv@us.ibm.com>,
	netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: TOE brain dump
Date: 05 Aug 2003 11:19:09 -0600
Message-ID: <m1u18wuinm.fsf@frodo.biederman.org>
In-Reply-To: <20030804162433.L5798@almesberger.net>

Werner Almesberger <werner@almesberger.net> writes:

> Eric W. Biederman wrote:
> > The optimized for low latency cases seem to have a strong
> > market in clusters.
> 
> Clusters have captive, no, _desperate_ customers ;-) And it
> seems that people are just as happy putting MPI as their
> transport on top of all those link-layer technologies.

MPI is not a transport.  It is an interface like the Berkeley sockets
layer.  The semantics it wants right now are usually mapped to
TCP/IP when used on an IP network.  Though I suspect SCTP might
be a better fit.  

But right now nothing in the IP stack is a particularly good fit.

Right now there is a very strong feeling among most of the people
using and developing on clusters that by and large what they are doing
is not of interest to the general kernel community, and so has no
chance of going in.   So you see hack piled on top of hack piled on
top of hack.

Mostly I think that is less true, at least if they can stand the
process of severe code review and cleaning up their code.  If we can
put in code to scale the kernel to 64 processors, NIC drivers for
fast interconnects and a few similar tweaks can't hurt either.

But of course to get through the peer review process people need
to understand what they are doing.

> > There is one place in low latency communications that I can think
> > of where TCP/IP is not the proper solution.  For low latency
> > communication the checksum is at the wrong end of the packet.
> 
> That's one of the few things ATM's AAL5 got right. But in the end,
> I think it doesn't really matter. At 1 Gbps, an MTU-sized packet
> flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point,
> you may well treat it as an atomic unit.

So consider store and forward of packets in a 3 layer switch hierarchy,
at 1.3 us per copy: 1.3us to the NIC + 1.3us to the first switch chip +
1.3us to the second switch chip + 1.3us to the top level switch chip +
1.3us to a middle layer switch chip + 1.3us to the receiving NIC +
1.3us to the receiver.

1.3us * 7 = 9.1us to deliver a packet to the other side.  That is
still quite painful.  Right now I can get better latencies over any of
the cluster interconnects.  I think 5 us is the current low end, with
the high end being about 1 us.
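
The arithmetic above can be reproduced with a small sketch.  The
function names and the 1500-byte frame size are assumptions made for
illustration; they are not from the thread.  A bare 1500-byte payload
at 10 Gbps serializes in about 1.2 us, in the same range as the 1.3 us
per copy used above.

```python
def serialization_time_us(frame_bytes: int, link_bps: float) -> float:
    """Time to clock one frame onto the wire, in microseconds."""
    return frame_bytes * 8 / link_bps * 1e6


def store_and_forward_latency_us(per_copy_us: float, copies: int) -> float:
    """Total latency when every stage must receive the whole packet
    before it can start forwarding it."""
    return per_copy_us * copies


# Assumed frame size: a 1500-byte payload, ignoring framing overhead.
per_copy_10g = serialization_time_us(1500, 10e9)

# The seven copies enumerated above, at the thread's 1.3 us per copy.
total = store_and_forward_latency_us(1.3, 7)

print(f"per copy at 10 Gbps: {per_copy_10g:.2f} us")
print(f"7 copies at 1.3 us:  {total:.1f} us")
```

Each additional store-and-forward stage adds one full packet time, so
the total grows linearly with the depth of the switch hierarchy.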

Quite often in MPI when a message is sent the program cannot continue
until the reply is received.  Possibly this is a fundamental problem
with the application programming model, encouraging applications to
be latency sensitive.  But it is a well established API and
programming paradigm so it has to be lived with.
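
A rough model of why this makes applications latency sensitive,
assuming each step does a fixed amount of computation and then blocks
on one request/reply exchange.  The function names and the 10 us
compute time are hypothetical values chosen for illustration only.

```python
def run_time_us(steps: int, compute_us: float, one_way_latency_us: float) -> float:
    """Wall-clock time when every step blocks until the reply arrives."""
    round_trip_us = 2 * one_way_latency_us
    return steps * (compute_us + round_trip_us)


def efficiency(compute_us: float, one_way_latency_us: float) -> float:
    """Fraction of wall-clock time spent computing rather than waiting."""
    return compute_us / (compute_us + 2 * one_way_latency_us)


# Hypothetical workload: 10 us of computation per step.
for latency_us in (9.1, 5.0, 1.0):
    print(f"one-way latency {latency_us:4.1f} us -> "
          f"efficiency {efficiency(10.0, latency_us):.2f}")
```

In this model the waiting time is not overlapped with computation, so
every microsecond of one-way latency costs two microseconds per step.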

All of this is pretty much the reverse of the TOE case.  Things are
latency sensitive because real work needs to be done.  And the more
latency you have the slower that work gets done.  

A lot of the NICs which are used for MPI tend to be smart for two
reasons.  1) So they can do source routing. 2) So they can safely
export some of their interface to user space, so in the fast path
they can bypass the kernel.

Eric


