linux-kernel.vger.kernel.org archive mirror
From: Jesse Pollard <jesse@cats-chateau.net>
To: Andy Isaacson <adi@hexapodia.org>
Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: TOE brain dump
Date: Wed, 6 Aug 2003 13:58:59 -0500
Message-ID: <03080613585900.09086@tabby>
In-Reply-To: <20030806112556.C26920@hexapodia.org>

On Wednesday 06 August 2003 11:25, Andy Isaacson wrote:
> On Wed, Aug 06, 2003 at 07:46:33AM -0500, Jesse Pollard wrote:
> > On Tuesday 05 August 2003 12:19, Eric W. Biederman wrote:
> > > So store and forward of packets in a 3 layer switch hierarchy, at 1.3
> > > us per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us
> > > to the second switch chip + 1.3us to the top level switch chip + 1.3us
> > > to a middle layer switch chip + 1.3us to the receiving NIC + 1.3us the
> > > receiver.
> > >
> > > 1.3us * 7 = 9.1us to deliver a packet to the other side.  That is
> > > still quite painful.  Right now I can get better latencies over any of
> > > the cluster interconnects.  I think 5 us is the current low end, with
> > > the high end being about 1 us.
> >
> > I think you are off here since the second and third layer should not
> > recompute checksums other than for the header (if they even did that).
> > Most of the switches I used (mind, not configured) were wire speed. Only
> > header checksums had recomputes, and I understood it was only for
> > routing.
>
> The switches may be "wire speed" but that doesn't help the latency any.
> AFAIK all GigE switches are store-and-forward, which automatically costs
> you the full 1.3us for each link hop.  (I didn't check Eric's numbers,
> so I don't know that 1.3us is the right value, but it sounds right.)
> Also I think you might be confused about what Eric meant by "3 layer
> switch hierarchy"; he's referring to a tree topology network with
> layer-one switches connecting hosts, layer-two switches connecting
> layer-one switches, and layer-three switches connecting layer-two
> switches.  This means that your worst-case node-to-node latency has 6
> wire hops with 7 "read the entire packet into memory" operations,
> depending on how you count the initiating node's generation of the
> packet.

If it reads the packet into memory before starting transmission, it isn't
"wire speed". It is a router.
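
For reference, the arithmetic being argued over can be sketched as below. This
is only an illustration: the 1.3us per-copy figure and the count of 7 copies
come from the quoted text, the function names are made up for this sketch, and
the frame size and link rate passed to the second helper are inputs the reader
supplies, not values taken from the thread.

```python
# Store-and-forward latency for the tree topology discussed above.
# 1.3 us per copy and 7 copies are the figures from the quoted text;
# the function names are hypothetical, chosen for this sketch only.

def store_and_forward_latency_us(per_copy_us: float, copies: int) -> float:
    """Total latency when every stage buffers the whole packet before forwarding."""
    return per_copy_us * copies


def serialization_delay_us(frame_bytes: int, link_bits_per_sec: float) -> float:
    """Time to clock one complete frame onto (or off) a link, in microseconds."""
    return frame_bytes * 8 / link_bits_per_sec * 1e6


total = store_and_forward_latency_us(1.3, 7)
print(f"{total:.1f} us")  # prints "9.1 us"
```

The second helper is there so the per-copy figure can be checked against
whatever frame size and link rate one has in mind.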

> [snip]
>
> > > Quite often in MPI when a message is sent the program cannot continue
> > > until the reply is received.  Possibly this is a fundamental problem
> > > with the application programming model, encouraging applications to
> > > be latency sensitive.  But it is a well established API and
> > > programming paradigm so it has to be lived with.
>
> This is true, in HPC.  Some of the problem is the APIs encouraging such
> behavior; another part of the problem is that sometimes, the problem has
> fundamental latency dependencies that cannot be programmed around.
>
> > > A lot of the NICs which are used for MPI tend to be smart for two
> > > reasons.  1) So they can do source routing. 2) So they can safely
> > > export some of their interface to user space, so in the fast path
> > > they can bypass the kernel.
> >
> > And bypass any security checks required. A single rogue MPI application
> > using such an interface can/will bring the cluster down.
>
> This is just false.  Kernel bypass (done properly) has no negative
> effect on system stability, either on-node or on-network.  By "done
> properly" I mean that the NIC has mappings programmed into it by the
> kernel at app-startup time, and properly bounds-checks all remote DMA,
> and has a method for verifying that incoming packets are not rogue or
> corrupt.  (Of course a rogue *kernel* can probably interfere with other
> *applications* on the network it's connected to, by inserting malicious
> packets into the datastream, but even that is soluble with cookies or
> routing checks.  However, I don't believe any systems try to defend
> against rogue nodes today.)

Just because the packet gets transferred to a buffer correctly does not
mean that buffer is the one it should have been sent to. If it didn't
have this problem, then there would be no kernel TCP/IP interaction. Just
open the ethernet device and start writing/reading. Oops. Known security
failure.
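
To make the disagreement concrete, here is a minimal sketch of the checks the
quoted text calls "done properly": mappings installed by the kernel at startup,
bounds-checking of every remote write, and a cookie on incoming requests. It is
a toy model, not Myrinet's GM or any real NIC interface; every name in it
(Mapping, NicModel, kernel_register, remote_write_allowed) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    base: int    # start address of the registered region
    length: int  # size of the region in bytes
    cookie: int  # value a sender must present to use this mapping


class NicModel:
    """Toy model of a NIC holding kernel-installed DMA mappings (hypothetical)."""

    def __init__(self):
        self._mappings = {}

    def kernel_register(self, handle, base, length, cookie):
        # In the scheme described above, only the kernel installs mappings,
        # once, when the application starts.
        self._mappings[handle] = Mapping(base, length, cookie)

    def remote_write_allowed(self, handle, offset, size, cookie):
        m = self._mappings.get(handle)
        if m is None or cookie != m.cookie:
            return False  # unknown mapping or wrong cookie
        # Bounds check: the whole write must land inside the region.
        return offset >= 0 and size > 0 and offset + size <= m.length


nic = NicModel()
nic.kernel_register(handle=1, base=0x10000, length=4096, cookie=0xC0FFEE)
print(nic.remote_write_allowed(1, 0, 4096, 0xC0FFEE))    # True: in bounds
print(nic.remote_write_allowed(1, 4000, 200, 0xC0FFEE))  # False: runs past the end
print(nic.remote_write_allowed(1, 0, 64, 0xBAD))         # False: wrong cookie
```

Note that these checks only establish that a write fits a registered region and
carries the right cookie; whether that region is the one the data should have
gone to is the separate question raised above.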

>
> I believe that Myrinet's hardware has the capability to meet the "kernel
> bypass done properly" requirement I state above; I make no claim that
> their GM implementation actually meets the requirement (although I think
> it might).  It's pretty likely that QSW's Elan hardware can, too, but I
> know even less about that.

Since the routing is done in user mode, as part of the library, it can be
used to directly affect processes NOT owned by the user. This bypasses
the kernel security checks by definition. This is already known to happen with
raw Myrinet, so there is a kernel layer on top of it to shield it (or
at least try to). If there is no kernel involvement, then there can be
no restrictions on what can be passed down the line to the device. Now,
some of the modifications for Myrinet were to use normal TCP/IP to establish
source/destination header information, then bypass any packet handshake, but
force EACH packet to include the pre-established source/destination header
info. This is equivalent to UDP, but without any checksums, and it can
sometimes bypass part of the kernel cache. Unfortunately, it also means that
sometimes incoming data is NOT destined for the user, and must be
erased/copied before it reaches its final destination. This introduces leaks
due to the race condition caused by the transfer to the wrong buffer.

You can't DMA directly to a user's buffer, because you MUST verify the header
before the data... and you can't do that until the buffer is in memory...
So bypassing the kernel generates security failures.
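
The ordering constraint can be sketched as follows: the packet sits in a
staging buffer, the header is checked there, and only then is the payload
copied to the user's buffer. The header layout (source id, destination id,
payload length) and the names HEADER and deliver are invented for this
illustration and do not describe any real wire format.

```python
import struct

# Hypothetical header: 16-bit source id, 16-bit destination id,
# 32-bit payload length, all in network byte order.
HEADER = struct.Struct("!HHI")


def deliver(packet: bytes, expected_src: int, local_id: int,
            user_buf: bytearray) -> bool:
    """Copy the payload into user_buf only after the header checks out."""
    if len(packet) < HEADER.size:
        return False
    src, dst, length = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:]
    # Verify the pre-established source/destination before touching user memory.
    if src != expected_src or dst != local_id:
        return False
    if length != len(payload) or length > len(user_buf):
        return False
    user_buf[:length] = payload
    return True


buf = bytearray(16)
good = HEADER.pack(7, 9, 5) + b"hello"
stray = HEADER.pack(7, 3, 5) + b"OTHER"   # addressed to a different receiver
print(deliver(good, expected_src=7, local_id=9, user_buf=buf))   # True
print(deliver(stray, expected_src=7, local_id=9, user_buf=buf))  # False
```

The stray packet never reaches the user's buffer here, at the price of the
extra copy out of the staging buffer.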

This is already a problem in fibre channel devices, and in other network
devices. Any time you bypass the kernel's security checks you also void any
restrictions on the network, and on any hosts it is attached to.
