From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S270871AbTHFTkK (ORCPT ); Wed, 6 Aug 2003 15:40:10 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S270991AbTHFTkK (ORCPT ); Wed, 6 Aug 2003 15:40:10 -0400 Received: from pirx.hexapodia.org ([208.42.114.113]:64329 "EHLO pirx.hexapodia.org") by vger.kernel.org with ESMTP id S270871AbTHFTj6 (ORCPT ); Wed, 6 Aug 2003 15:39:58 -0400 Date: Wed, 6 Aug 2003 14:39:56 -0500 From: Andy Isaacson To: Jesse Pollard Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030806143956.B15543@hexapodia.org> References: <20030802140444.E5798@almesberger.net> <03080607463300.08387@tabby> <20030806112556.C26920@hexapodia.org> <03080613585900.09086@tabby> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <03080613585900.09086@tabby>; from jesse@cats-chateau.net on Wed, Aug 06, 2003 at 01:58:59PM -0500 X-PGP-Fingerprint: 48 01 21 E2 D4 E4 68 D1 B8 DF 39 B2 AF A3 16 B9 X-PGP-Key-URL: http://web.hexapodia.org/~adi/pgp.txt X-Domestic-Surveillance: money launder bomb tax evasion Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 06, 2003 at 01:58:59PM -0500, Jesse Pollard wrote: > On Wednesday 06 August 2003 11:25, Andy Isaacson wrote: > > The switches may be "wire speed" but that doesn't help the latency any. > > AFAIK all GigE switches are store-and-forward, which automatically costs > > you the full 1.3us for each link hop. (I didn't check Eric's numbers, > > so I don't know that 1.3us is the right value, but it sounds right.) > > Also I think you might be confused about what Eric meant by "3 layer > > switch hierarchy"; he's referring to a tree topology network with > > layer-one switches connecting hosts, layer-two switches connecting > > layer-one switches, and layer-three switches connecting layer-two > > switches. This means that your worst-case node-to-node latency has 6 > > wire hops with 7 "read the entire packet into memory" operations, > > depending on how you count the initiating node's generation of the > > packet. > > If it reads the packet into memory before starting transmission, it isn't > "wire speed". It is a router. [Please read an implied "I might be totally off base here, since I've never designed an Ethernet switch" disclaimer into this paragraph.] This statement is completely false. Ethernet switches *do* read the packet into memory before starting transmission. This must be so, because an Ethernet switch does not propagate runts, jabber frames, or frames with an incorrect ethernet crc. If the switch starts transmission before it's received the last bit, it is provably impossible for it to avoid propagating crc-failing-frames; ergo, switches must have the entire packet on hand before starting transmission. > > > > A lot of the NICs which are used for MPI tend to be smart for two > > > > reasons. 1) So they can do source routing. 2) So they can safely > > > > export some of their interface to user space, so in the fast path > > > > they can bypass the kernel. > > > > > > And bypass any security checks required. A single rogue MPI application > > > using such an interface can/will bring the cluster down. > > > > This is just false. Kernel bypass (done properly) has no negative > > effect on system stability, either on-node or on-network. By "done > > properly" I mean that the NIC has mappings programmed into it by the > > kernel at app-startup time, and properly bounds-checks all remote DMA, > > and has a method for verifying that incoming packets are not rogue or > > corrupt. (Of course a rogue *kernel* can probably interfere with other > > *applications* on the network it's connected to, by inserting malicious > > packets into the datastream, but even that is soluble with cookies or > > routing checks. However, I don't believe any systems try to defend > > against rogue nodes today.) > > Just because the packet gets transfered to a buffer correctly does not > mean that buffer is the one it should have been sent to. If it didn't > have this problem, then there would be no kernel TCP/IP interaction. Just > open the ethernet device and start writing/reading. Ooops. known security > failure. You're ignoring the fact that there's a complete, programmable RISC CPU on the Myrinet card which is running code (the MCP, Myrinet Control Program) installed into it by the kernel. The kernel tells the MCP to allow access to a given app (by mapping a page of PCI IO addresses into the user's virtual address space), and the MCP checks the user's DMA requests for validity. The user cannot generate arbitrary Myrinet routing requests, cannot write to arbitrary addresses, cannot send messages to hosts not in his allowed lists, et cetera. We do know that the buffer is the one it should have been sent to, because the MCP on the sending end verified that it was an allowed destination host, and the MCP on the receiving end verified that the destination address was valid. Myrinet Inc even offers a SDK allowing you to write your own MCP, if you so desire, and various research projects have done precisely that. Demonstrating that dumb Ethernet cards cannot be smart does not demonstrate that smart FooNet cards cannot be smart. (s/FooNet/$x/ as desired.) > > I believe that Myrinet's hardware has the capability to meet the "kernel > > bypass done properly" requirement I state above; I make no claim that > > their GM implementation actually meets the requirement (although I think > > it might). It's pretty likely that QSW's Elan hardware can, too, but I > > know even less about that. > > since the routing is done is user mode, as part of the library, it can be > used to directly affect processes NOT owned by the user. This > bypasses the kernel security checks by definition. The routing is done on the MCP, not in a library. (Or at least, it could be -- I don't know offhand how GM1 and GM2 work.) This is not an insoluble problem. > Already known to happen with raw myrinet, so there is a kernel layer > on top of it to shield it (or at least try to). Perhaps that's the case with GM1 (I don't know) but it is not a fundamental flaw of the hardware or the network. > If there is no kernel involvement, then there can be no restrictions > on what can be passed down the line to the device. The MCP provides the necessary checking. > Now some of the modifications for myrinet were to use normal TCP/IP to > establish source/destination header information, then bypass any > packet handshake, but force EACH packet to include the pre-established > source/destination header info. I don't know what you're talking about here; perhaps this was some early "TCP over Myrinet" thing. Currently on a host with GM1 running, the myri0 interface shows up as an almost-normal Ethernet interface, and most of the relevant networking ioctls work just fine. I can even tcpdump it. On a related topic, there is a Myrinet line card with a GigE port available. I haven't looked into the software end deeply, but apparently you just stick a standard Myrinet route to that switch port on the front of the Myrinet frame, append an Ethernet frame, and your Myrinet host can send GigE packets without bother. I don't know how incoming ethernet packets are routed, alas -- presumably a Myrinet route is encoded in the MAC somehow. > This is equivalent to UDP, but without any checksums, and sometimes > can bypass part of the kernel cache. Unfortunately, it also means that > sometimes incoming data is NOT destined for the user, and must be > erased/copied before the final destination is achieved. This introduces leaks > due to the race condition caused by the transfer to the wrong buffer. > > You can't DMA directly to a users buffer, because you MUST verify the header > before the data... and you can't do that until the buffer is in memory... > So bypassing the kernel generates security failures. Again, the security problems are solved by having the MCP check the necessary conditions. You bring up a good point WRT error resilience, though -- I don't know how Myrinet handles media bit errors. You *can* DMA directly to a user's buffer, because the necessary header information was checked on the MCP before the bits even touch the PCI bus. > This is already a problem in fibre channel devices, and in other network > devices. Anytime you bypass the kernel security you also void any > restrictions on the network, and any hosts it is attached to. Sufficiently advanced HBA hardware and software solve this problem. Please pick another windmill to tilt at. (Like the error one; I need to find out what the answer to that is.) -andy