On Sun, Jun 03 2018, Dilger, Andreas wrote:

> On Jun 1, 2018, at 17:19, NeilBrown wrote:
>>
>> On Fri, Jun 01 2018, Doug Oucharek wrote:
>>
>>> Would it make sense to land LNet and LNDs on their own first? Get
>>> the networking house in order first before layering on the file
>>> system?
>>
>> I'd like to turn that question on its head:
>> Do we need LNet and LNDs? What value do they provide?
>> (this is a genuine question, not being sarcastic).
>>
>> It is a while since I tried to understand LNet, and then it was a
>> fairly superficial look, but I think it is an abstraction layer
>> that provides packet-based send/receive with some numa-awareness
>> and routing functionality. It sits over sockets (TCP) and IB and
>> provides a uniform interface.
>
> LNet is originally based on a high-performance networking stack called
> Portals (v3, http://www.cs.sandia.gov/Portals/), with additions for LNet
> routing to allow cross-network bridging.
>
> A critical part of LNet is that it is for RDMA and not packet-based
> messages. Everything in Lustre is structured around RDMA. Of course,
> RDMA is not possible with TCP so it just does send/receive under the
> covers, though it can do zero copy data sends (and at one time zero-copy
> receives, but those changes were rejected by the kernel maintainers).
> It definitely does RDMA with IB, RoCE, OPA in the kernel, and other RDMA
> network types not in the kernel (e.g. Cray Gemini/Aries, Atos/Bull BXI,
> and previously older network types no longer supported).

Thanks! That will probably help me understand it more easily next time
I dive in.

>
> Even with TCP it has some improvements for performance, such as using
> separate sockets for send and receive of large messages, as well as
> a socket for small messages that has Nagle disabled so that it does
> not delay those packets for aggregation.

That sounds like something that could benefit NFS...
pNFS already partially does this by virtue of the fact that data often
goes to a different server than control, so a different socket is
needed. I wonder if it could benefit from more explicit separation of
message sizes.

Thanks a lot for this background info!
NeilBrown

>
> In addition to the RDMA support, there is also multi-rail support in
> the out-of-tree version that we haven't been allowed to land, which
> can aggregate network bandwidth. While there exists channel bonding
> for TCP connections, that does not exist for IB or other RDMA networks.
>
>> That is almost a description of the xprt layer in sunrpc. sunrpc
>> doesn't have routing, but it does have some numa awareness (for the
>> server side at least) and it definitely provides packet-based
>> send/receive over various transports - tcp, udp, local (unix domain),
>> and IB.
>> So: can we use sunrpc/xprt in place of LNet?
>
> No, that would totally kill the performance of Lustre.
>
>> How much would we need to enhance sunrpc/xprt for this to work? What
>> hooks would be needed to implement the routing as a separate layer.
>>
>> If LNet is, in some way, much better than sunrpc, then can we share that
>> superior functionality with our NFS friends by adding it to sunrpc?
>
> There was some discussion at NetApp about adding a Lustre/LNet transport
> for pNFS, but I don't think it ever got beyond the proposal stage:
>
> https://tools.ietf.org/html/draft-faibish-nfsv4-pnfs-lustre-layout-07
>
>> Maybe the answer to this is "no", but I think LNet would be hard to sell
>> without a clear statement of why that was the answer.
>
> There are other users outside of the kernel tree that use LNet in addition
> to just Lustre. The Cray "DVS" I/O forwarding service[*] uses LNet, and
> another experimental filesystem named Zest[+] also used LNet.
>
> [*] https://www.alcf.anl.gov/files/Sugiyama-Wallace-Thursday16B-slides.pdf
> [+] https://www.psc.edu/images/zest/zest-sc07-paper.pdf
>
>> One reason that I would like to see lustre stay in drivers/staging (so I
>> do not support Greg's patch) is that this sort of transition of Lustre
>> to using an improved sunrpc/xprt would be much easier if both were in
>> the same tree. Certainly it would be easier for a larger community to
>> be participating in the work.
>
> I don't think the proposal to encapsulate all of the Lustre protocol into
> pNFS made a lot of sense, since this would have only really been available
> on Linux, at which point it would be better to use the native Lustre client
> rather than funnel everything through pNFS.
>
> However, _just_ using the LNet transport for (p)NFS might make sense. LNet
> is largely independent from Lustre (it used to be a separate source tree)
> and is very efficient over the network.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
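
For readers unfamiliar with the TCP arrangement Andreas describes (separate sockets for sending and receiving large messages, plus a small-message socket with Nagle disabled), the following is a rough user-space sketch of the idea. It is not LNet code: the names `make_peer_sockets`, `pick_send_socket` and the `SMALL_MESSAGE_LIMIT` cutoff are all hypothetical, and the thread does not say how the real socket LND decides what counts as "small".

```python
import socket

# Hypothetical cutoff between "small" and "large" messages; the thread
# does not say what value (if any) LNet's socket LND actually uses.
SMALL_MESSAGE_LIMIT = 4096


def make_peer_sockets():
    """Create three unconnected TCP sockets for a single peer."""
    socks = {
        "bulk_in": socket.socket(socket.AF_INET, socket.SOCK_STREAM),
        "bulk_out": socket.socket(socket.AF_INET, socket.SOCK_STREAM),
        "control": socket.socket(socket.AF_INET, socket.SOCK_STREAM),
    }
    # Disable Nagle on the small-message socket only, so short messages
    # are not held back waiting to be coalesced with later writes.
    socks["control"].setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return socks


def pick_send_socket(socks, payload_len):
    """Choose the socket for an outgoing message of payload_len bytes."""
    if payload_len <= SMALL_MESSAGE_LIMIT:
        return socks["control"]
    return socks["bulk_out"]
```

The bulk sockets keep Nagle enabled, since large transfers already fill segments; only the latency-sensitive small messages skip the coalescing delay.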