From: NeilBrown <neilb@suse.com>
To: "Dilger, Andreas" <andreas.dilger@intel.com>
Cc: Doug Oucharek <doucharek@cray.com>,
Andreas Dilger <adilger@dilger.ca>,
"devel@driverdev.osuosl.org" <devel@driverdev.osuosl.org>,
Christoph Hellwig <hch@infradead.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
"Drokin, Oleg" <oleg.drokin@intel.com>,
"selinux@tycho.nsa.gov" <selinux@tycho.nsa.gov>,
fsdevel <linux-fsdevel@vger.kernel.org>,
lustre-devel <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] [PATCH] staging: lustre: delete the filesystem from the tree.
Date: Mon, 04 Jun 2018 13:54:55 +1000
Message-ID: <87h8mjp5o0.fsf@notabene.neil.brown.name>
In-Reply-To: <58123CDD-8424-4E1D-A11F-0F899970A49B@intel.com>
On Sun, Jun 03 2018, Dilger, Andreas wrote:
> On Jun 1, 2018, at 17:19, NeilBrown <neilb@suse.com> wrote:
>>
>> On Fri, Jun 01 2018, Doug Oucharek wrote:
>>
>>> Would it makes sense to land LNet and LNDs on their own first? Get
>>> the networking house in order first before layering on the file
>>> system?
>>
>> I'd like to turn that question on its head:
>> Do we need LNet and LNDs? What value do they provide?
>> (this is a genuine question, not being sarcastic).
>>
>> It is a while since I tried to understand LNet, and then it was a
>> fairly superficial look, but I think it is an abstraction layer
>> that provides packet-based send/receive with some numa-awareness
>> and routing functionality. It sits over sockets (TCP) and IB and
>> provides a uniform interface.
>
> LNet is originally based on a high-performance networking stack called
> Portals (v3, http://www.cs.sandia.gov/Portals/), with additions for LNet
> routing to allow cross-network bridging.
>
> A critical part of LNet is that it is for RDMA and not packet-based
> messages. Everything in Lustre is structured around RDMA. Of course,
> RDMA is not possible with TCP so it just does send/receive under the
> covers, though it can do zero copy data sends (and at one time zero-copy
> receives, but those changes were rejected by the kernel maintainers).
> It definitely does RDMA with IB, RoCE, OPA in the kernel, and other RDMA
> network types not in the kernel (e.g. Cray Gemini/Aries, Atos/Bull BXI,
> and previously older network types no longer supported).
Thanks! That will probably help me understand it more easily next time
I dive in.
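[Editorial note: the "zero copy data sends" Andreas mentions for the TCP case have a
userspace analogue in sendfile(2), where the kernel moves file pages straight to a
socket without a copy through userspace. The sketch below is a loopback-only Python
illustration of that mechanism, not LNet's actual kernel-side implementation.]

```python
import socket
import tempfile

# A temp file standing in for a bulk data buffer to be sent.
payload = b"x" * 16384
f = tempfile.TemporaryFile()
f.write(payload)
f.seek(0)

srv = socket.socket()              # listening side
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()              # sending side
cli.connect(srv.getsockname())

# socket.sendfile() uses os.sendfile() on Linux: pages move from the
# page cache to the socket inside the kernel, with no copy through
# userspace -- a zero-copy send.
sent = cli.sendfile(f)
cli.close()

conn, _ = srv.accept()
received = b""
while chunk := conn.recv(65536):
    received += chunk
conn.close()
```

The receive direction has no such shortcut in plain TCP, which matches the point
above that zero-copy receives needed kernel changes that were rejected.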
>
> Even with TCP it has some improvements for performance, such as using
> separate sockets for send and receive of large messages, as well as
> a socket for small messages that has Nagle disabled so that it does
> not delay those packets for aggregation.
That sounds like something that could benefit NFS...
pNFS already partially does this by virtue of the fact that data often
goes to a different server than control, so a different socket is
needed. I wonder if it could benefit from more explicit separation
by message size.
Thanks a lot for this background info!
NeilBrown
>
> In addition to the RDMA support, there is also multi-rail support in
> the out-of-tree version that we haven't been allowed to land, which
> can aggregate network bandwidth. While there exists channel bonding
> for TCP connections, that does not exist for IB or other RDMA networks.
>
>> That is almost a description of the xprt layer in sunrpc. sunrpc
>> doesn't have routing, but it does have some numa awareness (for the
>> server side at least) and it definitely provides packet-based
>> send/receive over various transports - tcp, udp, local (unix domain),
>> and IB.
>> So: can we use sunrpc/xprt in place of LNet?
>
> No, that would totally kill the performance of Lustre.
>
>> How much would we need to enhance sunrpc/xprt for this to work? What
>> hooks would be needed to implement the routing as a separate layer.
>>
>> If LNet is, in some way, much better than sunrpc, then can we share that
>> superior functionality with our NFS friends by adding it to sunrpc?
>
> There was some discussion at NetApp about adding a Lustre/LNet transport
> for pNFS, but I don't think it ever got beyond the proposal stage:
>
> https://tools.ietf.org/html/draft-faibish-nfsv4-pnfs-lustre-layout-07
>
>> Maybe the answer to this is "no", but I think LNet would be hard to sell
>> without a clear statement of why that was the answer.
>
> There are other users outside of the kernel tree that use LNet in addition
> to just Lustre. The Cray "DVS" I/O forwarding service[*] uses LNet, and
> another experimental filesystem named Zest[+] also used LNet.
>
> [*] https://www.alcf.anl.gov/files/Sugiyama-Wallace-Thursday16B-slides.pdf
> [+] https://www.psc.edu/images/zest/zest-sc07-paper.pdf
>
>> One reason that I would like to see lustre stay in drivers/staging (so I
>> do not support Greg's patch) is that this sort of transition of Lustre
>> to using an improved sunrpc/xprt would be much easier if both were in
>> the same tree. Certainly it would be easier for a larger community to
>> be participating in the work.
>
> I don't think the proposal to encapsulate all of the Lustre protocol into
> pNFS made a lot of sense, since this would have only really been available
> on Linux, at which point it would be better to use the native Lustre client
> rather than funnel everything through pNFS.
>
> However, _just_ using the LNet transport for (p)NFS might make sense. LNet
> is largely independent from Lustre (it used to be a separate source tree)
> and is very efficient over the network.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
Thread overview: 16+ messages
2018-06-01 9:11 [PATCH] staging: lustre: delete the filesystem from the tree Greg Kroah-Hartman
2018-06-01 11:41 ` Christoph Hellwig
2018-06-01 18:20 ` Andreas Dilger
2018-06-01 18:25 ` [lustre-devel] " Doug Oucharek
2018-06-01 23:19 ` NeilBrown
2018-06-03 20:34 ` Dilger, Andreas
2018-06-04 3:54 ` NeilBrown [this message]
2018-06-04 3:59 ` Alexey Lyashkov
2018-06-04 4:15 ` Andreas Dilger
2018-06-04 4:17 ` Alexey Lyashkov
2018-06-01 18:30 ` Andreas Dilger
2018-06-01 18:41 ` Oleg Drokin
2018-06-01 19:08 ` Greg Kroah-Hartman
2018-06-04 7:09 ` Christoph Hellwig
2018-06-04 7:14 ` Greg Kroah-Hartman
2018-06-02 0:28 ` [lustre-devel] " NeilBrown