linux-fsdevel.vger.kernel.org archive mirror
From: NeilBrown <neilb@suse.com>
To: "Dilger, Andreas" <andreas.dilger@intel.com>
Cc: Doug Oucharek <doucharek@cray.com>,
	Andreas Dilger <adilger@dilger.ca>,
	"devel@driverdev.osuosl.org" <devel@driverdev.osuosl.org>,
	Christoph Hellwig <hch@infradead.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	"Drokin, Oleg" <oleg.drokin@intel.com>,
	"selinux@tycho.nsa.gov" <selinux@tycho.nsa.gov>,
	fsdevel <linux-fsdevel@vger.kernel.org>,
	lustre-devel <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] [PATCH] staging: lustre: delete the filesystem from the tree.
Date: Mon, 04 Jun 2018 13:54:55 +1000
Message-ID: <87h8mjp5o0.fsf@notabene.neil.brown.name>
In-Reply-To: <58123CDD-8424-4E1D-A11F-0F899970A49B@intel.com>

On Sun, Jun 03 2018, Dilger, Andreas wrote:

> On Jun 1, 2018, at 17:19, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Fri, Jun 01 2018, Doug Oucharek wrote:
>> 
>>> Would it make sense to land LNet and LNDs on their own first?  Get
>>> the networking house in order first before layering on the file
>>> system?
>> 
>> I'd like to turn that question on its head:
>>  Do we need LNet and LNDs?  What value do they provide?
>> (this is a genuine question, not being sarcastic).
>> 
>> It is a while since I tried to understand LNet, and then it was a
>> fairly superficial look, but I think it is an abstraction layer
>> that provides packet-based send/receive with some numa-awareness
>> and routing functionality.  It sits over sockets (TCP) and IB and
>> provides a uniform interface.
>
> LNet is originally based on a high-performance networking stack called
> Portals (v3, http://www.cs.sandia.gov/Portals/), with additions for LNet
> routing to allow cross-network bridging.
>
> A critical part of LNet is that it is for RDMA and not packet-based
> messages.  Everything in Lustre is structured around RDMA.  Of course,
> RDMA is not possible with TCP so it just does send/receive under the
> covers, though it can do zero copy data sends (and at one time zero-copy
> receives, but those changes were rejected by the kernel maintainers).
> It definitely does RDMA with IB, RoCE, OPA in the kernel, and other RDMA
> network types not in the kernel (e.g. Cray Gemini/Aries, Atos/Bull BXI,
> and previously older network types no longer supported).

Thanks!  That will probably help me understand it more easily next time
I dive in.
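
For anyone unfamiliar with the "zero copy data sends" mentioned above,
here is a minimal userspace sketch of the general idea.  It is only an
analogy and not how LNet does it in the kernel: it assumes Python's
socket.sendfile(), which uses sendfile(2) where available (falling back
to plain send() otherwise), so the payload is not copied through a
userspace buffer.  The socketpair and the temporary file are stand-ins
chosen purely for illustration.

```python
import socket
import tempfile

# Illustrative payload; the size is arbitrary.
payload = b"x" * 4096

# A connected pair of stream sockets stands in for a real network link.
a, b = socket.socketpair()

with tempfile.TemporaryFile() as f:
    f.write(payload)
    f.flush()
    f.seek(0)
    # sendfile() hands the file to the kernel instead of read()+send().
    sent = a.sendfile(f)
a.close()

# Drain the receiving side until the sender's close is seen.
received = b""
while True:
    chunk = b.recv(65536)
    if not chunk:
        break
    received += chunk
b.close()
```

The receive side here is an ordinary copying recv(); only the send path
avoids the extra copy, which matches the asymmetry described above.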

>
> Even with TCP it has some improvements for performance, such as using
> separate sockets for send and receive of large messages, as well as
> a socket for small messages that has Nagle disabled so that it does
> not delay those packets for aggregation.

That sounds like something that could benefit NFS...
pNFS already partially does this by virtue of the fact that data often
goes to a different server than control, so a different socket is
needed.  I wonder if it could benefit from a more explicit separation
of message sizes.
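
To make the idea concrete, here is a rough userspace sketch of "one
socket with Nagle disabled for small messages, another left alone for
bulk data".  It is not taken from LNet or sunrpc; open_channel() and
the throwaway loopback listener are hypothetical names used only so
the example runs on its own.  The only real API assumed is the
standard TCP_NODELAY socket option.

```python
import socket

def open_channel(addr, small_messages):
    """Connect to addr; disable Nagle only on the small-message channel."""
    sock = socket.create_connection(addr)
    if small_messages:
        # TCP_NODELAY stops the stack from delaying small writes
        # in order to aggregate them into larger segments.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

# Throwaway local listener so the sketch is self-contained.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(2)
addr = listener.getsockname()

small = open_channel(addr, small_messages=True)
bulk = open_channel(addr, small_messages=False)

# Read the option back to confirm the two channels differ.
small_nodelay = small.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
bulk_nodelay = bulk.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("small:", small_nodelay, "bulk:", bulk_nodelay)

for s in (small, bulk, listener):
    s.close()
```

A caller would then route short control messages to the first socket
and large payloads to the second, so latency-sensitive traffic is not
queued behind bulk transfers.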


Thanks a lot for this background info!
NeilBrown

>
> In addition to the RDMA support, there is also multi-rail support in
> the out-of-tree version that we haven't been allowed to land, which
> can aggregate network bandwidth.  While there exists channel bonding
> for TCP connections, that does not exist for IB or other RDMA networks.
>
>> That is almost a description of the xprt layer in sunrpc.  sunrpc
>> doesn't have routing, but it does have some numa awareness (for the
>> server side at least) and it definitely provides packet-based
>> send/receive over various transports - tcp, udp, local (unix domain),
>> and IB.
>> So: can we use sunrpc/xprt in place of LNet?
>
> No, that would totally kill the performance of Lustre.
>
>> How much would we need to enhance sunrpc/xprt for this to work?  What
>> hooks would be needed to implement the routing as a separate layer?
>> 
>> If LNet is, in some way, much better than sunrpc, then can we share that
>> superior functionality with our NFS friends by adding it to sunrpc?
>
> There was some discussion at NetApp about adding a Lustre/LNet transport
> for pNFS, but I don't think it ever got beyond the proposal stage:
>
> https://tools.ietf.org/html/draft-faibish-nfsv4-pnfs-lustre-layout-07
>
>> Maybe the answer to this is "no", but I think LNet would be hard to sell
>> without a clear statement of why that was the answer.
>
> There are other users outside of the kernel tree that use LNet in addition
> to just Lustre.  The Cray "DVS" I/O forwarding service[*] uses LNet, and
> another experimental filesystem named Zest[+] also used LNet.
>
> [*] https://www.alcf.anl.gov/files/Sugiyama-Wallace-Thursday16B-slides.pdf
> [+] https://www.psc.edu/images/zest/zest-sc07-paper.pdf
>
>> One reason that I would like to see lustre stay in drivers/staging (so I
>> do not support Greg's patch) is that this sort of transition of Lustre
>> to using an improved sunrpc/xprt would be much easier if both were in
>> the same tree.  Certainly it would be easier for a larger community to
>> be participating in the work.
>
> I don't think the proposal to encapsulate all of the Lustre protocol into
> pNFS made a lot of sense, since this would have only really been available
> on Linux, at which point it would be better to use the native Lustre client
> rather than funnel everything through pNFS.
>
> However, _just_ using the LNet transport for (p)NFS might make sense.  LNet
> is largely independent from Lustre (it used to be a separate source tree)
> and is very efficient over the network.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
