* [lustre-devel] Multi-rail networking for Lustre
@ 2015-10-16 10:14 Olaf Weber
  2015-11-12 20:50 ` Amir Shehata
  0 siblings, 1 reply; 5+ messages in thread
From: Olaf Weber @ 2015-10-16 10:14 UTC (permalink / raw)
  To: lustre-devel

As some of you know, I held a presentation at the LAD'15 developers
meeting describing a proposal for implementing multi-rail networking
for Lustre. Some discussion on this list has referenced that talk.
The slides I used can be found here (1MB PDF):

   http://wiki.lustre.org/images/7/79/LAD15_Lustre_Interface_Bonding_Final.pdf

Since I grew up when a slide deck was a pile of transparencies, and
the rule was that you'd used too much text if people could get more
than the bare gist of your talk from the slides, the rest of this mail
is a slide-by-slide paraphrase of the talk. (There is no recording:
unless the other attendees weigh in this is the best you'll get.) A
few points are starred to indicate that they are after-the-talk
additions.


Slide 1: Lustre Interface Bonding

Title - boring.


Slide 2: Interface Bonding

Under various names multi-rail has been a longstanding wishlist
item. The various names do imply technical differences in how people
think about the problem and the solutions they propose. Despite the
title of the presentation, this proposal is best characterized by
"multi-rail", which is the term we've been using internally at SGI.

Fujitsu contributed an implementation at the level of the InfiniBand
LND. It wasn't landed, and some of the reviewers felt that an LNet-
level solution should be investigated instead. This code was a big
influence on how I ended up approaching the problem.

The current proposal is a collaboration between SGI and Intel. There
is in fact a contract and resources have been committed. Whether the
implementation will match this proposal is still an open question:
these are the early days, and your feedback is welcome and can be
taken into account.

The end goal is general availability.


Slide 3: Why Multi-Rail?

At SGI we care because our big systems can have tens of terabytes of
memory, and therefore need a fat connection to the rest of a Lustre
cluster.

An additional complication is that big systems have an internal
"network" (NUMAlink in SGI's case) and it can matter a lot for
performance whether memory is close or remote to an interface. So what
we want is to have multiple interfaces spread throughout a system, and
then be able to use whichever will be most efficient for a particular
I/O operation.


Slide 4: Design Constraints

These are a couple of constraints (or requirements if you prefer) that
the design tries to satisfy.

Mixed-version clusters: it should not be a requirement to update an
entire cluster because of a few multi-rail capable nodes. Moreover, in
mixed-vendor situations, it may not be possible to upgrade an entire
cluster in one fell swoop.

Simple configuration: creating and distributing configuration files,
and then keeping them in sync across a cluster, becomes tricky once
clusters get bigger. So I look for ways to have the systems configure
themselves.

Adaptable: autoconfiguration is nice, but there are always cases where
it doesn't get things quite right. There have to be ways to fine-tune
a system or cluster, or even to completely override the
autoconfiguration.

LNet-level implementation: there are three levels at which you can
reasonably implement multi-rail: LND, LNet, and PortalRPC. An
LND-level solution has as its main disadvantage that you cannot
balance I/O load between LNDs. A PortalRPC-level solution would
certainly follow a common design tenet in networking: "the upper
layers will take care of that". The upper layers just want a reliable
network, thankyouverymuch. LNet seems like the right spot for
something like this. It allows the implementation to be reasonably
self-contained within the LNet subsystem.


Slide 5: Example Lustre Cluster

A simple cluster, used to guide the discussion. Missing in the picture
is the connecting fabric. Note that the UV client is much bigger than
the other nodes.


Slide 6: Mono-rail Single Fabric

The kind of fabric we have today. The UV is starved for I/O.


Slide 7: LNets in a Single Fabric

You can make additional interfaces in the UV useful by defining
multiple LNets in the fabric, and then carefully setting up aliases on
the nodes with only a single interface. This can be done today, but
setting this up correctly is a bit tricky, and involves cluster-wide
configuration. It is not something you'd like to have to retrofit onto
an existing cluster.


Slide 8: Multi-rail Single Fabric

An example of a fabric topology that we want to work well. Some nodes
have multiple interfaces, and when they do they can all be used to
talk to the other nodes.


Slide 9: Multi-rail Dual Fabric

Similar to previous slide, but now with more LNets. Here too the goal
is active-active use of the LNets and all interfaces.


Slide 10: Mixed-Version Clusters

This section of the presentation expands on the first item of Slide 4.


Slide 11: A Single Multi-Rail Node

Assume we install multi-rail capable Lustre only on the UV. Would that
work? It turns out that it should actually work, though there are some
limits to the functionality. In particular, the MGS/MDS/OSS nodes will
not be aware that they know the UV by a number of NIDs, and it may be
best to avoid this by ensuring that the UV always uses the same
interface to talk to a particular node. This gives us the same
functionality as the multiple LNet example of Slide 7, but with a much
less complicated configuration.


Slide 12: Peer Version Discovery

A multi-rail capable node would like to know if any peer node is also
multi-rail capable. The LNet protocol itself isn't properly versioned,
but the LNet ping protocol (not to be confused with the ptlrpc
pinger!) does transport a feature flags field. There are enough bits
available in that field that we can just steal one and use it to
indicate multi-rail capability in a ping reply.

Note that a ping request does not carry any information beyond the
source NID of the requesting node. In particular, it cannot carry
version information to the node being pinged.


Slide 13: Peer Version Discovery

A simple version discovery protocol can be built on LNet ping.

    1) LNet keeps track of all known peers
    2) On first communication, do an LNet ping
    3) The node now knows the peer version

And we get a list of the peer's interfaces for free.
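
To make this concrete, below is a rough sketch of the check. The ping
info layout follows the existing LNet ping structure (pi_features,
pi_nnis, pi_ni[]); the LNET_PING_FEAT_MULTI_RAIL flag and the lnet_peer
fields and helpers are names invented for this sketch, not settled
parts of the design.

     /* Sketch only: LNET_PING_FEAT_MULTI_RAIL, lp_multi_rail and
      * lnet_peer_add_nid() are invented names. */
     #define LNET_PING_FEAT_MULTI_RAIL (1 << 2)  /* assumed-free bit */

     static void
     lnet_peer_handle_ping_reply(struct lnet_peer *lp,
                                 struct lnet_ping_info *info)
     {
         int i;

         /* The feature flags in the ping reply tell us whether the
          * peer runs multi-rail capable code. */
         lp->lp_multi_rail = !!(info->pi_features &
                                LNET_PING_FEAT_MULTI_RAIL);

         /* The reply also lists the peer's NIDs, so we learn all of
          * its interfaces for free. */
         for (i = 0; i < info->pi_nnis; i++)
             lnet_peer_add_nid(lp, info->pi_ni[i].ns_nid);
     }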


Slide 14: Easy Configuration

This section of the presentation expands on the second item of Slide 4.


Slide 15: Peer Interface Discovery

The list of interfaces of a peer is all we need to know for the simple
cases. With that we know the peer under all its aliases, and can
determine whether any of the other local interfaces (for example those
on different LNets) can talk to the same peer.

Now the peer also needs to know the node's interfaces. It would be
nice if there was a reliable way to get the peer to issue an LNet ping
to the node. For the most basic situation this works, but once I
looked at more complex situations it became clear that this cannot be
done reliably. So instead I propose to just have the node push a list
of its interfaces to the peer.


Slide 16: Peer Interface Discovery

The push of the list is much like an LNet ping, except it does an
LNetPut() instead of an LNetGet().

This should be safe on several grounds. An LNet router doesn't do deep
inspection of Put/Get requests, so even a downrev router will be able
to forward them. If such a Put somehow ends up at a downrev peer, the
peer will silently drop the message. (The slide says a protocol error
will be returned, which is wrong.)
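
For comparison, a rough sketch of the two calls is below. The push
mirrors the existing ping: the ping pulls the peer's interface list
with LNetGet(), the push would send our own list with LNetPut().
Reusing the ping portal and match bits, and the lnet_peer_push()
wrapper itself, are assumptions made for this sketch; mdh is assumed
to be an MD already bound to this node's own interface list.

     /* Sketch: push is ping with LNetPut() instead of LNetGet().
      * Reusing LNET_RESERVED_PORTAL/LNET_PROTO_PING_MATCHBITS here is
      * an assumption made for illustration. */
     static int
     lnet_peer_push(lnet_process_id_t peer, lnet_handle_md_t mdh)
     {
         /* Ping (existing): pull the peer's interface list.
          *
          *   LNetGet(LNET_NID_ANY, mdh, peer, LNET_RESERVED_PORTAL,
          *           LNET_PROTO_PING_MATCHBITS, 0);
          */

         /* Push (proposed): send our own interface list to the peer. */
         return LNetPut(LNET_NID_ANY, mdh, LNET_NOACK_REQ, peer,
                        LNET_RESERVED_PORTAL, LNET_PROTO_PING_MATCHBITS,
                        0, 0);
     }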


Slide 17: Configuring Interfaces on a Node

How does a node know its own interfaces? This can be done in a way
similar to the current methods: kernel module parameters and/or
DLC. These use the same in-kernel parser, so the syntax is similar in
either case.

     networks=o2ib(ib0,ib1)

This is an example where two interfaces are used in the same LNet.

     networks=o2ib(ib0[2],ib1[6])[2,6]

The same example annotated with CPT information. This refers back to
Slide 3: on a big NUMA system it matters to be able to place the
helper threads for an interface close to that interface.

* Of course that information is also available in the kernel, and with
    a few extensions to the CPT mechanism, the kernel could itself find
    the node to which an interface is connected, then find the/a CPT
    that contains CPUs on that node.
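
For reference, with the module-parameter style of configuration the
annotated form above might end up in a modprobe fragment like the one
below; note that the multi-interface syntax is the proposed extension,
not something current LNet accepts.

     # /etc/modprobe.d/lustre.conf -- proposed syntax, for illustration
     options lnet networks="o2ib(ib0[2],ib1[6])[2,6]"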


Slide 18: Configuring Interfaces on a Node

LNet uses credits to determine whether a node can send something
across an interface or to a peer. These credits are assigned
per-interface, for both local and peer credits. So more interfaces
means more credits overall. The defaults for credit-related tunables
can stay the same. On LNet routers, which do have multiple interfaces,
these tunables are already interpreted per interface.


Slide 19: Dynamic Configuration

There is some scope for supporting hot-plugging interfaces. When
adding an interface, enable then push. When removing an interface,
push then disable.

Note that removing the interface with the NID by which a node is known
to the MGS (MDS/...) might not be a good idea. If additional
interfaces are present then existing connections can remain active,
but establishing new ones becomes a problem.

* This is a weakness of this proposal.


Slide 20: Adaptable

This section of the presentation expands on the third item of Slide 4.


Slide 21: Interface Selection

Selecting a local interface to send from, and a peer interface to send
to, can use a number of rules; a rough sketch of how they might combine
is shown after this list.

- Direct connection preferred: by default, don't go through an LNet
    router unless there is no other path. Note that today an LNet router
    will refuse to forward traffic if it believes there is a direct
    connection between the node and the peer.

- LNet network type: since using TCP only is the default, it also
    makes sense to have a default rule that if a non-TCP network has been
    configured, then that network should be used first. (As with all such
    rules, it must be possible to override this default.)

- NUMA criteria: pick a local interface that (i) can reach the peer,
    (ii) is close to the memory used for the I/O, and (iii) is close to
    the CPU driving the I/O.

- Local credits: pick a local interface depending on the availability
    of credits. Credits are a useful indicator for how busy an interface
    is. Systematically choosing the interface with the most available
    credits should get you something resembling a round-robin
    strategy. And this can even be used to balance across heterogeneous
    interfaces/fabrics.

- Peer credits: pick a peer interface depending on the availability of
    peer credits. Then pick a local interface that connects to this peer
    interface.

- Other criteria, namely...
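
A rough sketch of how a selection pass might combine these rules
follows. All of the field and helper names are invented for
illustration; only the shape of the loop matters.

     /* Sketch only: lnet_peer_reachable_via(), lnet_ni_numa_distance()
      * and lnet_ni_credits() are invented helpers. */
     static struct lnet_ni *
     lnet_select_local_ni(struct lnet_peer *peer, int numa_node)
     {
         struct lnet_ni *ni, *best = NULL;
         int best_dist = INT_MAX;
         int best_credits = INT_MIN;
         int dist, credits;

         list_for_each_entry(ni, &the_lnet.ln_nis, ni_list) {
             /* The interface must be able to reach the peer,
              * preferably without going through an LNet router. */
             if (!lnet_peer_reachable_via(peer, ni))
                 continue;

             /* Prefer interfaces close to the memory and CPU driving
              * the I/O: smaller NUMA distance wins. */
             dist = lnet_ni_numa_distance(ni, numa_node);
             if (dist > best_dist)
                 continue;

             /* Among equally close interfaces, prefer the one with the
              * most send credits left; picking the least busy interface
              * approximates round-robin and balances heterogeneous
              * fabrics. */
             credits = lnet_ni_credits(ni);
             if (dist < best_dist || credits > best_credits) {
                 best = ni;
                 best_dist = dist;
                 best_credits = credits;
             }
         }
         return best;
     }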


Slide 22: Routing Enhancements

The fabric connecting nodes in a cluster can have a complicated
topology. So we can have cases where a node has two interfaces N1,N2, and
a peer has two interfaces P1,P2, all on the same LNet, yet N1-P1 and
N2-P2 are preferred paths, while N1-P2 and N2-P1 should be avoided.

So there should be ways to define preferred point-to-point connections
within an LNet. This solves the N1-P1 problem mentioned above.

There also need to be ways to define a preference for using one LNet
over another, possibly for a subset of NIDs. This is the mechanism by
which the "anything but TCP" default can be overruled.

The existing syntax for LNet routing can easily(?) be extended to
cover these cases.
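
As an illustration only: the first line below is the existing routes
syntax; the commented lines sketch the kind of extra rules being
proposed, with syntax invented on the spot and no claim that this is
what the final form would look like.

     # Existing syntax: reach o2ib1 via two gateways on tcp0.
     options lnet routes="o2ib1 10.10.0.[1-2]@tcp0"

     # Invented, illustrative syntax for the proposed rules:
     #  - prefer a specific local/peer interface pair within an LNet
     #  - prefer one LNet over another for a subset of peer NIDs
     #options lnet prefer_pairs="192.168.1.1@o2ib 192.168.1.101@o2ib"
     #options lnet prefer_nets="o2ib tcp 10.2.0.[1-128]@tcp"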


Slide 23: Extra Considerations

As you may have noticed, I'm looking for ways to be NUMA friendly. But
one thing I want to avoid is having Lustre nodes know too much about
the topology of their peers. How much is too much? I draw the line at
them knowing anything at all.

At the PortalRPC level each RPC is a request/response pair. (This is in
contrast to the LNet level put/ack and get/reply pairs that make up
the request and the response.)

The PortalRPC layer is told the originating interface of a request. It
then sends the response to that same interface. The node sending the
request is usually a client -- especially when a large data transfer is
involved -- and this is a simple way to ensure that whatever NUMA-aware
policies it used to select the originating interface are also honored
when the response arrives.
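
A sketch of this heuristic on the server side is below. The rq_self and
rq_peer fields are what PtlRPC already records about an incoming
request; lnet_set_reply_hint() is an invented placeholder for whatever
mechanism would actually pass the hint down to LNet.

     /* Sketch: reply to the interface the request came from.
      * lnet_set_reply_hint() is an invented placeholder. */
     static int
     ptlrpc_reply_to_origin(struct ptlrpc_request *req)
     {
         /* req->rq_peer.nid: the peer interface the request came from.
          * req->rq_self:     the local interface it arrived on.
          * Preferring this pair keeps the client's NUMA-aware interface
          * choice intact on the return path. */
         lnet_set_reply_hint(req->rq_self, req->rq_peer.nid);

         return ptlrpc_send_reply(req, 0);
     }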


Slide 24: Extra Considerations

If for some reason the peer cannot send a message to the originating
interface, then any other interface will do. This is an event worth
logging, as it indicates a malfunction somewhere, and after that just
keeping the cluster going should be the prime concern.

Trying all local-remote interface pairs might not be a good idea:
there can be too many combinations and the cumulative timeouts become
a problem.

To avoid timeouts at the PortalRPC level, LNet may already need to
start resending a message long before the "official" below-LND-level
timeout for message arrival has expired.

The added network resiliency is limited. As noted for Slide 19, if the
interface that fails has the NID by which a node is primarily
known, establishing new connections to that node becomes impossible.


Slide 25: Extra Considerations

Failing nodes can be used to construct some very creative
scenarios. For example, if a peer reboots with downrev software, LNet
on a node will not be able to tell by itself. But in this case the
PortalRPC layer can signal to LNet that it needs to re-check the peer.

NID reuse by different nodes is also a scenario that introduces a lot
of complications. (Arguably it already does so today.)

If needed, it might be possible to sneak a 32-bit identifying cookie
into the NID each node reports on the loopback network. Whether this
would actually be useful (and for that matter how such cookies would
be assigned) is not clear.
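
The packing itself would be trivial; the sketch below only shows where
the 32 bits would go, using the standard NID-construction macros. The
helper name is invented.

     /* Sketch: pack a 32-bit cookie into the address part of the NID
      * reported on the loopback network.  How cookies would be
      * assigned, and whether this helps at all, is left open. */
     static lnet_nid_t
     lnet_loopback_cookie_nid(__u32 cookie)
     {
         /* LNET_MKNET(LOLND, 0) is the loopback network; the low 32
          * bits of a NID are the address within that network. */
         return LNET_MKNID(LNET_MKNET(LOLND, 0), cookie);
     }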


Slide 26: LNet-level Implementation

This section of the presentation expands on the fourth item of Slide 4.


Slide 27-29: Implementation Notes

A staccato of notes on how to implement bits and pieces of the above.
There's too much text in the slides already, so I'm not paraphrasing.


Slide 30: Implementation Notes

This slide gives a plausible way to cut the work into smaller pieces
that can be implemented as self-contained bits.

     1) Split lnet_ni
     2) Local interface selection
     *) Routing enhancements for local interface selection
     3) Split lnet_peer
     4) Ping on connect
     5) Implement push
     6) Peer interface selection
     7) Resending on failure
     8) Routing enhancements

There's of course no guarantee that this division will survive the
actual coding. But if it does, then note that after step 2 is
implemented, the configuration of Slide 11 (single multi-rail node)
should already be working.


Slide 31: Feedback & Discussion

Looking forward to further feedback & discussion here.


Slide 32:

End title - also boring.


Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Sr Software Engineer       3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf at sgi.com


* [lustre-devel] Multi-rail networking for Lustre
  2015-10-16 10:14 [lustre-devel] Multi-rail networking for Lustre Olaf Weber
@ 2015-11-12 20:50 ` Amir Shehata
       [not found]   ` <569E5BCA.5030705@sgi.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Amir Shehata @ 2015-11-12 20:50 UTC (permalink / raw)
  To: lustre-devel

To build up on this email trail, the Scope and Requirements document is now
available here:

http://wiki.lustre.org/Multi-Rail_LNet

Exact link:

http://wiki.lustre.org/images/7/73/Multi-Rail%2BScope%2Band%2BRequirements%2BDocument.pdf

thanks
amir



* [lustre-devel] Multi-rail networking for Lustre
       [not found]                   ` <56A13FDB.2050902@sgi.com>
@ 2016-01-22  9:08                     ` Alexey Lyashkov
  2016-01-22 14:31                       ` Olaf Weber
  0 siblings, 1 reply; 5+ messages in thread
From: Alexey Lyashkov @ 2016-01-22  9:08 UTC (permalink / raw)
  To: lustre-devel

On Thu, Jan 21, 2016 at 11:30 PM, Olaf Weber <olaf@sgi.com> wrote:

> On 21-01-16 20:16, Alexey Lyashkov wrote:
>
>>
>>         why uuid can't be used as we have already such type of identifier
>>         for the peer?
>>
>>
>>     Rechecking the code to see how UUIDs are generated and where they are
>>     used, as far as I can tell Lustre doesn't have true per-peer UUIDs.
>>
>>
>>     A client doesn't identify itself by UUID to the servers. Instead it
>>     identifies a mount of a lustre filesystem by UUID. If it mounts two
>>     filesystems it will use different UUIDs for each.
>>
>> Client is identify by UUID. if server want to send something to client -
>> it
>> uses an export and client uuid.
>> none NID's uses in that case. if you check code again - you will don't see
>> any NID's until ptlrpc_send_reply issued. So it's single function where
>> NID
>> from connection present.
>>
>
> The UUID does not identify a client. The UUID identifies a client+mount.
> If the client mounts more than one filesystem, there will be different UUID
> for each mount on that client.
>
In lustre terms each mount point is separated client. It have own cache,
own structures, and completely separated each from an other.
One exceptions it's ldlm cache which live on global object id space.



> At PtlRPC level, these are different connections. At LNet level, the same
> set of NIDs is used. It is that LNet level that I work on.
>
>



>     And the UUID used for servers is derived from the first NID of the list
>>     of NIDs for that server. If the same server shows up in multiple lists
>>     (for different objects) with the NIDs in different orders, then
>>     different UUIDs will be generated.
>>
>>
>> really?
>> LustreError: 137-5: lustre-OST0000_UUID: not available for connect from
>> 0@lo
>> (no target). If you are running an HA pair check that the target is
>> mounted
>> on the other server.
>>
>> where is NID present at UUID from that message ?
>>
>
> The NID 'req->rq_peer.nid' is translated to '0@lo' in this message.
>

I am talking about the NID substring in the UUID in that message; it has
not existed for a long time.


>
> The "lustre-OST0000_UUID" is more interesting. Note that it identifies an
> OST, as opposed to an OSS. Which is the point I am trying to make: this
> UUID identifies an OST; it does not identify an OSS.


All lustre stack operate with UUID, and it have none differences when it
UUID live. We may migrate service / client from one network address to
another, without logical reconnect. It's my main objections against you
ideas.
If none have a several addresses LNet should be responsible to reliability
delivery a one-way requests. which is logically connect to PtlRPC. If node
will be need to use different routing and different NID's for communication
- it's should be hide in LNet, and LNet should provide as high api as
possible.




>
>
>>
>> I expect you know about situation when one DNS name have several addresses
>> like several 'A' records in dns zone file.
>>
>
> Sure, but when one name points to several machines, it does not help me
> balance traffic over the interfaces of just one machine.


Simple balance may be DNS based - just round robin, as we have now on IB /
sock lnd. it isn't balance?
If you talk about more serious you should start from good flow control
between nodes. Probably Ideas from RIP and LACK protocols will help.





>
>
>     I want something that identifies the machine -- as opposed to a single
>>     interface on the machine -- so that LNet can make an intelligent
>> choice
>>     between the interfaces. And I want PtlRPC to supply hints as to which
>>     interface it expects to be best: PtlRPC is the one layer that knows
>>     which Get and Put calls are part of the same RPC, which is required to
>>     be able to generate these hints.
>>
>>
>> It have none differences how you will deliver a reply to same XID. I may
>> construct a network when request will send with IB but reply will accept
>> via
>> TCP and it's will work. how PtlRPC will help in that case? how one side
>> will
>> know which path is better for different ?
>> Why you think moving low level network knowleage to high level of ptlrpc
>> will good?
>> it's totally different levels - LNet must response to routing and finding
>> "which is best", but PtlRPC connect from two one-way messages into
>> request-reply protocol with high level processing.
>>
>
> A PtlRPC RPC has structure. The first LNetPut() transmits just the header
> information. Then one or more LNetPut() or LNetGet() messages are done to
> transmit the rest of the request. Then the response follows, which also
> consists of several LNetPut() or LNetGet() messages.
>
It's wrong. Looks you mix an RPC and bulk transfers.

In the RPC case it is a logical connection of two one-way messages, each
sent via a single LNetPut() call (the first from the client side, the
second from the server side). They are tied together on the LNet layer
via the ME (XID). PtlRPC registers an ME entry and sends that
information as part of the request message to the server. The server
takes that info from the request and posts its send pointing to that ME
(XID), nothing else.

If you talk about bulk transfers the situation is slightly different.
The client sends an XID to the server and marks the ME/MD as remotely
controlled. In that case the server may call LNetGet()/LNetPut() to
transfer the data from the client based on the XID (an XID range in the
multi-bulk case). But again, it is just a connection on XID / ME /
match bits (different names for one object).





> The key word is "heuristic": have a server assume that traffic related to
> a request should prefer to use the source NID of that request. This is a
> simple way for a node that cares about these things to have a server do the
> right thing, without requiring that the server know how the internals of
> how the node is put together.
>
> For some reason you seem hung up on the idea that it does not matter which
> interfaces are used by network traffic. Our experience at SGI is that it
> does matter on big computers. Therefore we take it into account in the
> design. Therefore there is this apparent layer violation in the design.
>
> Memory place



>                      And there is also a catch, because there are cases
>>         where PtlRPC
>>                      has a valid interest in declaring not just that it
>>         wants to talk
>>                      to a node, but also that it wants to talk to a
>> specific
>>         NID on
>>                      that node.
>>
>>
>>                  None valid scenario for it. Looks you think PtlRPC is
>> good
>>         place for
>>                  routing
>>                  information if think it correct.
>>
>>
>>              For this case I'm thinking in particular of memory and CPU
>>         locality in
>>              big systems. (Think of a big system as itself being built
>> from
>>         nodes
>>              connected by a network.)
>>
>>         I don't say something about memory and CPU. I say about network
>> routing.
>>         PtlRPC choose a new connection and it connection have a
>>         source<>destination
>>         NIDs relation inside, so each new LNetPut will use that NID's to
>>         send info
>>         to such UUID. it's say some of network routing code is on PtlRPC
>> but
>>         should
>>         be on LNet layer. Where CPU and Memory is?
>>
>>
>>     The thread that initiates an RPC runs on a specific CPU, and may well
>> be
>>     bound there by a cpuset. This is common practice on big systems. The
>>     memory buffers involved in the RPC live on a specific part of the
>>     system, and are closer to some CPUs and some interfaces than others.
>>     That's where CPU and memory locality comes in, from my perspective.
>>
>> I mean about network distance, but you are connect local and non local
>> memory to that discussions.
>> Why? local and non local memory problem is outside of Multi rail problem.
>>
>
> For me, at SGI, local and non local memory is an integral part of the
> multi rail problem.


Can multi-rail (bonding) work without any knowledge about local and
non-local memory? I think so, as it is just a network transport. So
local and non-local memory is a different problem, and it would be good
to separate it into a different design, to make each design as simple
as possible. That will help with developing the code, testing and
integration.



>
>
> int ptlrpc_uuid_to_peer (struct obd_uuid *uuid,
>>                           lnet_process_id_t *peer, lnet_nid_t *self)
>> {
>>          int               best_dist = 0;
>>          __u32             best_order = 0;
>>          int               count = 0;
>>          int               rc = -ENOENT;
>>          int               portals_compatibility;
>>          int               dist;
>>          __u32             order;
>>          lnet_nid_t        dst_nid;
>>          lnet_nid_t        src_nid;
>>
>>          portals_compatibility = LNetCtl(IOC_LIBCFS_PORTALS_COMPATIBILITY,
>> NULL);
>>
>>          peer->pid = LNET_PID_LUSTRE;
>>
>>          /* Choose the matching UUID that's closest */
>>          while (lustre_uuid_to_peer(uuid->uuid, &dst_nid, count++) == 0) {
>>                  dist = LNetDist(dst_nid, &src_nid, &order);
>>                  if (dist < 0)
>>                          continue;
>>
>> this code will don't work for you if you introduce a some abstract NID, as
>> PtlRPC will not able to find best distance.
>>
>
> The lustre_uuid_to_peer() function enumerates all NIDs associated with the
> UUID. This includes the primary NID, but also includes the other NIDs. So
> we find a preferred peer NID based on that. Then we modify the code like
> this:
>
Why PtlRPC should be know that low level details? Currently we have a
problems - when one of destination NID's is unreachable and transfer
initiator need a full ptlrpc reconnect to resend to different NID. But as
you should be have a resend


>
> The call of LNetPrimaryNID() gives the primary peer NID for the peer NID.
> For this to work a handful of calls to LNetPrimaryNID() must be added.
> After that it is up to LNet to find the best route.
>

Per our's comment PrimaryNID will changed after we will find a best, did
you think it loop usefull if you replace loop result at anycases ?
from other view ptlrpc_uuid_to_peer called only in few cases, all other
time ptlrpc have a cache a results in ptlrpc connection info.



>


* [lustre-devel] Multi-rail networking for Lustre
  2016-01-22  9:08                     ` Alexey Lyashkov
@ 2016-01-22 14:31                       ` Olaf Weber
  2016-01-22 20:06                         ` Alexey Lyashkov
  0 siblings, 1 reply; 5+ messages in thread
From: Olaf Weber @ 2016-01-22 14:31 UTC (permalink / raw)
  To: lustre-devel

On 22-01-16 10:08, Alexey Lyashkov wrote:
>
>
> On Thu, Jan 21, 2016 at 11:30 PM, Olaf Weber <olaf@sgi.com
> <mailto:olaf@sgi.com>> wrote:
>
>     On 21-01-16 20:16, Alexey Lyashkov wrote:

[...]

> In lustre terms each mount point is separated client. It have own cache, own
> structures, and completely separated each from an other.
> One exceptions it's ldlm cache which live on global object id space.

Another exception is flock deadlock detection, which is always a global 
operation. This is why ldlm_flock_deadlock() inspects c_peer.nid.

[...]

> All lustre stack operate with UUID, and it have none differences when it
> UUID live. We may migrate service / client from one network address to
> another, without logical reconnect. It's my main objections against you ideas.
> If none have a several addresses LNet should be responsible to reliability
> delivery a one-way requests. which is logically connect to PtlRPC. If node
> will be need to use different routing and different NID's for communication
> - it's should be hide in LNet, and LNet should provide as high api as possible.

The basic idea behind the multi-rail design is that LNet figures out how to 
send a message to a peer. But the user of LNet can provide a hint to 
indicate that for a specific message a specific path is preferred.

One of our goals is to keep changes to the LNet API small.

>         I expect you know about situation when one DNS name have several
>         addresses
>         like several 'A' records in dns zone file.
>
>
>     Sure, but when one name points to several machines, it does not help me
>     balance traffic over the interfaces of just one machine.
>
>
> Simple balance may be DNS based - just round robin, as we have now on IB /
> sock lnd. it isn't balance?
 > If you talk about more serious you should start from good flow control
 > between nodes. Probably Ideas from RIP and LACK protocols will help.

There is bonding/balancing in socklnd. There is none in o2iblnd.

[...]

>     A PtlRPC RPC has structure. The first LNetPut() transmits just the
>     header information. Then one or more LNetPut() or LNetGet() messages are
>     done to transmit the rest of the request. Then the response follows,
>     which also consists of several LNetPut() or LNetGet() messages.
>
> It's wrong. Looks you mix an RPC and bulk transfers.

Difference in terminology: I tend to think of an RPC as a request/response 
pair (if there is a response), and these in turn include all traffic related 
to the RPC, including any bulk transfers.

[...]

>     The lustre_uuid_to_peer() function enumerates all NIDs associated with
>     the UUID. This includes the primary NID, but also includes the other
>     NIDs. So we find a preferred peer NID based on that. Then we modify the
>     code like this:
>
> Why PtlRPC should be know that low level details? Currently we have a
> problems - when one of destination NID's is unreachable and transfer
> initiator need a full ptlrpc reconnect to resend to different NID. But as
> you should be have a resend

Within LNet a resend can be triggered from lnet_finalize() after a failed 
attempt to send the message has been decommitted. (Otherwise multiple send 
attempts will need to be tracked at the same time.)

>     The call of LNetPrimaryNID() gives the primary peer NID for the peer
>     NID. For this to work a handful of calls to LNetPrimaryNID() must be
>     added. After that it is up to LNet to find the best route.
>
>
> Per our's comment PrimaryNID will changed after we will find a best, did you
> think it loop usefull if you replace loop result at anycases ?
> from other view ptlrpc_uuid_to_peer called only in few cases, all other time
> ptlrpc have a cache a results in ptlrpc connection info.

The main benefit of the loop becomes detecting whether the node is sending 
to itself, in which case the loopback interface must be used. Though I do 
worry about degenerate or bad configurations where not all the IP addresses 
belong to the same node.

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Sr Software Engineer       3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf at sgi.com


* [lustre-devel] Multi-rail networking for Lustre
  2016-01-22 14:31                       ` Olaf Weber
@ 2016-01-22 20:06                         ` Alexey Lyashkov
  0 siblings, 0 replies; 5+ messages in thread
From: Alexey Lyashkov @ 2016-01-22 20:06 UTC (permalink / raw)
  To: lustre-devel

On Fri, Jan 22, 2016 at 5:31 PM, Olaf Weber <olaf@sgi.com> wrote:

> On 22-01-16 10:08, Alexey Lyashkov wrote:
>
>>
>>
>> On Thu, Jan 21, 2016 at 11:30 PM, Olaf Weber <olaf@sgi.com
>> <mailto:olaf@sgi.com>> wrote:
>>
>>     On 21-01-16 20:16, Alexey Lyashkov wrote:
>>
>
> [...]
>
> In lustre terms each mount point is separated client. It have own cache,
>> own
>> structures, and completely separated each from an other.
>> One exceptions it's ldlm cache which live on global object id space.
>>
>
> Another exception is flock deadlock detection, which is always a global
> operation. This is why ldlm_flock_deadlock() inspects c_peer.nid.
>
flock is part of ldlm.




> [...]
>
> All lustre stack operate with UUID, and it have none differences when it
>> UUID live. We may migrate service / client from one network address to
>> another, without logical reconnect. It's my main objections against you
>> ideas.
>> If none have a several addresses LNet should be responsible to reliability
>> delivery a one-way requests. which is logically connect to PtlRPC. If node
>> will be need to use different routing and different NID's for
>> communication
>> - it's should be hide in LNet, and LNet should provide as high api as
>> possible.
>>
>
> The basic idea behind the multi-rail design is that LNet figures out how
> to send a message to a peer. But the user of LNet can provide a hint to
> indicate that for a specific message a specific path is preferred.
>
It's a good idea, but routing is part of that idea. Routing may change
at any time, but PtlRPC should avoid resending a request, as that needs
a full logical reconnect, and that operation isn't fast and isn't light
for a server. What hint do you want from PtlRPC? It has a single
responsibility: send some buffer to some UUID. PtlRPC (or some upper
code) may send a QoS hint, and probably a hint about local memory, to
avoid access from another NUMA node; what else?


> One of our goals is to keep changes to the LNet API small.


And to avoid that you add one more conversion, UUID -> abstract NID ->
... real NID, while a direct conversion UUID -> ... real NID is
possible and provides a better result in case you need to hide network
topology changes.




>
>
>         I expect you know about situation when one DNS name have several
>>         addresses
>>         like several 'A' records in dns zone file.
>>
>>
>>     Sure, but when one name points to several machines, it does not help
>> me
>>     balance traffic over the interfaces of just one machine.
>>
>>
>> Simple balance may be DNS based - just round robin, as we have now on IB /
>> sock lnd. it isn't balance?
>>
> > If you talk about more serious you should start from good flow control
> > between nodes. Probably Ideas from RIP and LACK protocols will help.
>
> There is bonding/balancing in socklnd. There is none in o2iblnd.
>
You are one of the inspectors for http://review.whamcloud.com/#/c/14625/,
and I don't see any architectural objections there, except for fixing
some bugs in the code. So I may say it exists, and Fujitsu uses it in
production (it was presented at the LAD developer summit).


> [...]
>
>     A PtlRPC RPC has structure. The first LNetPut() transmits just the
>>     header information. Then one or more LNetPut() or LNetGet() messages
>> are
>>     done to transmit the rest of the request. Then the response follows,
>>     which also consists of several LNetPut() or LNetGet() messages.
>>
>> It's wrong. Looks you mix an RPC and bulk transfers.
>>
>
> Difference in terminology: I tend to think of an RPC as a request/response
> pair (if there is a response), and these in turn include all traffic
> related to the RPC, including any bulk transfers.


"The first LNetPut() transmits just the header information." what you mean
about "header". If we talk about bulk transfer protocol - first LNetPut
will transfer an RPC body which uses to setup an bulk transfer.



> [...]
>
>     The lustre_uuid_to_peer() function enumerates all NIDs associated with
>>     the UUID. This includes the primary NID, but also includes the other
>>     NIDs. So we find a preferred peer NID based on that. Then we modify
>> the
>>     code like this:
>>
>> Why PtlRPC should be know that low level details? Currently we have a
>> problems - when one of destination NID's is unreachable and transfer
>> initiator need a full ptlrpc reconnect to resend to different NID. But as
>> you should be have a resend
>>
>
> Within LNet a resend can be triggered from lnet_finalize() after a failed
> attempt to send the message has been decommitted. (Otherwise multiple send
> attempts will need to be tracked at the same time.)

PtlRPC and LNet may have different timeout windows for now, so the
timeout on PtlRPC can be smaller than the LNet LND one. In that case
you will have lots of TXs in flight while a new TX is allocated. That
is tracking multiple send attempts at the same time. It's not good from
a request latency perspective, but it is an easy way to change the
routing for now. With a single "Primary NID" you will lose this
functionality, or need to implement something similar.




>
>
>     The call of LNetPrimaryNID() gives the primary peer NID for the peer
>>     NID. For this to work a handful of calls to LNetPrimaryNID() must be
>>     added. After that it is up to LNet to find the best route.
>>
>>
>> Per our's comment PrimaryNID will changed after we will find a best, did
>> you
>> think it loop usefull if you replace loop result at anycases ?
>> from other view ptlrpc_uuid_to_peer called only in few cases, all other
>> time
>> ptlrpc have a cache a results in ptlrpc connection info.
>>
>
> The main benefit of the loop becomes detecting whether the node is sending
> to itself, in which case the loopback interface must be used. Though I do
> worry about degenerate or bad configurations where not all the IP addresses
> belong to the same node.
>
You can work without the loopback driver; this was checked with socklnd
and works fine. But currently that loop is used to sort connections
into the right order, fast connections on top. It has the assumption
that the network topology does not change at run time.



-- 
Alexey Lyashkov - Technical lead for the Morpheus team
Seagate Technology, LLC
www.seagate.com
www.lustre.org

