* [RFC] RDMA bridge for ROCE and Infiniband
@ 2021-11-12 12:04 Christoph Lameter
From: Christoph Lameter @ 2021-11-12 12:04 UTC (permalink / raw)
  To: linux-rdma; +Cc: Leon Romanovsky, Jason Gunthorpe, rpearson, Doug Ledford

We have a larger Infiniband deployment and want to gradually move servers
to a new fabric that uses RDMA over Ethernet (ROCE). The IB fabric is
mission critical and complex due to the need for various failover
mechanisms, one of them being a secondary IB fabric.

Resource-wise it is not possible to simply rebuild the system using ROCE
and then switch over.

Both fabrics use RDMA, and we need some way for nodes on each cluster to
communicate with each other. We do not really need memory-to-memory
transfers; what we use from the RDMA stack is basically messaging and
multicast, so the core services of the RDMA stack are not needed. UD/UDP
is also sufficient; the other protocols may not be necessary to support.

Any ideas on how to do this would be appreciated. I have not found anything
that could help us here, so we are interested in creating a new piece of
open source RDMA software that allows the bridging of native IB traffic to ROCE.


Basic Design
------------
Let's say we have a single system that functions as a *bridge* and has one
interface going to Infiniband and one going to Ethernet with ROCE.

In our use case we do not need any "true" RDMA in the sense of
memory-to-memory transfers. We only need UDP and UD messaging and support
for multicast.

In order to simplify the multicast aspects, the bridge will simply
subscribe to the multicast groups of interest when the bridge software
starts up.
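
A minimal sketch of that startup join, using librdmacm's
rdma_join_multicast() (addresses are placeholders, error handling is
abbreviated, and a real bridge would of course keep the id around):

/* Sketch: join one multicast group on the IB side at bridge startup.
 * Follows the usual librdmacm multicast flow; addresses are placeholders.
 */
#include <rdma/rdma_cma.h>
#include <arpa/inet.h>
#include <stdio.h>

static int join_group(const char *local_ip, const char *mcast_ip)
{
        struct rdma_event_channel *ch = rdma_create_event_channel();
        struct rdma_cm_id *id;
        struct rdma_cm_event *event;
        struct sockaddr_in local = { .sin_family = AF_INET };
        struct sockaddr_in mcast = { .sin_family = AF_INET };

        inet_pton(AF_INET, local_ip, &local.sin_addr);
        inet_pton(AF_INET, mcast_ip, &mcast.sin_addr);

        if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_UDP))
                return -1;
        if (rdma_bind_addr(id, (struct sockaddr *)&local)) /* picks the IB port */
                return -1;
        if (rdma_join_multicast(id, (struct sockaddr *)&mcast, NULL))
                return -1;

        /* Wait for the SA to confirm the join. */
        if (rdma_get_cm_event(ch, &event))
                return -1;
        if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
                rdma_ack_cm_event(event);
                return -1;
        }
        printf("joined %s via %s\n", mcast_ip, local_ip);
        rdma_ack_cm_event(event);
        return 0;
}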

PROXYARP for regular IP / IPoIB traffic
--------------------------------------
It is possible to do proxyarp on both sides of a Linux system that is
connected to both Infiniband and ROCE. The two networks can thus already be
seen as a single IP subnet from the kernel stack's perspective, and
communication is not a problem for non-RDMA traffic.

PROXYARP means that the MAC address of the bridge is used on the Ethernet
side for all IPoIB addresses on the IB side of the bridge. Similarly, the
GID of the bridge is used in all IPoIB packets on the IB side of the bridge
that come from the ROCE side.

The kernel already removes and adds IPoIB and IP headers as needed. So
this works for regular IP traffic but not for native IB / RDMA packets.
IP traffic is only used for the non-performance-critical aspects of
our application, so performance at this level is not a concern.

Each host in the bridged IP subnet has three addresses: an IPv4
address, a MAC address and a GID.


RDMA Packets (Native IB and ROCE)
=================================
ROCE v2 packets are basically IB packets with another set of headers on
top, so the simplistic, idealized version of how this is going to work is
to strip or add the UDP ROCE v2 headers around the IB packet.
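
To make that concrete, here is a rough, untested sketch of the conversion
for an incoming ROCE v2 UD packet, assuming we see the packet from its
Ethernet header onward. The BTH and payload pass through unchanged, only
the encapsulation changes; the ICRC has to be recomputed since it covers
different pseudo-header fields on the two fabrics, and the LRH is left to
the sending side:

/* Conceptual sketch of a ROCE v2 -> IB conversion for a UD packet.
 * Offsets assume an untagged Ethernet frame with a plain IPv4 header;
 * ICRC recomputation and the LRH are only indicated, not implemented.
 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define ETH_HDR_LEN  14
#define IP4_HDR_LEN  20          /* no IP options assumed */
#define UDP_HDR_LEN   8
#define GRH_LEN      40
#define ICRC_LEN      4

/* in:  full ROCE v2 frame (Ethernet + IPv4 + UDP + BTH + payload + ICRC)
 * out: GRH + BTH + payload, to be handed to the IB send path
 */
static int roce2_to_ib(const uint8_t *in, size_t in_len,
                       uint8_t *out, size_t out_len,
                       const uint8_t sgid[16], const uint8_t dgid[16])
{
        size_t off = ETH_HDR_LEN + IP4_HDR_LEN + UDP_HDR_LEN;
        size_t ib_len;

        if (in_len < off + ICRC_LEN)
                return -1;
        ib_len = in_len - off - ICRC_LEN;        /* BTH + payload */
        if (out_len < GRH_LEN + ib_len)
                return -1;

        /* Build a GRH; the GIDs come from our IP <-> GID mapping. */
        memset(out, 0, GRH_LEN);
        out[0] = 0x60;                           /* IP version 6 */
        out[6] = 0x1b;                           /* next header: IBA (BTH) */
        memcpy(out + 8,  sgid, 16);
        memcpy(out + 24, dgid, 16);

        /* BTH and payload pass through unchanged. */
        memcpy(out + GRH_LEN, in + off, ib_len);

        /* TODO: fill in the GRH payload length, recompute the ICRC for
         * the IB format and have the HCA (or our code) prepend the LRH. */
        return (int)(GRH_LEN + ib_len);
}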

ROCE packets
------------
UDP ROCE packets sent to the IP addresses on the ROCE side carry the MAC
address of the bridge. These packets already contain the IP address of the
other side, which can be used to look up the GID in order to convert the
packet and forward it to the Infiniband node.
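
The state needed for that lookup could be as simple as the table below.
This is purely illustrative; a real bridge would populate it from
configuration or from the IPoIB neighbour/ARP state:

#include <stdint.h>
#include <stddef.h>

struct host_map {
        uint32_t ipv4;           /* network byte order */
        uint8_t  gid[16];
        uint16_t lid;
};

static struct host_map hosts[256];
static unsigned int nr_hosts;

static const struct host_map *lookup_by_ip(uint32_t ipv4)
{
        for (unsigned int i = 0; i < nr_hosts; i++)
                if (hosts[i].ipv4 == ipv4)
                        return &hosts[i];
        return NULL;             /* unknown host: drop or resolve */
}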

UD packets
----------
Routing capabilities are limited on the Infiniband side but one could
construct a way to map GIDs for the hosts on the ROCE side to the LID
of the bridge by using the ACM daemon.

There will be complications with regard to RDMA_CM support and the details
of mapping characteristics between packets, but hopefully this will
be fairly manageable.

Multicast packets
-----------------

Multicast packets can be converted easily since a direct mapping is
possible between the MAC address used for a multicast group and the MGID
in the Infiniband fabric. Otherwise this process is similar to UD/UDP
traffic.
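
For IPv4 multicast both mappings are well defined: on Ethernet the group
maps to 01:00:5e plus the low 23 bits of the group address, and on IPoIB
the MGID follows RFC 4391 (signature 0x401b, plus the P_Key and scope of
the partition we bridge into). A sketch of deriving both from the group
address taken from the packet's IP header:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

static void ipv4_group_to_eth_mac(uint32_t group_be, uint8_t mac[6])
{
        uint32_t g = ntohl(group_be);

        mac[0] = 0x01; mac[1] = 0x00; mac[2] = 0x5e;
        mac[3] = (g >> 16) & 0x7f;     /* only the low 23 bits survive */
        mac[4] = (g >> 8) & 0xff;
        mac[5] = g & 0xff;
}

static void ipv4_group_to_mgid(uint32_t group_be, uint16_t pkey,
                               uint8_t scope, uint8_t mgid[16])
{
        memset(mgid, 0, 16);
        mgid[0] = 0xff;
        mgid[1] = 0x10 | (scope & 0x0f);  /* e.g. 2 for link-local */
        mgid[2] = 0x40;                   /* IPoIB IPv4 signature 0x401b */
        mgid[3] = 0x1b;
        mgid[4] = pkey >> 8;
        mgid[5] = pkey & 0xff;
        memcpy(mgid + 12, &group_be, 4);  /* all 32 bits of the group */
}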


Implementation
==============

There are basically three ways to implement this:

A) As an add-on to the RDMA stack in the Linux kernel
B) As a user space process that operates on raw sockets and receives traffic
   filtered by the NIC or by the kernel to process. It would use the same
   raw sockets to send these packets to the other side.
C) In the firmware / logic of the NIC. This is out of reach for us here,
   I think.


Inbound from ROCE
-----------------
This can be done as in the RXE driver: simply listening on the UDP port
for ROCE will get us the traffic we need to operate on.
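
ROCE v2 uses the well-known UDP destination port 4791, so for variant B
the receive side could be as small as the sketch below (assuming nothing
else on the host owns the port and the frames actually reach the IP stack
rather than being consumed by the NIC):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define ROCE_V2_UDP_PORT 4791

static int roce_rx_loop(void)
{
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port   = htons(ROCE_V2_UDP_PORT),
                .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
        };
        uint8_t buf[9216];
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                return -1;

        for (;;) {
                struct sockaddr_in src;
                socklen_t slen = sizeof(src);
                ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                                     (struct sockaddr *)&src, &slen);

                if (n < 12)     /* shorter than a BTH: ignore */
                        continue;
                /* buf[] starts with the BTH; src tells us the sender's IP.
                 * Hand both to the conversion/forwarding code. */
                printf("got %zd bytes of ROCE v2 payload\n", n);
        }
        close(fd);              /* not reached */
        return 0;
}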

Outbound to ROCE
----------------
An Ethernet raw socket will allow us to create arbitrary datagrams as
needed. This has been done widely in numerous settings, and open source
code that does this kind of thing already exists, so it is fairly
straightforward.
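
A sketch of that sending side with an AF_PACKET socket; "eth0" is a
placeholder for the bridge's Ethernet interface, CAP_NET_RAW is required,
and the frame must already carry the complete Ethernet/IP/UDP/ROCE headers
including a recomputed ICRC:

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int roce_tx(const void *frame, size_t len)
{
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        struct sockaddr_ll sll = {
                .sll_family  = AF_PACKET,
                .sll_ifindex = if_nametoindex("eth0"),
                .sll_halen   = ETH_ALEN,
        };
        ssize_t sent;

        if (fd < 0 || !sll.sll_ifindex)
                return -1;
        /* The destination MAC is already the first 6 bytes of the frame,
         * but sockaddr_ll wants it as well. */
        memcpy(sll.sll_addr, frame, ETH_ALEN);

        sent = sendto(fd, frame, len, 0, (struct sockaddr *)&sll, sizeof(sll));
        close(fd);
        return sent == (ssize_t)len ? 0 : -1;
}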

Inbound from Infiniband
-----------------------
The challenge here is to isolate the traffic that is destined for the
bridge itself from the traffic that needs to be forwarded. One way would be
to force the inclusion of a Global Route Header (GRH) in each packet so
that the GID can be matched. When the GID does not match the bridge's, the
traffic would be handed to the bridging code, which could then do the
necessary packet conversion.
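
Assuming the traffic can be steered to a UD QP at all, the classification
itself is simple: the completion flags say whether a GRH is present and the
HCA places the GRH in the first 40 bytes of the receive buffer, so only the
destination GID needs to be compared. Sketch below; how to get the fabric
to deliver other hosts' traffic to this QP in the first place is exactly
the open question:

#include <infiniband/verbs.h>
#include <string.h>

/* Returns 1 if the received UD datagram should be forwarded to the ROCE
 * side, 0 if it is addressed to the bridge itself (or has no GRH).
 */
static int needs_forwarding(const struct ibv_wc *wc, const uint8_t *rx_buf,
                            const union ibv_gid *bridge_gid)
{
        const struct ibv_grh *grh;

        if (!(wc->wc_flags & IBV_WC_GRH))
                return 0;
        grh = (const struct ibv_grh *)rx_buf;
        return memcmp(&grh->dgid, bridge_gid, sizeof(*bridge_gid)) != 0;
}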

I am not aware of anything like this having been done before. It
potentially means working with flow steering and dealing with firmware
issues in the NIC.

Outbound to Infiniband
----------------------
I saw in a recent changelog for the Mellanox NICs that the ability to send
raw IB datagrams has been added. If that can be used to construct a packet
that appears to come from one of the GIDs associated with the ROCE IP
addresses, then this will work.

Otherwise we need to have some way to set the GID for outbound packets
to make this work.

The logic needed on Infiniband is similar to that required for an
Infiniband router.




The biggest risk here seems to be the Infiniband side of things. Is there
a way to create a filter for the traffic we need?

Any tips and suggestions on how to approach this problem would be appreciated.


Christoph Lameter, 12. November 2021



* Re: [RFC] RDMA bridge for ROCE and Infiniband
From: Christoph Lameter @ 2021-11-12 13:50 UTC (permalink / raw)
  To: linux-rdma; +Cc: Leon Romanovsky, Jason Gunthorpe, rpearson, Doug Ledford

Sorry for the screwed up email address.


* Re: [RFC] RDMA bridge for ROCE and Infiniband
From: Leon Romanovsky @ 2021-11-14  9:09 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-rdma, Jason Gunthorpe, rpearson, Doug Ledford

On Fri, Nov 12, 2021 at 01:04:13PM +0100, Christoph Lameter wrote:
> Any tips and suggestions on how to approach this problem would be appreciated.

Mellanox has the Skyway product, which is an IB to ETH gateway.
https://www.nvidia.com/en-us/networking/infiniband/skyway/

I imagine that it can be extended to perform IB to RoCE too,
because it uses steering to perform IB to ETH translation.

Thanks


* Re: [RFC] RDMA bridge for ROCE and Infiniband
From: Christoph Lameter @ 2021-11-16 11:44 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: linux-rdma, Jason Gunthorpe, rpearson, Doug Ledford

On Sun, 14 Nov 2021, Leon Romanovsky wrote:

> > Any tips and suggestions on how to approach this problem would be appreciated.
>
> Mellanox has Skyway product, which is IB to ETH gateway.
> https://www.nvidia.com/en-us/networking/infiniband/skyway/
>
> I imagine that it can be extended to perform IB to RoCE too,
> because it uses steering to perform IB to ETH translation.

The Mellanox gateways (4036E, 6036G and Skyway) are only for IP traffic
and do not support ROCE or native IB across the bridge. The newest of them,
Skyway, does not even support multicast.

And these are appliances, so they cannot be modified. If Mellanox were to
commit to doing this then great, but I guess that would require a
high-level decision, and if a product comes out of it at all it will take a
couple of years.

