> On 3/4/2015 11:41 AM, Weiny, Ira wrote:
> >>
> >>> InfiniBand  InfiniBand  InfiniBand Verbs
> >>> iWARP       InfiniBand  iWARP Verbs (subset of IBV, with
> >>>                         specific connection establishment
> >>>                         requirements that don't exist with IBV)
> >>> RoCE        Ethernet    InfiniBand Verbs (but with different
> >>>                         addressing because of the different
> >>>                         link layer)
> >>> OPA         OPA         InfiniBand Verbs
> >>
> >> Verbs is an interface definition to hardware that has been twisted to
> >> be a software API and extended to expose vendor-specific implementation
> >> 'features' as extensions. It is not a transport.
> >>
> >> The device capability bits seem to have evolved to mean: vendor A
> >> implemented some random 'feature' in their hardware and wants all
> >> applications to now check for this 'feature' and change their code to use it.
> >> Basically, what gets defined as a device cap is rather arbitrary.
> >>
> >
> > This was the point I was trying to make and the reason the OPA MAD support
> > was implemented as a device capability.
>
> The proposed device capability stands for a change that is way more drastic
> than just a vendor extension. This is a device running a completely different
> wire protocol which does not interoperate with the IB device that it is
> impersonating.
>
> Also, it does not come under the same jurisdiction as IB. It is entirely possible
> that IBTA could make some change in the future where OPA can no longer
> masquerade as IB.
>
> OPA must also be identifiable by any verbs

I disagree. Software applications running IP don't need to know whether
they are running over IB or Ethernet. Why would this have to be true for
Verbs?

perftest, libibverbs (example apps), mvapich (verbs), openmpi (verbs),
ibacm (librdmacm based applications), srp, and ipoib all run without any
knowledge of the link being OPA. While not an exhaustive list of verbs
applications, this is a pretty good sampling. We have specifically
designed OPA to support these applications without modifications.

> or management application.

Agreed, but _only_ when talking to the hardware. Other MAD interfaces are
IB compatible. The idea of the original series was to check the device
capability bits (in kernel and in userspace) for SM diagnostic and
diagnostic tool support.

As currently submitted I would need to add a kernel ABI to get the
extended capability bits (because the originals were exhausted). I have
been waiting for this discussion to settle before going through that
effort.

> To do
> that, it should be properly represented in node type, transport type, and link
> layer as all the other RDMA technologies have been regardless of the changes
> needed. All the other RDMA technologies have gone through this process.
>

As I suggested above: would setting the Link Layer to a new value (i.e.
IB_LINK_LAYER_OMNI_PATH_ARCH) while maintaining the Transport as
InfiniBand Verbs be satisfactory?

While this breaks at least some of the examples I list above, I believe I
have worked out a way to phase in this support through our provider
library.

To address your comments from the other fork of this thread:

> > The 32 bit LIDs in the SMP are designed for future expansion. Currently OPA
> > does not support > 16 bit LIDs.
>
> It's not just verbs (structures and APIs) but also CM and any other place where
> LID is used which are numerous.
>
> Is a new verbs coming for this ?

I can't guarantee that no changes will be required in the future. But
right now the answer is "No" because we specifically implemented OPA to
be "InfiniBand Verbs".

> How will compatibility be dealt with ?

Compatibility is provided by presenting identical InfiniBand Verbs
interfaces (kernel ABI, SA protocols, etc.) to the user.

> This is also another example why a more complete picture of OPA is needed.
>
> -- Hal

At this moment OPA provides ULP compatibility for all IB Verbs
applications with the exception of management software which talks
directly to the SMA and PMA.
All ULP interactions with management (Path Record, CM, Multicast joins,
etc) are still IB formatted and compatible. We have specifically designed
this interoperability with OFA.

To help illustrate my point I have included 2 patches below. The first
defines a new Link Layer; IB_LINK_LAYER_OMNI_PATH_ARCH and modifies all
the kernel interfaces which look at link layer. This has been minimally
tested with IPoIB and the perftest tools (the modification of which is
included as the 2nd patch).

The entire patch boils down to changing "if (InfiniBand)" to
"if (InfiniBand || OPA)". IMO this is rather inefficient but if this is
more acceptable to the community I am willing to investigate this further.

Ira


From 9f09be92576204b3ead71f714fc231110f03bff6 Mon Sep 17 00:00:00 2001
From: Ira Weiny
Date: Wed, 3 Dec 2014 20:01:09 -0500
Subject: [PATCH] WIP: IB/core: Add IB_LINK_LAYER_OMNI_PATH_ARCH

This OPA Link Layer is 100% compatible with InfiniBand Verbs software.
---
 drivers/infiniband/core/agent.c           |  4 +++-
 drivers/infiniband/core/cma.c             | 39 +++++++++++++++++++++++--------
 drivers/infiniband/core/mad.c             |  4 +++-
 drivers/infiniband/core/multicast.c       | 16 ++++++++-----
 drivers/infiniband/core/sa_query.c        | 22 +++++++++++------
 drivers/infiniband/core/sysfs.c           |  2 ++
 drivers/infiniband/core/ucma.c            |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  3 ++-
 include/rdma/ib_verbs.h                   |  1 +
 9 files changed, 66 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
index f6d2961..0b1e7ee 100644
--- a/drivers/infiniband/core/agent.c
+++ b/drivers/infiniband/core/agent.c
@@ -147,6 +147,7 @@ int ib_agent_port_open(struct ib_device *device, int port_num)
 	struct ib_agent_port_private *port_priv;
 	unsigned long flags;
 	int ret;
+	enum rdma_link_layer ll;
 
 	/* Create new device info */
 	port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL);
@@ -156,7 +157,8 @@ int ib_agent_port_open(struct ib_device *device, int port_num)
 		goto error1;
 	}
 
-	if (rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_INFINIBAND) {
+	ll = rdma_port_get_link_layer(device, port_num);
+	if (ll == IB_LINK_LAYER_INFINIBAND || ll == IB_LINK_LAYER_OMNI_PATH_ARCH) {
 		/* Obtain send only MAD agent for SMI QP */
 		port_priv->agent[0] = ib_register_mad_agent(device, port_num,
 							    IB_QPT_SMI, NULL, 0,
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d570030..d16586c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -349,6 +349,15 @@ static int cma_translate_addr(struct sockaddr *addr, struct rdma_dev_addr *dev_a
 	return ret;
 }
 
+static inline int ll_matches_dev_type(enum rdma_link_layer ll,
+				      unsigned short dev_type)
+{
+	return ((dev_type == ARPHRD_INFINIBAND &&
+		 (ll == IB_LINK_LAYER_INFINIBAND || ll == IB_LINK_LAYER_OMNI_PATH_ARCH))
+		||
+		(dev_type != ARPHRD_INFINIBAND && ll == IB_LINK_LAYER_ETHERNET));
+}
+
 static int cma_acquire_dev(struct rdma_id_private *id_priv,
 			   struct rdma_id_private *listen_id_priv)
 {
@@ -357,11 +366,9 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv,
 	union ib_gid gid, iboe_gid;
 	int ret = -ENODEV;
 	u8 port, found_port;
-	enum rdma_link_layer dev_ll = dev_addr->dev_type == ARPHRD_INFINIBAND ?
-		IB_LINK_LAYER_INFINIBAND : IB_LINK_LAYER_ETHERNET;
+	unsigned short dev_type = dev_addr->dev_type;
 
-	if (dev_ll != IB_LINK_LAYER_INFINIBAND &&
-	    id_priv->id.ps == RDMA_PS_IPOIB)
+	if (dev_type != ARPHRD_INFINIBAND && id_priv->id.ps == RDMA_PS_IPOIB)
 		return -EINVAL;
 
 	mutex_lock(&lock);
@@ -370,9 +377,11 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv,
 	memcpy(&gid, dev_addr->src_dev_addr +
 	       rdma_addr_gid_offset(dev_addr), sizeof gid);
+
 	if (listen_id_priv &&
-	    rdma_port_get_link_layer(listen_id_priv->id.device,
-				     listen_id_priv->id.port_num) == dev_ll) {
+	    ll_matches_dev_type(rdma_port_get_link_layer(listen_id_priv->id.device,
+						listen_id_priv->id.port_num),
+				dev_type)) {
 		cma_dev = listen_id_priv->cma_dev;
 		port = listen_id_priv->id.port_num;
 		if (rdma_node_get_transport(cma_dev->device->node_type) == RDMA_TRANSPORT_IB &&
@@ -394,7 +403,8 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv,
 			    listen_id_priv->cma_dev == cma_dev &&
 			    listen_id_priv->id.port_num == port)
 				continue;
-			if (rdma_port_get_link_layer(cma_dev->device, port) == dev_ll) {
+			if (ll_matches_dev_type(rdma_port_get_link_layer(cma_dev->device, port),
+						dev_type)) {
 				if (rdma_node_get_transport(cma_dev->device->node_type) == RDMA_TRANSPORT_IB &&
 				    rdma_port_get_link_layer(cma_dev->device, port) == IB_LINK_LAYER_ETHERNET)
 					ret = ib_find_cached_gid(cma_dev->device, &iboe_gid, &found_port, NULL);
@@ -699,9 +709,11 @@ static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv,
 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
 	int ret;
 	u16 pkey;
+	enum rdma_link_layer ll;
+
+	ll = rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num);
 
-	if (rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num) ==
-	    IB_LINK_LAYER_INFINIBAND)
+	if (ll == IB_LINK_LAYER_INFINIBAND || ll == IB_LINK_LAYER_OMNI_PATH_ARCH)
 		pkey = ib_addr_get_pkey(dev_addr);
 	else
 		pkey = 0xffff;
@@ -930,6 +942,7 @@ static void cma_cancel_route(struct rdma_id_private *id_priv)
 {
 	switch (rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num)) {
 	case IB_LINK_LAYER_INFINIBAND:
+	case IB_LINK_LAYER_OMNI_PATH_ARCH:
 		if (id_priv->query)
 			ib_sa_cancel_query(id_priv->query_id, id_priv->query);
 		break;
@@ -1008,6 +1021,7 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
 		list_del(&mc->list);
 		switch (rdma_port_get_link_layer(id_priv->cma_dev->device, id_priv->id.port_num)) {
 		case IB_LINK_LAYER_INFINIBAND:
+		case IB_LINK_LAYER_OMNI_PATH_ARCH:
 			ib_sa_free_multicast(mc->multicast.ib);
 			kfree(mc);
 			break;
@@ -1971,6 +1985,7 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 	case RDMA_TRANSPORT_IB:
 		switch (rdma_port_get_link_layer(id->device, id->port_num)) {
 		case IB_LINK_LAYER_INFINIBAND:
+		case IB_LINK_LAYER_OMNI_PATH_ARCH:
 			ret = cma_resolve_ib_route(id_priv, timeout_ms);
 			break;
 		case IB_LINK_LAYER_ETHERNET:
@@ -2023,6 +2038,7 @@ static int cma_bind_loopback(struct rdma_id_private *id_priv)
 	u16 pkey;
 	int ret;
 	u8 p;
+	enum rdma_link_layer ll;
 
 	cma_dev = NULL;
 	mutex_lock(&lock);
@@ -2059,8 +2075,9 @@ port_found:
 	if (ret)
 		goto out;
 
+	ll = rdma_port_get_link_layer(cma_dev->device, p);
 	id_priv->id.route.addr.dev_addr.dev_type =
-		(rdma_port_get_link_layer(cma_dev->device, p) == IB_LINK_LAYER_INFINIBAND) ?
+		(ll == IB_LINK_LAYER_INFINIBAND || ll == IB_LINK_LAYER_OMNI_PATH_ARCH) ?
 		ARPHRD_INFINIBAND : ARPHRD_ETHER;
 
 	rdma_addr_set_sgid(&id_priv->id.route.addr.dev_addr, &gid);
@@ -3364,6 +3381,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
 	case RDMA_TRANSPORT_IB:
 		switch (rdma_port_get_link_layer(id->device, id->port_num)) {
 		case IB_LINK_LAYER_INFINIBAND:
+		case IB_LINK_LAYER_OMNI_PATH_ARCH:
 			ret = cma_join_ib_multicast(id_priv, mc);
 			break;
 		case IB_LINK_LAYER_ETHERNET:
@@ -3408,6 +3426,7 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
 				if (rdma_node_get_transport(id_priv->cma_dev->device->node_type) == RDMA_TRANSPORT_IB) {
 					switch (rdma_port_get_link_layer(id->device, id->port_num)) {
 					case IB_LINK_LAYER_INFINIBAND:
+					case IB_LINK_LAYER_OMNI_PATH_ARCH:
 						ib_sa_free_multicast(mc->multicast.ib);
 						kfree(mc);
 						break;
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 74c30f4..4100312 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2922,6 +2922,7 @@ static int ib_mad_port_open(struct ib_device *device,
 	unsigned long flags;
 	char name[sizeof "ib_mad123"];
 	int has_smi;
+	enum rdma_link_layer ll;
 
 	/* Create new device info */
 	port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL);
@@ -2938,7 +2939,8 @@ static int ib_mad_port_open(struct ib_device *device,
 	init_mad_qp(port_priv, &port_priv->qp_info[1]);
 
 	cq_size = mad_sendq_size + mad_recvq_size;
-	has_smi = rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_INFINIBAND;
+	ll = rdma_port_get_link_layer(device, port_num);
+	has_smi = (ll == IB_LINK_LAYER_INFINIBAND || ll == IB_LINK_LAYER_OMNI_PATH_ARCH);
 	if (has_smi)
 		cq_size *= 2;
 
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index fa17b55..5bce1a58 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -778,10 +778,12 @@ static void mcast_event_handler(struct ib_event_handler *handler,
 {
 	struct mcast_device *dev;
 	int index;
+	enum rdma_link_layer ll;
 
 	dev = container_of(handler, struct mcast_device, event_handler);
-	if (rdma_port_get_link_layer(dev->device, event->element.port_num) !=
-	    IB_LINK_LAYER_INFINIBAND)
+
+	ll = rdma_port_get_link_layer(dev->device, event->element.port_num);
+	if (ll != IB_LINK_LAYER_INFINIBAND && ll != IB_LINK_LAYER_OMNI_PATH_ARCH)
 		return;
 
 	index = event->element.port_num - dev->start_port;
@@ -824,8 +826,9 @@ static void mcast_add_one(struct ib_device *device)
 	}
 
 	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
-		if (rdma_port_get_link_layer(device, dev->start_port + i) !=
-		    IB_LINK_LAYER_INFINIBAND)
+		enum rdma_link_layer ll = rdma_port_get_link_layer(device,
+							dev->start_port + i);
+		if (ll != IB_LINK_LAYER_INFINIBAND && ll != IB_LINK_LAYER_OMNI_PATH_ARCH)
 			continue;
 		port = &dev->port[i];
 		port->dev = dev;
@@ -863,8 +866,9 @@ static void mcast_remove_one(struct ib_device *device)
 	flush_workqueue(mcast_wq);
 
 	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
-		if (rdma_port_get_link_layer(device, dev->start_port + i) ==
-		    IB_LINK_LAYER_INFINIBAND) {
+		enum rdma_link_layer ll = rdma_port_get_link_layer(device,
+							dev->start_port + i);
+		if (ll == IB_LINK_LAYER_INFINIBAND || ll == IB_LINK_LAYER_OMNI_PATH_ARCH) {
 			port = &dev->port[i];
 			deref_port(port);
 			wait_for_completion(&port->comp);
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index c38f030..90eda12 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -450,7 +450,8 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event
 		struct ib_sa_port *port =
 			&sa_dev->port[event->element.port_num - sa_dev->start_port];
 
-		if (rdma_port_get_link_layer(handler->device, port->port_num) != IB_LINK_LAYER_INFINIBAND)
+		enum rdma_link_layer ll = rdma_port_get_link_layer(handler->device, port->port_num);
+		if (ll != IB_LINK_LAYER_INFINIBAND && ll != IB_LINK_LAYER_OMNI_PATH_ARCH)
 			return;
 
 		spin_lock_irqsave(&port->ah_lock, flags);
@@ -1153,6 +1154,7 @@ static void ib_sa_add_one(struct ib_device *device)
 {
 	struct ib_sa_device *sa_dev;
 	int s, e, i;
+	enum rdma_link_layer ll;
 
 	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
 		return;
@@ -1175,7 +1177,8 @@ static void ib_sa_add_one(struct ib_device *device)
 
 	for (i = 0; i <= e - s; ++i) {
 		spin_lock_init(&sa_dev->port[i].ah_lock);
-		if (rdma_port_get_link_layer(device, i + 1) != IB_LINK_LAYER_INFINIBAND)
+		ll = rdma_port_get_link_layer(device, i + 1);
+		if (ll != IB_LINK_LAYER_INFINIBAND && ll != IB_LINK_LAYER_OMNI_PATH_ARCH)
 			continue;
 
 		sa_dev->port[i].sm_ah = NULL;
@@ -1204,16 +1207,20 @@ static void ib_sa_add_one(struct ib_device *device)
 	if (ib_register_event_handler(&sa_dev->event_handler))
 		goto err;
 
-	for (i = 0; i <= e - s; ++i)
-		if (rdma_port_get_link_layer(device, i + 1) == IB_LINK_LAYER_INFINIBAND)
+	for (i = 0; i <= e - s; ++i) {
+		ll = rdma_port_get_link_layer(device, i + 1);
+		if (ll == IB_LINK_LAYER_INFINIBAND || ll == IB_LINK_LAYER_OMNI_PATH_ARCH)
 			update_sm_ah(&sa_dev->port[i].update_task);
+	}
 
 	return;
 
 err:
-	while (--i >= 0)
-		if (rdma_port_get_link_layer(device, i + 1) == IB_LINK_LAYER_INFINIBAND)
+	while (--i >= 0) {
+		ll = rdma_port_get_link_layer(device, i + 1);
+		if (ll == IB_LINK_LAYER_INFINIBAND || ll == IB_LINK_LAYER_OMNI_PATH_ARCH)
 			ib_unregister_mad_agent(sa_dev->port[i].agent);
+	}
 
 	kfree(sa_dev);
 
@@ -1233,7 +1240,8 @@ static void ib_sa_remove_one(struct ib_device *device)
 	flush_workqueue(ib_wq);
 
 	for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) {
-		if (rdma_port_get_link_layer(device, i + 1) == IB_LINK_LAYER_INFINIBAND) {
+		enum rdma_link_layer ll = rdma_port_get_link_layer(device, i + 1);
+		if (ll == IB_LINK_LAYER_INFINIBAND || ll == IB_LINK_LAYER_OMNI_PATH_ARCH) {
 			ib_unregister_mad_agent(sa_dev->port[i].agent);
 			if (sa_dev->port[i].sm_ah)
 				kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index cbd0383..66b01f4 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -253,6 +253,8 @@ static ssize_t link_layer_show(struct ib_port *p, struct port_attribute *unused,
 		return sprintf(buf, "%s\n", "InfiniBand");
 	case IB_LINK_LAYER_ETHERNET:
 		return sprintf(buf, "%s\n", "Ethernet");
+	case IB_LINK_LAYER_OMNI_PATH_ARCH:
+		return sprintf(buf, "%s\n", "OmniPathArch");
 	default:
 		return sprintf(buf, "%s\n", "Unknown");
 	}
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 45d67e9..502e2e8 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -727,6 +727,7 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 	switch (rdma_port_get_link_layer(ctx->cm_id->device,
 			ctx->cm_id->port_num)) {
 	case IB_LINK_LAYER_INFINIBAND:
+	case IB_LINK_LAYER_OMNI_PATH_ARCH:
 		ucma_copy_ib_route(&resp, &ctx->cm_id->route);
 		break;
 	case IB_LINK_LAYER_ETHERNET:
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 58b5aa3..5c51866 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1673,7 +1673,8 @@ static void ipoib_add_one(struct ib_device *device)
 	}
 
 	for (p = s; p <= e; ++p) {
-		if (rdma_port_get_link_layer(device, p) != IB_LINK_LAYER_INFINIBAND)
+		enum rdma_link_layer ll = rdma_port_get_link_layer(device, p);
+		if (ll != IB_LINK_LAYER_INFINIBAND && ll != IB_LINK_LAYER_OMNI_PATH_ARCH)
 			continue;
 		dev = ipoib_add_port("ib%d", device, p);
 		if (!IS_ERR(dev)) {
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..6a15088 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -88,6 +88,7 @@ enum rdma_link_layer {
 	IB_LINK_LAYER_UNSPECIFIED,
 	IB_LINK_LAYER_INFINIBAND,
 	IB_LINK_LAYER_ETHERNET,
+	IB_LINK_LAYER_OMNI_PATH_ARCH,
 };
 
 enum ib_device_cap_flags {
-- 
1.8.2


From 28c5c7e44b87c4e3e29634fd378da4871401cbcd Mon Sep 17 00:00:00 2001
From: Ira Weiny
Date: Wed, 11 Mar 2015 02:28:14 -0400
Subject: [PATCH] perftest: WIP add OPA Link Layer support

---
 src/perftest_parameters.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/perftest_parameters.c b/src/perftest_parameters.c
index fc4088a..928ba96 100755
--- a/src/perftest_parameters.c
+++ b/src/perftest_parameters.c
@@ -985,6 +985,7 @@ const char *transport_str(enum ibv_transport_type type)
 /******************************************************************************
  *
  ******************************************************************************/
+#define IBV_LINK_LAYER_OPA (IBV_LINK_LAYER_ETHERNET+1)
 const char *link_layer_str(uint8_t link_layer)
 {
 	switch (link_layer) {
@@ -994,6 +995,8 @@ const char *link_layer_str(uint8_t link_layer)
 			return "IB";
 		case IBV_LINK_LAYER_ETHERNET:
 			return "Ethernet";
+		case IBV_LINK_LAYER_OPA:
+			return "OPA";
 		#ifdef HAVE_SCIF
 		case IBV_LINK_LAYER_SCIF:
 			return "SCIF";
-- 
1.8.2
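
As a side note on the "if (InfiniBand || OPA)" pattern that the kernel patch repeats at every call site: one way it might be condensed is a single predicate. The following is only a sketch under stated assumptions. The helper name ll_is_ib_compatible() is hypothetical and does not appear in either patch, and the enum is copied locally (with the value the first patch adds) so the fragment compiles outside the kernel tree.

```c
#include <stdbool.h>

/* Local copy of the enum as extended by the first patch, so this sketch
 * builds on its own; in the kernel it lives in include/rdma/ib_verbs.h. */
enum rdma_link_layer {
	IB_LINK_LAYER_UNSPECIFIED,
	IB_LINK_LAYER_INFINIBAND,
	IB_LINK_LAYER_ETHERNET,
	IB_LINK_LAYER_OMNI_PATH_ARCH,
};

/* Hypothetical helper (not part of the posted patches): true for the link
 * layers that the patch routes through the InfiniBand management paths
 * (SMI, SA, multicast, IPoIB). */
static inline bool ll_is_ib_compatible(enum rdma_link_layer ll)
{
	return ll == IB_LINK_LAYER_INFINIBAND ||
	       ll == IB_LINK_LAYER_OMNI_PATH_ARCH;
}
```

With such a helper, a call site like the one in ib_mad_port_open() would reduce to has_smi = ll_is_ib_compatible(ll). Whether a predicate like this is preferable to a device capability bit is exactly the open question in this thread.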