* [PATCH V4 0/9] IP based RoCE GID Addressing
@ 2013-09-10 14:41 Or Gerlitz
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Or Gerlitz

changes from V3:

  - dropped the uverbs infrastructure patch for extensions, which is now upstream
    as commit 400dbc9 "IB/core: Infrastructure for extensible uverbs commands"

  - added ocrdma patch to handle Ethernet L2 parameters, similar to the mlx4 patch.
   
  - removed the assumption that the low-level driver can provide the source MAC
    and VLAN in the struct ib_wc returned by ib_poll_cq, and adjusted the
    ib_init_ah_from_wc helper of the IB core accordingly.

  - fixed some VLAN-related issues in the mlx4 driver

See below for the full listing of the change history.

Currently, the IB stack (core + drivers) handles RoCE (IBoE) GIDs as
encodings of the related Ethernet net-device interface MAC address and
possibly a VLAN id.

This series changes RoCE GIDs to encode the IP addresses (IPv4 + IPv6)
of that Ethernet interface, under the following reasoning:

1. There are environments where the compute entity that runs the RoCE
stack is not aware that its traffic is vlan-tagged. This results in that
node creating/assuming wrong GIDs from the viewpoint of a peer node which
is aware of VLANs.

Note that "node" here can be physical node connected to Ethernet switch acting in 
access mode talking to another node which does vlan insertion/stripping by itself.

Another example is an SRIOV Virtual Function configured to work in "VST"
(Virtual Switch Tagging) mode, where the hypervisor configures the HW eSwitch
to do VLAN insertion for the vPORT representing that function.

2. RoCE traffic is inspected (mirrored/trapped) in Ethernet switches for
monitoring and security purposes. It is much more natural for both humans and
automated utilities (...) to observe IP addresses at a certain offset into the RoCE
frame's L3 header vs. MACs/VLANs (which are present anyway in the L2 header of that
frame, so they are not lost by this change).

3. Some advanced bonding/teaming modes such as balance-alb and balance-tlb
use multiple underlying devices in parallel, and hence packets always
carry the bond IP address while different streams have different source MACs.
The approach taken by this series is part of what would allow supporting
that for RoCE traffic too.

The 1st patch adds explicit handling of the Ethernet L2 attributes, source/dest
MAC and vlan_id, to the kernel IB core, in its data structures and the CMA/CM code.
Previously, with MAC/VLAN based addressing, they were encoded in the GIDs;
now they have to be resolved and placed separately from the IP based GIDs.

The 2nd patch modifies the CMA to cope with IP based GIDs, the 3rd/4th ones do
the same for the mlx4_ib driver, and the 5th patch does so for the ocrdma driver.

The 6th patch sets the foundation for extending uverbs using the recently
introduced extension scheme, and the 7th/8th patches add two extended uverbs
commands and two extended ucma commands, respectively, which are now exported
to user space. The last patch adds mlx4 support for the extended uverbs
modify QP command.

These extended verbs will allow enhancing user space libraries such that they
work correctly over the modified scheme. RC applications using librdmacm will not
need to be modified at all, since the change will be encapsulated within that library.

Or.

Full listing of the change history:

changes from V3:

  - dropped the uverbs infrastructure patch for extensions, which is now upstream
    as commit 400dbc9 "IB/core: Infrastructure for extensible uverbs commands"

  - added ocrdma patch to handle Ethernet L2 parameters, similar to the mlx4 patch.
   
  - removed the assumption that the low-level driver can provide the source MAC
    and VLAN in the struct ib_wc returned by ib_poll_cq, and adjusted the
    ib_init_ah_from_wc helper of the IB core accordingly.

  - fixed some VLAN-related issues in the mlx4 driver

changes from V2:

  - added handling of IP based GIDs in the ocrdma driver (patch #5);
    as a result, patches #5-8 of V1 became patches #6-9
  
changes from V1:

 - rebased the series against the latest kernel bits, which include Sean's 
   AF_IB changes to the rdma-cm
 
 - fixed a bug in mlx4_ib where the reset of the GID table was done for IB ports too
 
 - fixed build warnings and issues pointed out by sparse

 - introduced patch #1, which does the explicit handling of Ethernet L2 attributes,
   source/dest MAC and vlan_id, in the kernel data structures and CMA/CM code.

 - use smac when modifying a QP --> find smac on the passive side + additional
   fields to address structures

 - added support for the new QP attrs in ib_modify_qp_is_ok(), specific to
   ll = ETH, and modified all low-level drivers to keep working after that change

 -- changes around uverbs:
 - use ah_ext as a pointer in the qp_attr passed from user space, so this
   field can itself be extended in the future
 - for kernel-to-user command responses, comp_mask is moved into the
   right place, which is after the non-extended command response fields
 - fixed a bug in copy_qp_attr_ex under which some fields were copied to
   the wrong locations
 - use the new structure rdma_ucm_init_qp_attr_ex, which is extendable (ucma)

changes from V0:

 - enhanced documentation of the mlx4_ib, uverbs and ucma patches
 - broke the mlx4_ib patch into two
 - broke the extended user space commands patch into two

Matan Barak (4):
  IB/core: Ethernet L2 attributes in verbs/cm structures
  IB/core: Add RoCE IP based addressing extensions for uverbs
  IB/core: Add RoCE IP based addressing extensions for rdma_ucm
  IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX

Moni Shoua (5):
  IB/CMA: RoCE IP based GID addressing
  IB/mlx4: Use RoCE IP based GIDs in the port GID table
  IB/mlx4: Handle Ethernet L2 parameters for IP based GID addressing
  IB/ocrdma: Populate GID table with IP based gids
  IB/ocrdma: Handle Ethernet L2 parameters for IP based GID addressing

 drivers/infiniband/core/addr.c              |   97 ++++++-
 drivers/infiniband/core/cm.c                |   55 +++
 drivers/infiniband/core/cma.c               |   74 ++++-
 drivers/infiniband/core/sa_query.c          |   12 +-
 drivers/infiniband/core/ucma.c              |  193 ++++++++++-
 drivers/infiniband/core/uverbs.h            |    2 +
 drivers/infiniband/core/uverbs_cmd.c        |  359 ++++++++++++++++-----
 drivers/infiniband/core/uverbs_main.c       |    4 +-
 drivers/infiniband/core/uverbs_marshall.c   |  128 +++++++-
 drivers/infiniband/core/verbs.c             |   45 +++-
 drivers/infiniband/hw/ehca/ehca_qp.c        |    2 +-
 drivers/infiniband/hw/ipath/ipath_qp.c      |    2 +-
 drivers/infiniband/hw/mlx4/ah.c             |   40 +--
 drivers/infiniband/hw/mlx4/cq.c             |    9 +
 drivers/infiniband/hw/mlx4/main.c           |  477 +++++++++++++++++++--------
 drivers/infiniband/hw/mlx4/mlx4_ib.h        |    6 +-
 drivers/infiniband/hw/mlx4/qp.c             |  104 +++++--
 drivers/infiniband/hw/mlx5/qp.c             |    3 +-
 drivers/infiniband/hw/mthca/mthca_qp.c      |    3 +-
 drivers/infiniband/hw/ocrdma/ocrdma.h       |   12 +
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c    |    5 +-
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c    |   21 +-
 drivers/infiniband/hw/ocrdma/ocrdma_hw.h    |    1 -
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |  138 +++------
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |    3 +-
 drivers/infiniband/hw/qib/qib_qp.c          |    2 +-
 drivers/net/ethernet/mellanox/mlx4/port.c   |   20 ++
 include/linux/mlx4/cq.h                     |   15 +-
 include/linux/mlx4/device.h                 |    1 +
 include/rdma/ib_addr.h                      |   84 +++--
 include/rdma/ib_cm.h                        |    1 +
 include/rdma/ib_marshall.h                  |   12 +
 include/rdma/ib_pack.h                      |    1 +
 include/rdma/ib_sa.h                        |    3 +
 include/rdma/ib_verbs.h                     |   23 ++-
 include/uapi/rdma/ib_user_sa.h              |   34 ++-
 include/uapi/rdma/ib_user_verbs.h           |  160 +++++++++-
 include/uapi/rdma/rdma_user_cm.h            |   29 ++-
 38 files changed, 1684 insertions(+), 496 deletions(-)


* [PATCH V4 1/9] IB/core: Ethernet L2 attributes in verbs/cm structures
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-10 14:41   ` Or Gerlitz
  2013-09-10 14:41   ` [PATCH V4 2/9] IB/CMA: RoCE IP based GID addressing Or Gerlitz
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch adds support for Ethernet L2 attributes in the
verbs/cm/cma structures.

When dealing with L2 Ethernet, we should use the smac, dmac, vlan ID and priority
in a similar manner to how the IB L2 (and the L4 PKEY) attributes are used.

Thus, those attributes were added to the following structures:

* ib_ah_attr - added dmac
* ib_qp_attr - added smac and vlan_id, (sl remains vlan priority)
* ib_wc - added smac, vlan_id
* ib_sa_path_rec - added smac, dmac, vlan_id
* cm_av - added smac and vlan_id

For the path record structure, extra care was taken to skip the new fields when
packing it into wire format, to avoid breaking the IB CM and SA wire protocol.

On the active side, the CM fills its internal structures from the path provided
by the ULP; the change adds taking the ETH L2 attributes from there and placing
them into the CM address handle (struct cm_av).

On the passive side, the CM fills its internal structures from the WC associated
with the REQ message; the change adds taking the ETH L2 attributes from the WC.

When the HW driver provides the required ETH L2 attributes in the WC, it
sets the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags. The IB core code checks
for the presence of these flags, and in their absence performs address
resolution in the ib_init_ah_from_wc helper function.
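
For illustration only, a minimal sketch of what a HW driver that can parse
the L2 info from its CQEs might do when filling the WC (the cqe_smac and
cqe_vlan_id variables are hypothetical placeholders, not an actual device
layout):

	/* Report the ETH L2 attributes in the WC so that the IB core
	 * can skip address resolution in ib_init_ah_from_wc().
	 */
	memcpy(wc->smac, cqe_smac, ETH_ALEN);
	wc->vlan_id = cqe_vlan_id;	/* 0xffff when untagged */
	wc->wc_flags |= IB_WC_WITH_SMAC | IB_WC_WITH_VLAN;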

ib_modify_qp_is_ok() is also updated to consider the link layer: some parameters
are mandatory for the Ethernet link layer, while they are irrelevant for IB.
Vendor drivers are modified to support the new function signature.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/addr.c              |   97 ++++++++++++++++++++++++++-
 drivers/infiniband/core/cm.c                |   55 +++++++++++++++
 drivers/infiniband/core/cma.c               |   60 +++++++++++++++--
 drivers/infiniband/core/sa_query.c          |   12 +++-
 drivers/infiniband/core/verbs.c             |   45 ++++++++++++-
 drivers/infiniband/hw/ehca/ehca_qp.c        |    2 +-
 drivers/infiniband/hw/ipath/ipath_qp.c      |    2 +-
 drivers/infiniband/hw/mlx4/qp.c             |    9 ++-
 drivers/infiniband/hw/mlx5/qp.c             |    3 +-
 drivers/infiniband/hw/mthca/mthca_qp.c      |    3 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |    3 +-
 drivers/infiniband/hw/qib/qib_qp.c          |    2 +-
 include/linux/mlx4/device.h                 |    1 +
 include/rdma/ib_addr.h                      |   42 +++++++++++-
 include/rdma/ib_cm.h                        |    1 +
 include/rdma/ib_pack.h                      |    1 +
 include/rdma/ib_sa.h                        |    3 +
 include/rdma/ib_verbs.h                     |   23 ++++++-
 18 files changed, 340 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index e90f2b2..8172d37 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -86,6 +86,8 @@ int rdma_addr_size(struct sockaddr *addr)
 }
 EXPORT_SYMBOL(rdma_addr_size);
 
+static struct rdma_addr_client self;
+
 void rdma_addr_register_client(struct rdma_addr_client *client)
 {
 	atomic_set(&client->refcount, 1);
@@ -119,7 +121,8 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
 }
 EXPORT_SYMBOL(rdma_copy_addr);
 
-int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
+		      u16 *vlan_id)
 {
 	struct net_device *dev;
 	int ret = -EADDRNOTAVAIL;
@@ -142,6 +145,8 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
 			return ret;
 
 		ret = rdma_copy_addr(dev_addr, dev, NULL);
+		if (vlan_id)
+			*vlan_id = rdma_vlan_dev_vlan_id(dev);
 		dev_put(dev);
 		break;
 
@@ -153,6 +158,8 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
 					  &((struct sockaddr_in6 *) addr)->sin6_addr,
 					  dev, 1)) {
 				ret = rdma_copy_addr(dev_addr, dev, NULL);
+				if (vlan_id)
+					*vlan_id = rdma_vlan_dev_vlan_id(dev);
 				break;
 			}
 		}
@@ -238,7 +245,7 @@ static int addr4_resolve(struct sockaddr_in *src_in,
 	src_in->sin_addr.s_addr = fl4.saddr;
 
 	if (rt->dst.dev->flags & IFF_LOOPBACK) {
-		ret = rdma_translate_ip((struct sockaddr *) dst_in, addr);
+		ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL);
 		if (!ret)
 			memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
 		goto put;
@@ -286,7 +293,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
 	}
 
 	if (dst->dev->flags & IFF_LOOPBACK) {
-		ret = rdma_translate_ip((struct sockaddr *) dst_in, addr);
+		ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL);
 		if (!ret)
 			memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
 		goto put;
@@ -437,6 +444,88 @@ void rdma_addr_cancel(struct rdma_dev_addr *addr)
 }
 EXPORT_SYMBOL(rdma_addr_cancel);
 
+struct resolve_cb_context {
+	struct rdma_dev_addr *addr;
+	struct completion comp;
+};
+
+static void resolve_cb(int status, struct sockaddr *src_addr,
+	     struct rdma_dev_addr *addr, void *context)
+{
+	memcpy(((struct resolve_cb_context *)context)->addr, addr, sizeof(struct
+				rdma_dev_addr));
+	complete(&((struct resolve_cb_context *)context)->comp);
+}
+
+int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac,
+			       u16 *vlan_id)
+{
+	int ret = 0;
+	struct rdma_dev_addr dev_addr;
+	struct resolve_cb_context ctx;
+	struct net_device *dev;
+
+	union {
+		struct sockaddr     _sockaddr;
+		struct sockaddr_in  _sockaddr_in;
+		struct sockaddr_in6 _sockaddr_in6;
+	} sgid_addr, dgid_addr;
+
+
+	ret = rdma_gid2ip(&sgid_addr._sockaddr, sgid);
+	if (ret)
+		return ret;
+
+	ret = rdma_gid2ip(&dgid_addr._sockaddr, dgid);
+	if (ret)
+		return ret;
+
+	memset(&dev_addr, 0, sizeof(dev_addr));
+
+	ctx.addr = &dev_addr;
+	init_completion(&ctx.comp);
+	ret = rdma_resolve_ip(&self, &sgid_addr._sockaddr, &dgid_addr._sockaddr,
+			&dev_addr, 1000, resolve_cb, &ctx);
+	if (ret)
+		return ret;
+
+	wait_for_completion(&ctx.comp);
+
+	memcpy(dmac, dev_addr.dst_dev_addr, ETH_ALEN);
+	dev = dev_get_by_index(&init_net, dev_addr.bound_dev_if);
+	if (!dev)
+		return -ENODEV;
+	if (vlan_id)
+		*vlan_id = rdma_vlan_dev_vlan_id(dev);
+	dev_put(dev);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_addr_find_dmac_by_grh);
+
+int rdma_addr_find_smac_by_sgid(union ib_gid *sgid, u8 *smac, u16 *vlan_id)
+{
+	int ret = 0;
+	struct rdma_dev_addr dev_addr;
+	union {
+		struct sockaddr     _sockaddr;
+		struct sockaddr_in  _sockaddr_in;
+		struct sockaddr_in6 _sockaddr_in6;
+	} gid_addr;
+
+	ret = rdma_gid2ip(&gid_addr._sockaddr, sgid);
+
+	if (ret)
+		return ret;
+	memset(&dev_addr, 0, sizeof(dev_addr));
+	ret = rdma_translate_ip(&gid_addr._sockaddr, &dev_addr, vlan_id);
+	if (ret)
+		return ret;
+
+	memcpy(smac, dev_addr.src_dev_addr, ETH_ALEN);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_addr_find_smac_by_sgid);
+
 static int netevent_callback(struct notifier_block *self, unsigned long event,
 	void *ctx)
 {
@@ -461,11 +550,13 @@ static int __init addr_init(void)
 		return -ENOMEM;
 
 	register_netevent_notifier(&nb);
+	rdma_addr_register_client(&self);
 	return 0;
 }
 
 static void __exit addr_cleanup(void)
 {
+	rdma_addr_unregister_client(&self);
 	unregister_netevent_notifier(&nb);
 	destroy_workqueue(addr_wq);
 }
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 784b97c..a67b689 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -47,6 +47,7 @@
 #include <linux/sysfs.h>
 #include <linux/workqueue.h>
 #include <linux/kdev_t.h>
+#include <linux/etherdevice.h>
 
 #include <rdma/ib_cache.h>
 #include <rdma/ib_cm.h>
@@ -177,6 +178,8 @@ struct cm_av {
 	struct ib_ah_attr ah_attr;
 	u16 pkey_index;
 	u8 timeout;
+	u8  valid;
+	u8  smac[ETH_ALEN];
 };
 
 struct cm_work {
@@ -346,6 +349,23 @@ static void cm_init_av_for_response(struct cm_port *port, struct ib_wc *wc,
 			   grh, &av->ah_attr);
 }
 
+int ib_update_cm_av(struct ib_cm_id *id, const u8 *smac, const u8 *alt_smac)
+{
+	struct cm_id_private *cm_id_priv;
+
+	cm_id_priv = container_of(id, struct cm_id_private, id);
+
+	if (smac != NULL)
+		memcpy(cm_id_priv->av.smac, smac, sizeof(cm_id_priv->av.smac));
+
+	if (alt_smac != NULL)
+		memcpy(cm_id_priv->alt_av.smac, alt_smac,
+		       sizeof(cm_id_priv->alt_av.smac));
+
+	return 0;
+}
+EXPORT_SYMBOL(ib_update_cm_av);
+
 static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
 {
 	struct cm_device *cm_dev;
@@ -376,6 +396,9 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
 	ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path,
 			     &av->ah_attr);
 	av->timeout = path->packet_life_time + 1;
+	memcpy(av->smac, path->smac, sizeof(av->smac));
+
+	av->valid = 1;
 	return 0;
 }
 
@@ -1557,6 +1580,9 @@ static int cm_req_handler(struct cm_work *work)
 
 	cm_process_routed_req(req_msg, work->mad_recv_wc->wc);
 	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
+
+	memcpy(work->path[0].dmac, cm_id_priv->av.ah_attr.dmac, ETH_ALEN);
+	work->path[0].vlan_id = cm_id_priv->av.ah_attr.vlan_id;
 	ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
 	if (ret) {
 		ib_get_cached_gid(work->port->cm_dev->ib_device,
@@ -3503,6 +3529,35 @@ static int cm_init_qp_rtr_attr(struct cm_id_private *cm_id_priv,
 		*qp_attr_mask = IB_QP_STATE | IB_QP_AV | IB_QP_PATH_MTU |
 				IB_QP_DEST_QPN | IB_QP_RQ_PSN;
 		qp_attr->ah_attr = cm_id_priv->av.ah_attr;
+		if (!cm_id_priv->av.valid)
+			return -EINVAL;
+		if (!is_zero_ether_addr(cm_id_priv->av.ah_attr.dmac))
+			*qp_attr_mask |= IB_QP_DEST_EX;
+		if (cm_id_priv->av.ah_attr.vlan_id != 0xffff) {
+			qp_attr->vlan_id = cm_id_priv->av.ah_attr.vlan_id;
+			*qp_attr_mask |= IB_QP_VID;
+		}
+		if (!is_zero_ether_addr(cm_id_priv->av.smac)) {
+			memcpy(qp_attr->smac, cm_id_priv->av.smac,
+			       sizeof(qp_attr->smac));
+			*qp_attr_mask |= IB_QP_SMAC;
+		}
+		if (cm_id_priv->alt_av.valid) {
+			if (!is_zero_ether_addr(
+					cm_id_priv->alt_av.ah_attr.dmac))
+				*qp_attr_mask |= IB_QP_ALT_DEST_EX;
+			if (cm_id_priv->alt_av.ah_attr.vlan_id != 0xffff) {
+				qp_attr->alt_vlan_id =
+					cm_id_priv->alt_av.ah_attr.vlan_id;
+				*qp_attr_mask |= IB_QP_ALT_VID;
+			}
+			if (!is_zero_ether_addr(cm_id_priv->alt_av.smac)) {
+				memcpy(qp_attr->alt_smac,
+				       cm_id_priv->alt_av.smac,
+				       sizeof(qp_attr->alt_smac));
+				*qp_attr_mask |= IB_QP_ALT_SMAC;
+			}
+		}
 		qp_attr->path_mtu = cm_id_priv->path_mtu;
 		qp_attr->dest_qp_num = be32_to_cpu(cm_id_priv->remote_qpn);
 		qp_attr->rq_psn = be32_to_cpu(cm_id_priv->rq_psn);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 3a2c3c3..27acdec 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -362,7 +362,7 @@ static int cma_translate_addr(struct sockaddr *addr, struct rdma_dev_addr *dev_a
 	int ret;
 
 	if (addr->sa_family != AF_IB) {
-		ret = rdma_translate_ip(addr, dev_addr);
+		ret = rdma_translate_ip(addr, dev_addr, NULL);
 	} else {
 		cma_translate_ib((struct sockaddr_ib *) addr, dev_addr);
 		ret = 0;
@@ -602,6 +602,7 @@ static int cma_modify_qp_rtr(struct rdma_id_private *id_priv,
 {
 	struct ib_qp_attr qp_attr;
 	int qp_attr_mask, ret;
+	union ib_gid sgid;
 
 	mutex_lock(&id_priv->qp_mutex);
 	if (!id_priv->id.qp) {
@@ -624,6 +625,20 @@ static int cma_modify_qp_rtr(struct rdma_id_private *id_priv,
 	if (ret)
 		goto out;
 
+	ret = ib_query_gid(id_priv->id.device, id_priv->id.port_num,
+			   qp_attr.ah_attr.grh.sgid_index, &sgid);
+	if (ret)
+		goto out;
+
+	if (rdma_node_get_transport(id_priv->cma_dev->device->node_type)
+	    == RDMA_TRANSPORT_IB &&
+	    rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num)
+	    == IB_LINK_LAYER_ETHERNET) {
+		ret = rdma_addr_find_smac_by_sgid(&sgid, qp_attr.smac, NULL);
+
+		if (ret)
+			goto out;
+	}
 	if (conn_param)
 		qp_attr.max_dest_rd_atomic = conn_param->responder_resources;
 	ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
@@ -724,6 +739,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
 		else
 			ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, qp_attr,
 						 qp_attr_mask);
+
 		if (qp_attr->qp_state == IB_QPS_RTR)
 			qp_attr->rq_psn = id_priv->seq_num;
 		break;
@@ -1265,6 +1281,15 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 	struct rdma_id_private *listen_id, *conn_id;
 	struct rdma_cm_event event;
 	int offset, ret;
+	u8 smac[ETH_ALEN];
+	u8 alt_smac[ETH_ALEN];
+	u8 *psmac = smac;
+	u8 *palt_smac = alt_smac;
+	int is_iboe = ((rdma_node_get_transport(cm_id->device->node_type) ==
+			RDMA_TRANSPORT_IB) &&
+		       (rdma_port_get_link_layer(cm_id->device,
+			ib_event->param.req_rcvd.port) ==
+			IB_LINK_LAYER_ETHERNET));
 
 	listen_id = cm_id->context;
 	if (!cma_check_req_qp_type(&listen_id->id, ib_event))
@@ -1309,12 +1334,29 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 	if (ret)
 		goto err3;
 
+	if (is_iboe) {
+		if (ib_event->param.req_rcvd.primary_path != NULL)
+			rdma_addr_find_smac_by_sgid(
+				&ib_event->param.req_rcvd.primary_path->sgid,
+				psmac, NULL);
+		else
+			psmac = NULL;
+		if (ib_event->param.req_rcvd.alternate_path != NULL)
+			rdma_addr_find_smac_by_sgid(
+				&ib_event->param.req_rcvd.alternate_path->sgid,
+				palt_smac, NULL);
+		else
+			palt_smac = NULL;
+	}
 	/*
 	 * Acquire mutex to prevent user executing rdma_destroy_id()
 	 * while we're accessing the cm_id.
 	 */
 	mutex_lock(&lock);
-	if (cma_comp(conn_id, RDMA_CM_CONNECT) && (conn_id->id.qp_type != IB_QPT_UD))
+	if (is_iboe)
+		ib_update_cm_av(cm_id, psmac, palt_smac);
+	if (cma_comp(conn_id, RDMA_CM_CONNECT) &&
+	    (conn_id->id.qp_type != IB_QPT_UD))
 		ib_send_cm_mra(cm_id, CMA_CM_MRA_SETTING, NULL, 0);
 	mutex_unlock(&lock);
 	mutex_unlock(&conn_id->handler_mutex);
@@ -1474,7 +1516,7 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id,
 	mutex_lock_nested(&conn_id->handler_mutex, SINGLE_DEPTH_NESTING);
 	conn_id->state = RDMA_CM_CONNECT;
 
-	ret = rdma_translate_ip(laddr, &conn_id->id.route.addr.dev_addr);
+	ret = rdma_translate_ip(laddr, &conn_id->id.route.addr.dev_addr, NULL);
 	if (ret) {
 		mutex_unlock(&conn_id->handler_mutex);
 		rdma_destroy_id(new_cm_id);
@@ -1855,7 +1897,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 	struct cma_work *work;
 	int ret;
 	struct net_device *ndev = NULL;
-	u16 vid;
+
 
 	work = kzalloc(sizeof *work, GFP_KERNEL);
 	if (!work)
@@ -1879,10 +1921,14 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 		goto err2;
 	}
 
-	vid = rdma_vlan_dev_vlan_id(ndev);
+	route->path_rec->vlan_id = rdma_vlan_dev_vlan_id(ndev);
+	memcpy(route->path_rec->dmac, addr->dev_addr.dst_dev_addr, ETH_ALEN);
+	memcpy(route->path_rec->smac, ndev->dev_addr, ndev->addr_len);
 
-	iboe_mac_vlan_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr, vid);
-	iboe_mac_vlan_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr, vid);
+	iboe_mac_vlan_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr,
+			    route->path_rec->vlan_id);
+	iboe_mac_vlan_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr,
+			    route->path_rec->vlan_id);
 
 	route->path_rec->hop_limit = 1;
 	route->path_rec->reversible = 1;
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 9838ca4..f820958 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -42,7 +42,7 @@
 #include <linux/kref.h>
 #include <linux/idr.h>
 #include <linux/workqueue.h>
-
+#include <uapi/linux/if_ether.h>
 #include <rdma/ib_pack.h>
 #include <rdma/ib_cache.h>
 #include "sa.h"
@@ -556,6 +556,13 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
 		ah_attr->grh.hop_limit     = rec->hop_limit;
 		ah_attr->grh.traffic_class = rec->traffic_class;
 	}
+	if (force_grh) {
+		memcpy(ah_attr->dmac, rec->dmac, ETH_ALEN);
+		ah_attr->vlan_id = rec->vlan_id;
+	} else {
+		ah_attr->vlan_id = 0xffff;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL(ib_init_ah_from_path);
@@ -670,6 +677,9 @@ static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query,
 
 		ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table),
 			  mad->data, &rec);
+		rec.vlan_id = 0xffff;
+		memset(rec.dmac, 0, ETH_ALEN);
+		memset(rec.smac, 0, ETH_ALEN);
 		query->callback(status, &rec, query->context);
 	} else
 		query->callback(status, NULL, query->context);
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a321df2..baab03d 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -44,6 +44,7 @@
 
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_cache.h>
+#include <rdma/ib_addr.h>
 
 int ib_rate_to_mult(enum ib_rate rate)
 {
@@ -189,8 +190,28 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 port_num, struct ib_wc *wc,
 	u32 flow_class;
 	u16 gid_index;
 	int ret;
+	int is_eth = (rdma_port_get_link_layer(device, port_num) ==
+			IB_LINK_LAYER_ETHERNET);
 
 	memset(ah_attr, 0, sizeof *ah_attr);
+	if (is_eth) {
+		if (!(wc->wc_flags & IB_WC_GRH))
+			return -EPROTOTYPE;
+
+		if (wc->wc_flags & IB_WC_WITH_SMAC &&
+		    wc->wc_flags & IB_WC_WITH_VLAN) {
+			memcpy(ah_attr->dmac, wc->smac, ETH_ALEN);
+			ah_attr->vlan_id = wc->vlan_id;
+		} else {
+			ret = rdma_addr_find_dmac_by_grh(&grh->dgid, &grh->sgid,
+					ah_attr->dmac, &ah_attr->vlan_id);
+			if (ret)
+				return ret;
+		}
+	} else {
+		ah_attr->vlan_id = 0xffff;
+	}
+
 	ah_attr->dlid = wc->slid;
 	ah_attr->sl = wc->sl;
 	ah_attr->src_path_bits = wc->dlid_path_bits;
@@ -473,7 +494,9 @@ EXPORT_SYMBOL(ib_create_qp);
 static const struct {
 	int			valid;
 	enum ib_qp_attr_mask	req_param[IB_QPT_MAX];
+	enum ib_qp_attr_mask	req_param_add_eth[IB_QPT_MAX];
 	enum ib_qp_attr_mask	opt_param[IB_QPT_MAX];
+	enum ib_qp_attr_mask	opt_param_add_eth[IB_QPT_MAX];
 } qp_state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = {
 	[IB_QPS_RESET] = {
 		[IB_QPS_RESET] = { .valid = 1 },
@@ -554,6 +577,10 @@ static const struct {
 						IB_QP_MAX_DEST_RD_ATOMIC	|
 						IB_QP_MIN_RNR_TIMER),
 			},
+			.req_param_add_eth = {
+				[IB_QPT_RC]  = (IB_QP_DEST_EX			|
+						IB_QP_SMAC)
+			},
 			.opt_param = {
 				 [IB_QPT_UD]  = (IB_QP_PKEY_INDEX		|
 						 IB_QP_QKEY),
@@ -573,7 +600,13 @@ static const struct {
 						 IB_QP_QKEY),
 				 [IB_QPT_GSI] = (IB_QP_PKEY_INDEX		|
 						 IB_QP_QKEY),
-			 }
+			 },
+			.opt_param_add_eth = {
+				[IB_QPT_RC]  = (IB_QP_ALT_DEST_EX		|
+						IB_QP_ALT_SMAC			|
+						IB_QP_VID			|
+						IB_QP_ALT_VID)
+			}
 		}
 	},
 	[IB_QPS_RTR]   = {
@@ -776,7 +809,8 @@ static const struct {
 };
 
 int ib_modify_qp_is_ok(enum ib_qp_state cur_state, enum ib_qp_state next_state,
-		       enum ib_qp_type type, enum ib_qp_attr_mask mask)
+		       enum ib_qp_type type, enum ib_qp_attr_mask mask,
+		       enum rdma_link_layer ll)
 {
 	enum ib_qp_attr_mask req_param, opt_param;
 
@@ -795,6 +829,13 @@ int ib_modify_qp_is_ok(enum ib_qp_state cur_state, enum ib_qp_state next_state,
 	req_param = qp_state_table[cur_state][next_state].req_param[type];
 	opt_param = qp_state_table[cur_state][next_state].opt_param[type];
 
+	if (ll == IB_LINK_LAYER_ETHERNET) {
+		req_param |= qp_state_table[cur_state][next_state].
+			req_param_add_eth[type];
+		opt_param |= qp_state_table[cur_state][next_state].
+			opt_param_add_eth[type];
+	}
+
 	if ((mask & req_param) != req_param)
 		return 0;
 
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 00d6861..2e89356 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -1329,7 +1329,7 @@ static int internal_modify_qp(struct ib_qp *ibqp,
 	qp_new_state = attr_mask & IB_QP_STATE ? attr->qp_state : qp_cur_state;
 	if (!smi_reset2init &&
 	    !ib_modify_qp_is_ok(qp_cur_state, qp_new_state, ibqp->qp_type,
-				attr_mask)) {
+				attr_mask, IB_LINK_LAYER_UNSPECIFIED)) {
 		ret = -EINVAL;
 		ehca_err(ibqp->device,
 			 "Invalid qp transition new_state=%x cur_state=%x "
diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c
index 0857a9c..face876 100644
--- a/drivers/infiniband/hw/ipath/ipath_qp.c
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c
@@ -463,7 +463,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
 
 	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type,
-				attr_mask))
+				attr_mask, IB_LINK_LAYER_UNSPECIFIED))
 		goto inval;
 
 	if (attr_mask & IB_QP_AV) {
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 4f10af2..da6f5fa 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1561,13 +1561,18 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	struct mlx4_ib_qp *qp = to_mqp(ibqp);
 	enum ib_qp_state cur_state, new_state;
 	int err = -EINVAL;
-
+	int p = attr_mask & IB_QP_PORT ? attr->port_num : qp->port;
 	mutex_lock(&qp->mutex);
 
 	cur_state = attr_mask & IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state;
 	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
 
-	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) {
+	if (cur_state == new_state && cur_state == IB_QPS_RESET)
+		p = IB_LINK_LAYER_UNSPECIFIED;
+
+	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type,
+				attr_mask,
+				rdma_port_get_link_layer(&dev->ib_dev, p))) {
 		pr_debug("qpn 0x%x: invalid attribute mask specified "
 			 "for transition %d to %d. qp_type %d,"
 			 " attr_mask 0x%x\n",
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 045f8cd..50051a4 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -1593,7 +1593,8 @@ int mlx5_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
 
 	if (ibqp->qp_type != MLX5_IB_QPT_REG_UMR &&
-	    !ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask))
+	    !ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask,
+				IB_LINK_LAYER_UNSPECIFIED))
 		goto out;
 
 	if ((attr_mask & IB_QP_PORT) &&
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index 26a6845..e354b2f 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -860,7 +860,8 @@ int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask,
 
 	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
 
-	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask)) {
+	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type, attr_mask,
+				IB_LINK_LAYER_UNSPECIFIED)) {
 		mthca_dbg(dev, "Bad QP transition (transport %d) "
 			  "%d->%d with attr 0x%08x\n",
 			  qp->transport, cur_state, new_state,
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 6e982bb..0607ca7 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -1326,7 +1326,8 @@ int ocrdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 		new_qps = old_qps;
 	spin_unlock_irqrestore(&qp->q_lock, flags);
 
-	if (!ib_modify_qp_is_ok(old_qps, new_qps, ibqp->qp_type, attr_mask)) {
+	if (!ib_modify_qp_is_ok(old_qps, new_qps, ibqp->qp_type, attr_mask,
+				IB_LINK_LAYER_UNSPECIFIED)) {
 		pr_err("%s(%d) invalid attribute mask=0x%x specified for\n"
 		       "qpn=0x%x of type=0x%x old_qps=0x%x, new_qps=0x%x\n",
 		       __func__, dev->id, attr_mask, qp->id, ibqp->qp_type,
diff --git a/drivers/infiniband/hw/qib/qib_qp.c b/drivers/infiniband/hw/qib/qib_qp.c
index 3cca55b..0cad0c4 100644
--- a/drivers/infiniband/hw/qib/qib_qp.c
+++ b/drivers/infiniband/hw/qib/qib_qp.c
@@ -585,7 +585,7 @@ int qib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
 
 	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type,
-				attr_mask))
+				attr_mask, IB_LINK_LAYER_UNSPECIFIED))
 		goto inval;
 
 	if (attr_mask & IB_QP_AV) {
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index d73423c..4b9162c 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -1074,6 +1074,7 @@ int mlx4_SET_PORT_qpn_calc(struct mlx4_dev *dev, u8 port, u32 base_qpn,
 int mlx4_SET_PORT_PRIO2TC(struct mlx4_dev *dev, u8 port, u8 *prio2tc);
 int mlx4_SET_PORT_SCHEDULER(struct mlx4_dev *dev, u8 port, u8 *tc_tx_bw,
 		u8 *pg, u16 *ratelimit);
+int mlx4_find_cached_mac(struct mlx4_dev *dev, u8 port, u64 mac, int *idx);
 int mlx4_find_cached_vlan(struct mlx4_dev *dev, u8 port, u16 vid, int *idx);
 int mlx4_register_vlan(struct mlx4_dev *dev, u8 port, u16 vlan, int *index);
 void mlx4_unregister_vlan(struct mlx4_dev *dev, u8 port, int index);
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index f3ac0f2..a071560 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -42,6 +42,7 @@
 #include <linux/if_vlan.h>
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_pack.h>
+#include <net/ipv6.h>
 
 struct rdma_addr_client {
 	atomic_t refcount;
@@ -72,7 +73,8 @@ struct rdma_dev_addr {
  * rdma_translate_ip - Translate a local IP address to an RDMA hardware
  *   address.
  */
-int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr);
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
+		      u16 *vlan_id);
 
 /**
  * rdma_resolve_ip - Resolve source and destination IP addresses to
@@ -104,6 +106,10 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
 
 int rdma_addr_size(struct sockaddr *addr);
 
+int rdma_addr_find_smac_by_sgid(union ib_gid *sgid, u8 *smac, u16 *vlan_id);
+int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *smac,
+			       u16 *vlan_id);
+
 static inline u16 ib_addr_get_pkey(struct rdma_dev_addr *dev_addr)
 {
 	return ((u16)dev_addr->broadcast[8] << 8) | (u16)dev_addr->broadcast[9];
@@ -142,6 +148,40 @@ static inline void iboe_mac_vlan_to_ll(union ib_gid *gid, u8 *mac, u16 vid)
 	gid->raw[8] ^= 2;
 }
 
+static inline int rdma_ip2gid(struct sockaddr *addr, union ib_gid *gid)
+{
+	switch (addr->sa_family) {
+	case AF_INET:
+		ipv6_addr_set_v4mapped(((struct sockaddr_in *)
+					addr)->sin_addr.s_addr,
+				       (struct in6_addr *)gid);
+		break;
+	case AF_INET6:
+		memcpy(gid->raw, &((struct sockaddr_in6 *)addr)->sin6_addr, 16);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/* Important - sockaddr should be a union of sockaddr_in and sockaddr_in6 */
+static inline int rdma_gid2ip(struct sockaddr *out, union ib_gid *gid)
+{
+	if (ipv6_addr_v4mapped((struct in6_addr *)gid)) {
+		struct sockaddr_in *out_in = (struct sockaddr_in *)out;
+		memset(out_in, 0, sizeof(*out_in));
+		out_in->sin_family = AF_INET;
+		memcpy(&out_in->sin_addr.s_addr, gid->raw + 12, 4);
+	} else {
+		struct sockaddr_in6 *out_in = (struct sockaddr_in6 *)out;
+		memset(out_in, 0, sizeof(*out_in));
+		out_in->sin6_family = AF_INET6;
+		memcpy(&out_in->sin6_addr.s6_addr, gid->raw, 16);
+	}
+	return 0;
+}
+
 static inline u16 rdma_vlan_dev_vlan_id(const struct net_device *dev)
 {
 	return dev->priv_flags & IFF_802_1Q_VLAN ?
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 0e3ff30..f29e3a2 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -601,4 +601,5 @@ struct ib_cm_sidr_rep_param {
 int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id,
 			struct ib_cm_sidr_rep_param *param);
 
+int ib_update_cm_av(struct ib_cm_id *id, const u8 *smac, const u8 *alt_smac);
 #endif /* IB_CM_H */
diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h
index b37fe3b..b1f7592 100644
--- a/include/rdma/ib_pack.h
+++ b/include/rdma/ib_pack.h
@@ -34,6 +34,7 @@
 #define IB_PACK_H
 
 #include <rdma/ib_verbs.h>
+#include <uapi/linux/if_ether.h>
 
 enum {
 	IB_LRH_BYTES  = 8,
diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
index 125f871..7e071a6 100644
--- a/include/rdma/ib_sa.h
+++ b/include/rdma/ib_sa.h
@@ -154,6 +154,9 @@ struct ib_sa_path_rec {
 	u8           packet_life_time_selector;
 	u8           packet_life_time;
 	u8           preference;
+	u8           smac[ETH_ALEN];
+	u8           dmac[ETH_ALEN];
+	u16	     vlan_id;
 };
 
 #define IB_SA_MCMEMBER_REC_MGID				IB_SA_COMP_MASK( 0)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index e393171..c8a61f8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -48,6 +48,7 @@
 #include <linux/rwsem.h>
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
+#include <uapi/linux/if_ether.h>
 
 #include <linux/atomic.h>
 #include <asm/uaccess.h>
@@ -470,6 +471,8 @@ struct ib_ah_attr {
 	u8			static_rate;
 	u8			ah_flags;
 	u8			port_num;
+	u8			dmac[ETH_ALEN];
+	u16			vlan_id;
 };
 
 enum ib_wc_status {
@@ -522,6 +525,8 @@ enum ib_wc_flags {
 	IB_WC_WITH_IMM		= (1<<1),
 	IB_WC_WITH_INVALIDATE	= (1<<2),
 	IB_WC_IP_CSUM_OK	= (1<<3),
+	IB_WC_WITH_SMAC		= (1<<4),
+	IB_WC_WITH_VLAN		= (1<<5),
 };
 
 struct ib_wc {
@@ -542,6 +547,8 @@ struct ib_wc {
 	u8			sl;
 	u8			dlid_path_bits;
 	u8			port_num;	/* valid only for DR SMPs on switches */
+	u8			smac[ETH_ALEN];
+	u16			vlan_id;
 };
 
 enum ib_cq_notify_flags {
@@ -719,7 +726,13 @@ enum ib_qp_attr_mask {
 	IB_QP_MAX_DEST_RD_ATOMIC	= (1<<17),
 	IB_QP_PATH_MIG_STATE		= (1<<18),
 	IB_QP_CAP			= (1<<19),
-	IB_QP_DEST_QPN			= (1<<20)
+	IB_QP_DEST_QPN			= (1<<20),
+	IB_QP_DEST_EX			= (1<<21),
+	IB_QP_ALT_DEST_EX		= (1<<22),
+	IB_QP_SMAC			= (1<<23),
+	IB_QP_ALT_SMAC			= (1<<24),
+	IB_QP_VID			= (1<<25),
+	IB_QP_ALT_VID			= (1<<26),
 };
 
 enum ib_qp_state {
@@ -769,6 +782,10 @@ struct ib_qp_attr {
 	u8			rnr_retry;
 	u8			alt_port_num;
 	u8			alt_timeout;
+	u8			smac[ETH_ALEN];
+	u8			alt_smac[ETH_ALEN];
+	u16			vlan_id;
+	u16			alt_vlan_id;
 };
 
 enum ib_wr_opcode {
@@ -1485,6 +1502,7 @@ static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len
  * @next_state: Next QP state
  * @type: QP type
  * @mask: Mask of supplied QP attributes
+ * @ll : link layer of port
  *
  * This function is a helper function that a low-level driver's
  * modify_qp method can use to validate the consumer's input.  It
@@ -1493,7 +1511,8 @@ static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len
  * and that the attribute mask supplied is allowed for the transition.
  */
 int ib_modify_qp_is_ok(enum ib_qp_state cur_state, enum ib_qp_state next_state,
-		       enum ib_qp_type type, enum ib_qp_attr_mask mask);
+		       enum ib_qp_type type, enum ib_qp_attr_mask mask,
+		       enum rdma_link_layer ll);
 
 int ib_register_event_handler  (struct ib_event_handler *event_handler);
 int ib_unregister_event_handler(struct ib_event_handler *event_handler);
-- 
1.7.1


* [PATCH V4 2/9] IB/CMA: RoCE IP based GID addressing
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-09-10 14:41   ` [PATCH V4 1/9] IB/core: Ethernet L2 attributes in verbs/cm structures Or Gerlitz
@ 2013-09-10 14:41   ` Or Gerlitz
  2013-09-10 14:41   ` [PATCH V4 3/9] IB/mlx4: Use RoCE IP based GIDs in the port GID table Or Gerlitz
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Or Gerlitz

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Currently, the IB core, and specifically the RDMA-CM, assumes that
RoCE (IBoE) GIDs encode the related Ethernet netdevice interface
MAC address and possibly a VLAN id.

Change GIDs to be treated as encoding the interface IP address.

Since Ethernet layer 2 address parameters are no longer encoded
within GIDs, the InfiniBand address structures (e.g. ib_ah_attr)
had to be extended with layer 2 address parameters, namely MAC
and VLAN.
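
As an illustrative sketch (the address is an example only, not from the
patch), an IP based GID is simply the interface IP address formatted as a
GID, using the rdma_ip2gid() helper introduced in the previous patch:

	struct sockaddr_in in = { .sin_family = AF_INET };
	union ib_gid gid;

	in.sin_addr.s_addr = htonl(0x0a000001);	/* 10.0.0.1 */
	rdma_ip2gid((struct sockaddr *)&in, &gid);
	/* gid now holds the v4-mapped address ::ffff:10.0.0.1 */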

Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/cma.c  |   22 +++++++++--------
 drivers/infiniband/core/ucma.c |   18 +++-----------
 include/rdma/ib_addr.h         |   50 +++++++++------------------------------
 3 files changed, 28 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 27acdec..2497031 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -386,7 +386,9 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 		return -EINVAL;
 
 	mutex_lock(&lock);
-	iboe_addr_get_sgid(dev_addr, &iboe_gid);
+	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.src_addr,
+		    &iboe_gid);
+
 	memcpy(&gid, dev_addr->src_dev_addr +
 	       rdma_addr_gid_offset(dev_addr), sizeof gid);
 	list_for_each_entry(cma_dev, &dev_list, list) {
@@ -1925,10 +1927,10 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 	memcpy(route->path_rec->dmac, addr->dev_addr.dst_dev_addr, ETH_ALEN);
 	memcpy(route->path_rec->smac, ndev->dev_addr, ndev->addr_len);
 
-	iboe_mac_vlan_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr,
-			    route->path_rec->vlan_id);
-	iboe_mac_vlan_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr,
-			    route->path_rec->vlan_id);
+	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.src_addr,
+		    &route->path_rec->sgid);
+	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.dst_addr,
+		    &route->path_rec->dgid);
 
 	route->path_rec->hop_limit = 1;
 	route->path_rec->reversible = 1;
@@ -2095,6 +2097,7 @@ static void addr_handler(int status, struct sockaddr *src_addr,
 			   RDMA_CM_ADDR_RESOLVED))
 		goto out;
 
+	memcpy(cma_src_addr(id_priv), src_addr, rdma_addr_size(src_addr));
 	if (!status && !id_priv->cma_dev)
 		status = cma_acquire_dev(id_priv);
 
@@ -2104,10 +2107,8 @@ static void addr_handler(int status, struct sockaddr *src_addr,
 			goto out;
 		event.event = RDMA_CM_EVENT_ADDR_ERROR;
 		event.status = status;
-	} else {
-		memcpy(cma_src_addr(id_priv), src_addr, rdma_addr_size(src_addr));
+	} else
 		event.event = RDMA_CM_EVENT_ADDR_RESOLVED;
-	}
 
 	if (id_priv->id.event_handler(&id_priv->id, &event)) {
 		cma_exch(id_priv, RDMA_CM_DESTROYING);
@@ -2588,6 +2589,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 	if (ret)
 		goto err1;
 
+	memcpy(cma_src_addr(id_priv), addr, rdma_addr_size(addr));
 	if (!cma_any_addr(addr)) {
 		ret = cma_translate_addr(addr, &id->route.addr.dev_addr);
 		if (ret)
@@ -2598,7 +2600,6 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 			goto err1;
 	}
 
-	memcpy(cma_src_addr(id_priv), addr, rdma_addr_size(addr));
 	if (!(id_priv->options & (1 << CMA_OPTION_AFONLY))) {
 		if (addr->sa_family == AF_INET)
 			id_priv->afonly = 1;
@@ -3327,7 +3328,8 @@ static int cma_iboe_join_multicast(struct rdma_id_private *id_priv,
 		err = -EINVAL;
 		goto out2;
 	}
-	iboe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid);
+	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.src_addr,
+		    &mc->multicast.ib->rec.port_gid);
 	work->id = id_priv;
 	work->mc = mc;
 	INIT_WORK(&work->work, iboe_mcast_work_handler);
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index b0f189b..7e7da86 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -655,24 +655,14 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
 static void ucma_copy_iboe_route(struct rdma_ucm_query_route_resp *resp,
 				 struct rdma_route *route)
 {
-	struct rdma_dev_addr *dev_addr;
-	struct net_device *dev;
-	u16 vid = 0;
 
 	resp->num_paths = route->num_paths;
 	switch (route->num_paths) {
 	case 0:
-		dev_addr = &route->addr.dev_addr;
-		dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if);
-			if (dev) {
-				vid = rdma_vlan_dev_vlan_id(dev);
-				dev_put(dev);
-			}
-
-		iboe_mac_vlan_to_ll((union ib_gid *) &resp->ib_route[0].dgid,
-				    dev_addr->dst_dev_addr, vid);
-		iboe_addr_get_sgid(dev_addr,
-				   (union ib_gid *) &resp->ib_route[0].sgid);
+		rdma_ip2gid((struct sockaddr *)&route->addr.dst_addr,
+			    (union ib_gid *)&resp->ib_route[0].dgid);
+		rdma_ip2gid((struct sockaddr *)&route->addr.src_addr,
+			    (union ib_gid *)&resp->ib_route[0].sgid);
 		resp->ib_route[0].pkey = cpu_to_be16(0xffff);
 		break;
 	case 2:
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index a071560..53f81fb 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -38,8 +38,12 @@
 #include <linux/in6.h>
 #include <linux/if_arp.h>
 #include <linux/netdevice.h>
+#include <linux/inetdevice.h>
 #include <linux/socket.h>
 #include <linux/if_vlan.h>
+#include <net/ipv6.h>
+#include <net/if_inet6.h>
+#include <net/ip.h>
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_pack.h>
 #include <net/ipv6.h>
@@ -132,20 +136,10 @@ static inline int rdma_addr_gid_offset(struct rdma_dev_addr *dev_addr)
 	return dev_addr->dev_type == ARPHRD_INFINIBAND ? 4 : 0;
 }
 
-static inline void iboe_mac_vlan_to_ll(union ib_gid *gid, u8 *mac, u16 vid)
+static inline u16 rdma_vlan_dev_vlan_id(const struct net_device *dev)
 {
-	memset(gid->raw, 0, 16);
-	*((__be32 *) gid->raw) = cpu_to_be32(0xfe800000);
-	if (vid < 0x1000) {
-		gid->raw[12] = vid & 0xff;
-		gid->raw[11] = vid >> 8;
-	} else {
-		gid->raw[12] = 0xfe;
-		gid->raw[11] = 0xff;
-	}
-	memcpy(gid->raw + 13, mac + 3, 3);
-	memcpy(gid->raw + 8, mac, 3);
-	gid->raw[8] ^= 2;
+	return dev->priv_flags & IFF_802_1Q_VLAN ?
+		vlan_dev_vlan_id(dev) : 0xffff;
 }
 
 static inline int rdma_ip2gid(struct sockaddr *addr, union ib_gid *gid)
@@ -182,25 +176,20 @@ static inline int rdma_gid2ip(struct sockaddr *out, union ib_gid *gid)
 	return 0;
 }
 
-static inline u16 rdma_vlan_dev_vlan_id(const struct net_device *dev)
-{
-	return dev->priv_flags & IFF_802_1Q_VLAN ?
-		vlan_dev_vlan_id(dev) : 0xffff;
-}
-
 static inline void iboe_addr_get_sgid(struct rdma_dev_addr *dev_addr,
 				      union ib_gid *gid)
 {
 	struct net_device *dev;
-	u16 vid = 0xffff;
+	struct in_device *ip4;
 
 	dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if);
 	if (dev) {
-		vid = rdma_vlan_dev_vlan_id(dev);
+		ip4 = (struct in_device *)dev->ip_ptr;
+		if (ip4 && ip4->ifa_list && ip4->ifa_list->ifa_address)
+			ipv6_addr_set_v4mapped(ip4->ifa_list->ifa_address,
+					       (struct in6_addr *)gid);
 		dev_put(dev);
 	}
-
-	iboe_mac_vlan_to_ll(gid, dev_addr->src_dev_addr, vid);
 }
 
 static inline void rdma_addr_get_sgid(struct rdma_dev_addr *dev_addr, union ib_gid *gid)
@@ -284,13 +273,6 @@ static inline int rdma_link_local_addr(struct in6_addr *addr)
 	return 0;
 }
 
-static inline void rdma_get_ll_mac(struct in6_addr *addr, u8 *mac)
-{
-	memcpy(mac, &addr->s6_addr[8], 3);
-	memcpy(mac + 3, &addr->s6_addr[13], 3);
-	mac[0] ^= 2;
-}
-
 static inline int rdma_is_multicast_addr(struct in6_addr *addr)
 {
 	return addr->s6_addr[0] == 0xff;
@@ -306,14 +288,6 @@ static inline void rdma_get_mcast_mac(struct in6_addr *addr, u8 *mac)
 		mac[i] = addr->s6_addr[i + 10];
 }
 
-static inline u16 rdma_get_vlan_id(union ib_gid *dgid)
-{
-	u16 vid;
-
-	vid = dgid->raw[11] << 8 | dgid->raw[12];
-	return vid < 0x1000 ? vid : 0xffff;
-}
-
 static inline struct net_device *rdma_vlan_dev_real_dev(const struct net_device *dev)
 {
 	return dev->priv_flags & IFF_802_1Q_VLAN ?
-- 
1.7.1


* [PATCH V4 3/9] IB/mlx4: Use RoCE IP based GIDs in the port GID table
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-09-10 14:41   ` [PATCH V4 1/9] IB/core: Ethernet L2 attributes in verbs/cm structures Or Gerlitz
  2013-09-10 14:41   ` [PATCH V4 2/9] IB/CMA: RoCE IP based GID addressing Or Gerlitz
@ 2013-09-10 14:41   ` Or Gerlitz
  2013-09-10 14:41   ` [PATCH V4 4/9] IB/mlx4: Handle Ethernet L2 parameters for IP based GID addressing Or Gerlitz
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Or Gerlitz

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Currently, the mlx4 driver sets RoCE (IBoE) GIDs to encode the related
Ethernet netdevice interface MAC address and possibly a VLAN id.

Change this scheme such that GIDs encode the interface IP addresses
(both IPv4 and IPv6).

This requires learning which IP addresses are in use by a netdevice
associated with the HCA port, formatting them as GIDs and adding them
to the port GID table. Further, address add and delete events are
caught in order to maintain the GID table accordingly.

Associated IP addresses may belong to a master of the Ethernet netdevice
on top of that port, so this should be considered when building and
maintaining the GID table.
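
For context, a minimal sketch (illustration only, not the actual patch code)
of catching IPv4 address add/delete events via the standard inetaddr
notifier chain; the patch also has to cover IPv6 and bonding masters, and
the function name here is assumed for the example:

	static int mlx4_ib_inet_event(struct notifier_block *this,
				      unsigned long event, void *ptr)
	{
		struct in_ifaddr *ifa = ptr;
		union ib_gid gid;

		/* format the IPv4 address as an IP based GID */
		ipv6_addr_set_v4mapped(ifa->ifa_address,
				       (struct in6_addr *)&gid);
		switch (event) {
		case NETDEV_UP:
			/* add the gid to the port GID table */
			break;
		case NETDEV_DOWN:
			/* clear the matching GID table entry */
			break;
		}
		return NOTIFY_DONE;
	}

Such a callback would be registered once at driver init time with
register_inetaddr_notifier().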

Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c    |  474 ++++++++++++++++++++++++----------
 drivers/infiniband/hw/mlx4/mlx4_ib.h |    3 +
 2 files changed, 334 insertions(+), 143 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index d6c5a73..7a29ad5 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -39,6 +39,8 @@
 #include <linux/inetdevice.h>
 #include <linux/rtnetlink.h>
 #include <linux/if_vlan.h>
+#include <net/ipv6.h>
+#include <net/addrconf.h>
 
 #include <rdma/ib_smi.h>
 #include <rdma/ib_user_verbs.h>
@@ -790,7 +792,6 @@ static int add_gid_entry(struct ib_qp *ibqp, union ib_gid *gid)
 int mlx4_ib_add_mc(struct mlx4_ib_dev *mdev, struct mlx4_ib_qp *mqp,
 		   union ib_gid *gid)
 {
-	u8 mac[6];
 	struct net_device *ndev;
 	int ret = 0;
 
@@ -804,11 +805,7 @@ int mlx4_ib_add_mc(struct mlx4_ib_dev *mdev, struct mlx4_ib_qp *mqp,
 	spin_unlock(&mdev->iboe.lock);
 
 	if (ndev) {
-		rdma_get_mcast_mac((struct in6_addr *)gid, mac);
-		rtnl_lock();
-		dev_mc_add(mdev->iboe.netdevs[mqp->port - 1], mac);
 		ret = 1;
-		rtnl_unlock();
 		dev_put(ndev);
 	}
 
@@ -1031,6 +1028,8 @@ static int mlx4_ib_mcg_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 	struct mlx4_ib_qp *mqp = to_mqp(ibqp);
 	u64 reg_id;
 	struct mlx4_ib_steering *ib_steering = NULL;
+	enum mlx4_protocol prot = (gid->raw[1] == 0x0e) ?
+		MLX4_PROT_IB_IPV4 : MLX4_PROT_IB_IPV6;
 
 	if (mdev->dev->caps.steering_mode ==
 	    MLX4_STEERING_MODE_DEVICE_MANAGED) {
@@ -1042,7 +1041,7 @@ static int mlx4_ib_mcg_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 	err = mlx4_multicast_attach(mdev->dev, &mqp->mqp, gid->raw, mqp->port,
 				    !!(mqp->flags &
 				       MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK),
-				    MLX4_PROT_IB_IPV6, &reg_id);
+				    prot, &reg_id);
 	if (err)
 		goto err_malloc;
 
@@ -1061,7 +1060,7 @@ static int mlx4_ib_mcg_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 
 err_add:
 	mlx4_multicast_detach(mdev->dev, &mqp->mqp, gid->raw,
-			      MLX4_PROT_IB_IPV6, reg_id);
+			      prot, reg_id);
 err_malloc:
 	kfree(ib_steering);
 
@@ -1089,10 +1088,11 @@ static int mlx4_ib_mcg_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 	int err;
 	struct mlx4_ib_dev *mdev = to_mdev(ibqp->device);
 	struct mlx4_ib_qp *mqp = to_mqp(ibqp);
-	u8 mac[6];
 	struct net_device *ndev;
 	struct mlx4_ib_gid_entry *ge;
 	u64 reg_id = 0;
+	enum mlx4_protocol prot = (gid->raw[1] == 0x0e) ?
+		MLX4_PROT_IB_IPV4 : MLX4_PROT_IB_IPV6;
 
 	if (mdev->dev->caps.steering_mode ==
 	    MLX4_STEERING_MODE_DEVICE_MANAGED) {
@@ -1115,7 +1115,7 @@ static int mlx4_ib_mcg_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 	}
 
 	err = mlx4_multicast_detach(mdev->dev, &mqp->mqp, gid->raw,
-				    MLX4_PROT_IB_IPV6, reg_id);
+				    prot, reg_id);
 	if (err)
 		return err;
 
@@ -1127,13 +1127,8 @@ static int mlx4_ib_mcg_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 		if (ndev)
 			dev_hold(ndev);
 		spin_unlock(&mdev->iboe.lock);
-		rdma_get_mcast_mac((struct in6_addr *)gid, mac);
-		if (ndev) {
-			rtnl_lock();
-			dev_mc_del(mdev->iboe.netdevs[ge->port - 1], mac);
-			rtnl_unlock();
+		if (ndev)
 			dev_put(ndev);
-		}
 		list_del(&ge->list);
 		kfree(ge);
 	} else
@@ -1229,20 +1224,6 @@ static struct device_attribute *mlx4_class_attributes[] = {
 	&dev_attr_board_id
 };
 
-static void mlx4_addrconf_ifid_eui48(u8 *eui, u16 vlan_id, struct net_device *dev)
-{
-	memcpy(eui, dev->dev_addr, 3);
-	memcpy(eui + 5, dev->dev_addr + 3, 3);
-	if (vlan_id < 0x1000) {
-		eui[3] = vlan_id >> 8;
-		eui[4] = vlan_id & 0xff;
-	} else {
-		eui[3] = 0xff;
-		eui[4] = 0xfe;
-	}
-	eui[0] ^= 2;
-}
-
 static void update_gids_task(struct work_struct *work)
 {
 	struct update_gid_work *gw = container_of(work, struct update_gid_work, work);
@@ -1265,161 +1246,318 @@ static void update_gids_task(struct work_struct *work)
 		       MLX4_CMD_WRAPPED);
 	if (err)
 		pr_warn("set port command failed\n");
-	else {
-		memcpy(gw->dev->iboe.gid_table[gw->port - 1], gw->gids, sizeof gw->gids);
+	else
 		mlx4_ib_dispatch_event(gw->dev, gw->port, IB_EVENT_GID_CHANGE);
-	}
 
 	mlx4_free_cmd_mailbox(dev, mailbox);
 	kfree(gw);
 }
 
-static int update_ipv6_gids(struct mlx4_ib_dev *dev, int port, int clear)
+static void reset_gids_task(struct work_struct *work)
 {
-	struct net_device *ndev = dev->iboe.netdevs[port - 1];
-	struct update_gid_work *work;
-	struct net_device *tmp;
+	struct update_gid_work *gw =
+			container_of(work, struct update_gid_work, work);
+	struct mlx4_cmd_mailbox *mailbox;
+	union ib_gid *gids;
+	int err;
 	int i;
-	u8 *hits;
-	int ret;
-	union ib_gid gid;
-	int free;
-	int found;
-	int need_update = 0;
-	u16 vid;
+	struct mlx4_dev	*dev = gw->dev->dev;
 
-	work = kzalloc(sizeof *work, GFP_ATOMIC);
-	if (!work)
-		return -ENOMEM;
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox)) {
+		pr_warn("reset gid table failed\n");
+		goto free;
+	}
 
-	hits = kzalloc(128, GFP_ATOMIC);
-	if (!hits) {
-		ret = -ENOMEM;
-		goto out;
+	gids = mailbox->buf;
+	memcpy(gids, gw->gids, sizeof(gw->gids));
+
+	for (i = 1; i < gw->dev->num_ports + 1; i++) {
+		if (mlx4_ib_port_link_layer(&gw->dev->ib_dev, i) ==
+					    IB_LINK_LAYER_ETHERNET) {
+			err = mlx4_cmd(dev, mailbox->dma,
+				       MLX4_SET_PORT_GID_TABLE << 8 | i,
+				       1, MLX4_CMD_SET_PORT,
+				       MLX4_CMD_TIME_CLASS_B,
+				       MLX4_CMD_WRAPPED);
+			if (err)
+				pr_warn("set port %d command failed\n",
+					i);
+		}
 	}
 
-	rcu_read_lock();
-	for_each_netdev_rcu(&init_net, tmp) {
-		if (ndev && (tmp == ndev || rdma_vlan_dev_real_dev(tmp) == ndev)) {
-			gid.global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
-			vid = rdma_vlan_dev_vlan_id(tmp);
-			mlx4_addrconf_ifid_eui48(&gid.raw[8], vid, ndev);
-			found = 0;
-			free = -1;
-			for (i = 0; i < 128; ++i) {
-				if (free < 0 &&
-				    !memcmp(&dev->iboe.gid_table[port - 1][i], &zgid, sizeof zgid))
-					free = i;
-				if (!memcmp(&dev->iboe.gid_table[port - 1][i], &gid, sizeof gid)) {
-					hits[i] = 1;
-					found = 1;
-					break;
-				}
-			}
+	mlx4_free_cmd_mailbox(dev, mailbox);
+free:
+	kfree(gw);
+}
 
-			if (!found) {
-				if (tmp == ndev &&
-				    (memcmp(&dev->iboe.gid_table[port - 1][0],
-					    &gid, sizeof gid) ||
-				     !memcmp(&dev->iboe.gid_table[port - 1][0],
-					     &zgid, sizeof gid))) {
-					dev->iboe.gid_table[port - 1][0] = gid;
-					++need_update;
-					hits[0] = 1;
-				} else if (free >= 0) {
-					dev->iboe.gid_table[port - 1][free] = gid;
-					hits[free] = 1;
-					++need_update;
-				}
+static int update_gid_table(struct mlx4_ib_dev *dev, int port,
+			    union ib_gid *gid, int clear)
+{
+	struct update_gid_work *work;
+	int i;
+	int need_update = 0;
+	int free = -1;
+	int found = -1;
+	int max_gids;
+
+	max_gids = dev->dev->caps.gid_table_len[port];
+	for (i = 0; i < max_gids; ++i) {
+		if (!memcmp(&dev->iboe.gid_table[port - 1][i], gid,
+			    sizeof(*gid)))
+			found = i;
+
+		if (clear) {
+			if (found >= 0) {
+				need_update = 1;
+				dev->iboe.gid_table[port - 1][found] = zgid;
+				break;
 			}
+		} else {
+			if (found >= 0)
+				break;
+
+			if (free < 0 &&
+			    !memcmp(&dev->iboe.gid_table[port - 1][i], &zgid,
+				    sizeof(*gid)))
+				free = i;
 		}
 	}
-	rcu_read_unlock();
 
-	for (i = 0; i < 128; ++i)
-		if (!hits[i]) {
-			if (memcmp(&dev->iboe.gid_table[port - 1][i], &zgid, sizeof zgid))
-				++need_update;
-			dev->iboe.gid_table[port - 1][i] = zgid;
-		}
+	if (found == -1 && !clear && free >= 0) {
+		dev->iboe.gid_table[port - 1][free] = *gid;
+		need_update = 1;
+	}
 
-	if (need_update) {
-		memcpy(work->gids, dev->iboe.gid_table[port - 1], sizeof work->gids);
-		INIT_WORK(&work->work, update_gids_task);
-		work->port = port;
-		work->dev = dev;
-		queue_work(wq, &work->work);
-	} else
-		kfree(work);
+	if (!need_update)
+		return 0;
+
+	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	if (!work)
+		return -ENOMEM;
+
+	memcpy(work->gids, dev->iboe.gid_table[port - 1], sizeof(work->gids));
+	INIT_WORK(&work->work, update_gids_task);
+	work->port = port;
+	work->dev = dev;
+	queue_work(wq, &work->work);
 
-	kfree(hits);
 	return 0;
+}
 
-out:
-	kfree(work);
-	return ret;
+static int reset_gid_table(struct mlx4_ib_dev *dev)
+{
+	struct update_gid_work *work;
+
+
+	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	if (!work)
+		return -ENOMEM;
+	memset(dev->iboe.gid_table, 0, sizeof(dev->iboe.gid_table));
+	memset(work->gids, 0, sizeof(work->gids));
+	INIT_WORK(&work->work, reset_gids_task);
+	work->dev = dev;
+	queue_work(wq, &work->work);
+	return 0;
 }
 
-static void handle_en_event(struct mlx4_ib_dev *dev, int port, unsigned long event)
+static int mlx4_ib_addr_event(int event, struct net_device *event_netdev,
+			      struct mlx4_ib_dev *ibdev, union ib_gid *gid)
 {
-	switch (event) {
-	case NETDEV_UP:
-	case NETDEV_CHANGEADDR:
-		update_ipv6_gids(dev, port, 0);
-		break;
+	struct mlx4_ib_iboe *iboe;
+	int port = 0;
+	struct net_device *real_dev = rdma_vlan_dev_real_dev(event_netdev) ?
+				rdma_vlan_dev_real_dev(event_netdev) :
+				event_netdev;
+
+	if (event != NETDEV_DOWN && event != NETDEV_UP)
+		return 0;
+
+	if ((real_dev != event_netdev) &&
+	    (event == NETDEV_DOWN) &&
+	    rdma_link_local_addr((struct in6_addr *)gid))
+		return 0;
+
+	iboe = &ibdev->iboe;
+	spin_lock(&iboe->lock);
+
+	for (port = 1; port <= MLX4_MAX_PORTS; ++port)
+		if ((netif_is_bond_master(real_dev) &&
+		     (real_dev == iboe->masters[port - 1])) ||
+		     (!netif_is_bond_master(real_dev) &&
+		     (real_dev == iboe->netdevs[port - 1])))
+			update_gid_table(ibdev, port, gid,
+					 event == NETDEV_DOWN);
+
+	spin_unlock(&iboe->lock);
+	return 0;
 
-	case NETDEV_DOWN:
-		update_ipv6_gids(dev, port, 1);
-		dev->iboe.netdevs[port - 1] = NULL;
-	}
 }
 
-static void netdev_added(struct mlx4_ib_dev *dev, int port)
+static u8 mlx4_ib_get_dev_port(struct net_device *dev,
+			       struct mlx4_ib_dev *ibdev)
 {
-	update_ipv6_gids(dev, port, 0);
+	u8 port = 0;
+	struct mlx4_ib_iboe *iboe;
+	struct net_device *real_dev = rdma_vlan_dev_real_dev(dev) ?
+				rdma_vlan_dev_real_dev(dev) : dev;
+
+	iboe = &ibdev->iboe;
+	spin_lock(&iboe->lock);
+
+	for (port = 1; port <= MLX4_MAX_PORTS; ++port)
+		if ((netif_is_bond_master(real_dev) &&
+		     (real_dev == iboe->masters[port - 1])) ||
+		     (!netif_is_bond_master(real_dev) &&
+		     (real_dev == iboe->netdevs[port - 1])))
+			break;
+
+	spin_unlock(&iboe->lock);
+
+	if ((port == 0) || (port > MLX4_MAX_PORTS))
+		return 0;
+	else
+		return port;
 }
 
-static void netdev_removed(struct mlx4_ib_dev *dev, int port)
+static int mlx4_ib_inet_event(struct notifier_block *this, unsigned long event,
+				void *ptr)
 {
-	update_ipv6_gids(dev, port, 1);
+	struct mlx4_ib_dev *ibdev;
+	struct in_ifaddr *ifa = ptr;
+	union ib_gid gid;
+	struct net_device *event_netdev = ifa->ifa_dev->dev;
+
+	ipv6_addr_set_v4mapped(ifa->ifa_address, (struct in6_addr *)&gid);
+
+	ibdev = container_of(this, struct mlx4_ib_dev, iboe.nb_inet);
+
+	mlx4_ib_addr_event(event, event_netdev, ibdev, &gid);
+	return NOTIFY_DONE;
 }
 
-static int mlx4_ib_netdev_event(struct notifier_block *this, unsigned long event,
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+static int mlx4_ib_inet6_event(struct notifier_block *this, unsigned long event,
 				void *ptr)
 {
-	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct mlx4_ib_dev *ibdev;
-	struct net_device *oldnd;
+	struct inet6_ifaddr *ifa = ptr;
+	union  ib_gid *gid = (union ib_gid *)&ifa->addr;
+	struct net_device *event_netdev = ifa->idev->dev;
+
+	ibdev = container_of(this, struct mlx4_ib_dev, iboe.nb_inet6);
+
+	mlx4_ib_addr_event(event, event_netdev, ibdev, gid);
+	return NOTIFY_DONE;
+}
+#endif
+
+static void mlx4_ib_get_dev_addr(struct net_device *dev,
+				 struct mlx4_ib_dev *ibdev, u8 port)
+{
+	struct in_device *in_dev;
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	struct inet6_dev *in6_dev;
+	union ib_gid  *pgid;
+	struct inet6_ifaddr *ifp;
+#endif
+	union ib_gid gid;
+
+
+	if ((port == 0) || (port > MLX4_MAX_PORTS))
+		return;
+
+	/* IPv4 gids */
+	in_dev = in_dev_get(dev);
+	if (in_dev) {
+		for_ifa(in_dev) {
+			/* turn the IPv4 address into a v4-mapped gid */
+			ipv6_addr_set_v4mapped(ifa->ifa_address,
+					       (struct in6_addr *)&gid);
+			update_gid_table(ibdev, port, &gid, 0);
+		}
+		endfor_ifa(in_dev);
+		in_dev_put(in_dev);
+	}
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	/* IPv6 gids */
+	in6_dev = in6_dev_get(dev);
+	if (in6_dev) {
+		read_lock_bh(&in6_dev->lock);
+		list_for_each_entry(ifp, &in6_dev->addr_list, if_list) {
+			pgid = (union ib_gid *)&ifp->addr;
+			update_gid_table(ibdev, port, pgid, 0);
+		}
+		read_unlock_bh(&in6_dev->lock);
+		in6_dev_put(in6_dev);
+	}
+#endif
+}
+
+static int mlx4_ib_init_gid_table(struct mlx4_ib_dev *ibdev)
+{
+	struct	net_device *dev;
+
+	if (reset_gid_table(ibdev))
+		return -1;
+
+	read_lock(&dev_base_lock);
+
+	for_each_netdev(&init_net, dev) {
+		u8 port = mlx4_ib_get_dev_port(dev, ibdev);
+		if (port)
+			mlx4_ib_get_dev_addr(dev, ibdev, port);
+	}
+
+	read_unlock(&dev_base_lock);
+
+	return 0;
+}
+
+static void mlx4_ib_scan_netdevs(struct mlx4_ib_dev *ibdev)
+{
 	struct mlx4_ib_iboe *iboe;
 	int port;
 
-	if (!net_eq(dev_net(dev), &init_net))
-		return NOTIFY_DONE;
-
-	ibdev = container_of(this, struct mlx4_ib_dev, iboe.nb);
 	iboe = &ibdev->iboe;
 
 	spin_lock(&iboe->lock);
 	mlx4_foreach_ib_transport_port(port, ibdev->dev) {
-		oldnd = iboe->netdevs[port - 1];
+		struct net_device *old_master = iboe->masters[port - 1];
+		struct net_device *curr_master;
 		iboe->netdevs[port - 1] =
 			mlx4_get_protocol_dev(ibdev->dev, MLX4_PROT_ETH, port);
-		if (oldnd != iboe->netdevs[port - 1]) {
-			if (iboe->netdevs[port - 1])
-				netdev_added(ibdev, port);
-			else
-				netdev_removed(ibdev, port);
+
+		if (iboe->netdevs[port - 1] &&
+		    netif_is_bond_slave(iboe->netdevs[port - 1])) {
+			rtnl_lock();
+			iboe->masters[port - 1] = netdev_master_upper_dev_get(
+				iboe->netdevs[port - 1]);
+			rtnl_unlock();
 		}
-	}
+		curr_master = iboe->masters[port - 1];
 
-	if (dev == iboe->netdevs[0] ||
-	    (iboe->netdevs[0] && rdma_vlan_dev_real_dev(dev) == iboe->netdevs[0]))
-		handle_en_event(ibdev, 1, event);
-	else if (dev == iboe->netdevs[1]
-		 || (iboe->netdevs[1] && rdma_vlan_dev_real_dev(dev) == iboe->netdevs[1]))
-		handle_en_event(ibdev, 2, event);
+		/* if bonding is used, it is possible that we add the master
+		 * to masters only after an IP address is assigned to the
+		 * net bonding interface */
+		if (curr_master && (old_master != curr_master))
+			mlx4_ib_get_dev_addr(curr_master, ibdev, port);
+	}
 
 	spin_unlock(&iboe->lock);
+}
+
+static int mlx4_ib_netdev_event(struct notifier_block *this,
+				unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct mlx4_ib_dev *ibdev;
+
+	if (!net_eq(dev_net(dev), &init_net))
+		return NOTIFY_DONE;
+
+	ibdev = container_of(this, struct mlx4_ib_dev, iboe.nb);
+	mlx4_ib_scan_netdevs(ibdev);
 
 	return NOTIFY_DONE;
 }
@@ -1725,11 +1863,35 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 	if (mlx4_ib_init_sriov(ibdev))
 		goto err_mad;
 
-	if (dev->caps.flags & MLX4_DEV_CAP_FLAG_IBOE && !iboe->nb.notifier_call) {
-		iboe->nb.notifier_call = mlx4_ib_netdev_event;
-		err = register_netdevice_notifier(&iboe->nb);
-		if (err)
-			goto err_sriov;
+	if (dev->caps.flags & MLX4_DEV_CAP_FLAG_IBOE) {
+		if (!iboe->nb.notifier_call) {
+			iboe->nb.notifier_call = mlx4_ib_netdev_event;
+			err = register_netdevice_notifier(&iboe->nb);
+			if (err) {
+				iboe->nb.notifier_call = NULL;
+				goto err_notif;
+			}
+		}
+		if (!iboe->nb_inet.notifier_call) {
+			iboe->nb_inet.notifier_call = mlx4_ib_inet_event;
+			err = register_inetaddr_notifier(&iboe->nb_inet);
+			if (err) {
+				iboe->nb_inet.notifier_call = NULL;
+				goto err_notif;
+			}
+		}
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+		if (!iboe->nb_inet6.notifier_call) {
+			iboe->nb_inet6.notifier_call = mlx4_ib_inet6_event;
+			err = register_inet6addr_notifier(&iboe->nb_inet6);
+			if (err) {
+				iboe->nb_inet6.notifier_call = NULL;
+				goto err_notif;
+			}
+		}
+#endif
+		mlx4_ib_scan_netdevs(ibdev);
+		mlx4_ib_init_gid_table(ibdev);
 	}
 
 	for (j = 0; j < ARRAY_SIZE(mlx4_class_attributes); ++j) {
@@ -1755,11 +1917,25 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 	return ibdev;
 
 err_notif:
-	if (unregister_netdevice_notifier(&ibdev->iboe.nb))
-		pr_warn("failure unregistering notifier\n");
+	if (ibdev->iboe.nb.notifier_call) {
+		if (unregister_netdevice_notifier(&ibdev->iboe.nb))
+			pr_warn("failure unregistering notifier\n");
+		ibdev->iboe.nb.notifier_call = NULL;
+	}
+	if (ibdev->iboe.nb_inet.notifier_call) {
+		if (unregister_inetaddr_notifier(&ibdev->iboe.nb_inet))
+			pr_warn("failure unregistering notifier\n");
+		ibdev->iboe.nb_inet.notifier_call = NULL;
+	}
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	if (ibdev->iboe.nb_inet6.notifier_call) {
+		if (unregister_inet6addr_notifier(&ibdev->iboe.nb_inet6))
+			pr_warn("failure unregistering notifier\n");
+		ibdev->iboe.nb_inet6.notifier_call = NULL;
+	}
+#endif
 	flush_workqueue(wq);
 
-err_sriov:
 	mlx4_ib_close_sriov(ibdev);
 
 err_mad:
@@ -1801,6 +1977,18 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr)
 			pr_warn("failure unregistering notifier\n");
 		ibdev->iboe.nb.notifier_call = NULL;
 	}
+	if (ibdev->iboe.nb_inet.notifier_call) {
+		if (unregister_inetaddr_notifier(&ibdev->iboe.nb_inet))
+			pr_warn("failure unregistering notifier\n");
+		ibdev->iboe.nb_inet.notifier_call = NULL;
+	}
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	if (ibdev->iboe.nb_inet6.notifier_call) {
+		if (unregister_inet6addr_notifier(&ibdev->iboe.nb_inet6))
+			pr_warn("failure unregistering notifier\n");
+		ibdev->iboe.nb_inet6.notifier_call = NULL;
+	}
+#endif
 	iounmap(ibdev->uar_map);
 	for (p = 0; p < ibdev->num_ports; ++p)
 		if (ibdev->counters[p] != -1)
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 036b663..133f41f 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -428,7 +428,10 @@ struct mlx4_ib_sriov {
 struct mlx4_ib_iboe {
 	spinlock_t		lock;
 	struct net_device      *netdevs[MLX4_MAX_PORTS];
+	struct net_device      *masters[MLX4_MAX_PORTS];
 	struct notifier_block 	nb;
+	struct notifier_block	nb_inet;
+	struct notifier_block	nb_inet6;
 	union ib_gid		gid_table[MLX4_MAX_PORTS][128];
 };
 
-- 
1.7.1


* [PATCH V4 4/9] IB/mlx4: Handle Ethernet L2 parameters for IP based GID addressing
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2013-09-10 14:41   ` [PATCH V4 3/9] IB/mlx4: Use RoCE IP based GIDs in the port GID table Or Gerlitz
@ 2013-09-10 14:41   ` Or Gerlitz
  2013-09-10 14:41   ` [PATCH V4 5/9] IB/ocrdma: Populate GID table with IP based gids Or Gerlitz
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Or Gerlitz

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

IP based RoCE gids don't store Ethernet L2 parameters, MAC and VLAN.

Hence, we now need to extract them from the CQE and place them in struct
ib_wc (to be used in the cases where they were previously taken from the gid).

Also, when modifying a QP or building an address handle, instead of
parsing the dgid to get the MAC and VLAN, take them from the
address handle attributes.
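
As an illustration (not part of the patch), here is a minimal sketch of how
a consumer can now recover the reverse-path L2 parameters from the work
completion instead of decoding them from a MAC/VLAN based gid. It assumes
the IB_WC_WITH_SMAC/IB_WC_WITH_VLAN flags and the smac/vlan_id fields this
series adds to struct ib_wc and struct ib_ah_attr:

	/* sketch: take the L2 reverse path from a CQE-filled completion */
	static void sketch_wc_to_ah_l2(struct ib_wc *wc,
				       struct ib_ah_attr *ah_attr)
	{
		if (wc->wc_flags & IB_WC_WITH_SMAC)
			/* the peer's source MAC is our destination MAC */
			memcpy(ah_attr->dmac, wc->smac, ETH_ALEN);
		if (wc->wc_flags & IB_WC_WITH_VLAN)
			ah_attr->vlan_id = wc->vlan_id; /* 0xffff = untagged */
	}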

Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/ah.c           |   40 +++---------
 drivers/infiniband/hw/mlx4/cq.c           |    9 +++
 drivers/infiniband/hw/mlx4/mlx4_ib.h      |    3 -
 drivers/infiniband/hw/mlx4/qp.c           |  105 ++++++++++++++++++++++-------
 drivers/net/ethernet/mellanox/mlx4/port.c |   20 ++++++
 include/linux/mlx4/cq.h                   |   15 +++-
 6 files changed, 130 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index a251bec..170dca6 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -39,25 +39,6 @@
 
 #include "mlx4_ib.h"
 
-int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr,
-			u8 *mac, int *is_mcast, u8 port)
-{
-	struct in6_addr in6;
-
-	*is_mcast = 0;
-
-	memcpy(&in6, ah_attr->grh.dgid.raw, sizeof in6);
-	if (rdma_link_local_addr(&in6))
-		rdma_get_ll_mac(&in6, mac);
-	else if (rdma_is_multicast_addr(&in6)) {
-		rdma_get_mcast_mac(&in6, mac);
-		*is_mcast = 1;
-	} else
-		return -EINVAL;
-
-	return 0;
-}
-
 static struct ib_ah *create_ib_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr,
 				  struct mlx4_ib_ah *ah)
 {
@@ -92,21 +73,18 @@ static struct ib_ah *create_iboe_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr
 {
 	struct mlx4_ib_dev *ibdev = to_mdev(pd->device);
 	struct mlx4_dev *dev = ibdev->dev;
-	union ib_gid sgid;
-	u8 mac[6];
-	int err;
 	int is_mcast;
+	struct in6_addr in6;
 	u16 vlan_tag;
 
-	err = mlx4_ib_resolve_grh(ibdev, ah_attr, mac, &is_mcast, ah_attr->port_num);
-	if (err)
-		return ERR_PTR(err);
-
-	memcpy(ah->av.eth.mac, mac, 6);
-	err = ib_get_cached_gid(pd->device, ah_attr->port_num, ah_attr->grh.sgid_index, &sgid);
-	if (err)
-		return ERR_PTR(err);
-	vlan_tag = rdma_get_vlan_id(&sgid);
+	memcpy(&in6, ah_attr->grh.dgid.raw, sizeof(in6));
+	if (rdma_is_multicast_addr(&in6)) {
+		is_mcast = 1;
+		rdma_get_mcast_mac(&in6, ah->av.eth.mac);
+	} else {
+		memcpy(ah->av.eth.mac, ah_attr->dmac, ETH_ALEN);
+	}
+	vlan_tag = ah_attr->vlan_id;
 	if (vlan_tag < 0x1000)
 		vlan_tag |= (ah_attr->sl & 7) << 13;
 	ah->av.eth.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index d5e60f4..5f6113b 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -793,6 +793,15 @@ repoll:
 			wc->sl  = be16_to_cpu(cqe->sl_vid) >> 13;
 		else
 			wc->sl  = be16_to_cpu(cqe->sl_vid) >> 12;
+		if (be32_to_cpu(cqe->vlan_my_qpn) & MLX4_CQE_VLAN_PRESENT_MASK) {
+			wc->vlan_id = be16_to_cpu(cqe->sl_vid) &
+				MLX4_CQE_VID_MASK;
+		} else {
+			wc->vlan_id = 0xffff;
+		}
+		wc->wc_flags |= IB_WC_WITH_VLAN;
+		memcpy(wc->smac, cqe->smac, ETH_ALEN);
+		wc->wc_flags |= IB_WC_WITH_SMAC;
 	}
 
 	return 0;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 133f41f..c06f571 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -678,9 +678,6 @@ int __mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index,
 int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
 			union ib_gid *gid, int netw_view);
 
-int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr,
-			u8 *mac, int *is_mcast, u8 port);
-
 static inline bool mlx4_ib_ah_grh_present(struct mlx4_ib_ah *ah)
 {
 	u8 port = be32_to_cpu(ah->av.ib.port_pd) >> 24 & 3;
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index da6f5fa..e0c2186 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -90,6 +90,21 @@ enum {
 	MLX4_RAW_QP_MSGMAX	= 31,
 };
 
+#ifndef ETH_ALEN
+#define ETH_ALEN        6
+#endif
+static inline u64 mlx4_mac_to_u64(u8 *addr)
+{
+	u64 mac = 0;
+	int i;
+
+	for (i = 0; i < ETH_ALEN; i++) {
+		mac <<= 8;
+		mac |= addr[i];
+	}
+	return mac;
+}
+
 static const __be32 mlx4_ib_opcode[] = {
 	[IB_WR_SEND]				= cpu_to_be32(MLX4_OPCODE_SEND),
 	[IB_WR_LSO]				= cpu_to_be32(MLX4_OPCODE_LSO),
@@ -1144,16 +1159,15 @@ static void mlx4_set_sched(struct mlx4_qp_path *path, u8 port)
 	path->sched_queue = (path->sched_queue & 0xbf) | ((port - 1) << 6);
 }
 
-static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
-			 struct mlx4_qp_path *path, u8 port)
+static int _mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
+			  u64 smac, u16 vlan_tag, struct mlx4_qp_path *path,
+			  u8 port)
 {
-	int err;
 	int is_eth = rdma_port_get_link_layer(&dev->ib_dev, port) ==
 		IB_LINK_LAYER_ETHERNET;
-	u8 mac[6];
-	int is_mcast;
-	u16 vlan_tag;
 	int vidx;
+	int smac_index;
+
 
 	path->grh_mylmc     = ah->src_path_bits & 0x7f;
 	path->rlid	    = cpu_to_be16(ah->dlid);
@@ -1188,22 +1202,27 @@ static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 		if (!(ah->ah_flags & IB_AH_GRH))
 			return -1;
 
-		err = mlx4_ib_resolve_grh(dev, ah, mac, &is_mcast, port);
-		if (err)
-			return err;
-
-		memcpy(path->dmac, mac, 6);
+		memcpy(path->dmac, ah->dmac, ETH_ALEN);
 		path->ackto = MLX4_IB_LINK_TYPE_ETH;
-		/* use index 0 into MAC table for IBoE */
-		path->grh_mylmc &= 0x80;
+		/* find the index  into MAC table for IBoE */
+		if (!is_zero_ether_addr((const u8 *)&smac)) {
+			if (mlx4_find_cached_mac(dev->dev, port, smac,
+						 &smac_index))
+				return -ENOENT;
+		} else {
+			smac_index = 0;
+		}
+
+		path->grh_mylmc &= 0x80 | smac_index;
 
-		vlan_tag = rdma_get_vlan_id(&dev->iboe.gid_table[port - 1][ah->grh.sgid_index]);
+		path->feup |= MLX4_FEUP_FORCE_ETH_UP;
 		if (vlan_tag < 0x1000) {
 			if (mlx4_find_cached_vlan(dev->dev, port, vlan_tag, &vidx))
 				return -ENOENT;
 
 			path->vlan_index = vidx;
 			path->fl = 1 << 6;
+			path->feup |= MLX4_FVL_FORCE_ETH_VLAN;
 		}
 	} else
 		path->sched_queue = MLX4_IB_DEFAULT_SCHED_QUEUE |
@@ -1212,6 +1231,28 @@ static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 	return 0;
 }
 
+static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_qp_attr *qp,
+			 enum ib_qp_attr_mask qp_attr_mask,
+			 struct mlx4_qp_path *path, u8 port)
+{
+	return _mlx4_set_path(dev, &qp->ah_attr,
+			      mlx4_mac_to_u64((u8 *)qp->smac),
+			      (qp_attr_mask & IB_QP_VID) ? qp->vlan_id : 0xffff,
+			      path, port);
+}
+
+static int mlx4_set_alt_path(struct mlx4_ib_dev *dev,
+			     const struct ib_qp_attr *qp,
+			     enum ib_qp_attr_mask qp_attr_mask,
+			     struct mlx4_qp_path *path, u8 port)
+{
+	return _mlx4_set_path(dev, &qp->alt_ah_attr,
+			      mlx4_mac_to_u64((u8 *)qp->alt_smac),
+			      (qp_attr_mask & IB_QP_ALT_VID) ?
+			      qp->alt_vlan_id : 0xffff,
+			      path, port);
+}
+
 static void update_mcg_macs(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp)
 {
 	struct mlx4_ib_gid_entry *ge, *tmp;
@@ -1329,7 +1370,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	}
 
 	if (attr_mask & IB_QP_AV) {
-		if (mlx4_set_path(dev, &attr->ah_attr, &context->pri_path,
+		if (mlx4_set_path(dev, attr, attr_mask, &context->pri_path,
 				  attr_mask & IB_QP_PORT ?
 				  attr->port_num : qp->port))
 			goto out;
@@ -1352,8 +1393,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 		    dev->dev->caps.pkey_table_len[attr->alt_port_num])
 			goto out;
 
-		if (mlx4_set_path(dev, &attr->alt_ah_attr, &context->alt_path,
-				  attr->alt_port_num))
+		if (mlx4_set_alt_path(dev, attr, attr_mask, &context->alt_path,
+				      attr->alt_port_num))
 			goto out;
 
 		context->alt_path.pkey_index = attr->alt_pkey_index;
@@ -1464,6 +1505,17 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 		context->pri_path.ackto = (context->pri_path.ackto & 0xf8) |
 					MLX4_IB_LINK_TYPE_ETH;
 
+	if (ibqp->qp_type == IB_QPT_UD && (new_state == IB_QPS_RTR)) {
+		int is_eth = rdma_port_get_link_layer(
+				&dev->ib_dev, qp->port) ==
+				IB_LINK_LAYER_ETHERNET;
+		if (is_eth) {
+			context->pri_path.ackto = MLX4_IB_LINK_TYPE_ETH;
+			optpar |= MLX4_QP_OPTPAR_PRIMARY_ADDR_PATH;
+		}
+	}
+
+
 	if (cur_state == IB_QPS_RTS && new_state == IB_QPS_SQD	&&
 	    attr_mask & IB_QP_EN_SQD_ASYNC_NOTIFY && attr->en_sqd_async_notify)
 		sqd_event = 1;
@@ -1561,18 +1613,21 @@ int mlx4_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	struct mlx4_ib_qp *qp = to_mqp(ibqp);
 	enum ib_qp_state cur_state, new_state;
 	int err = -EINVAL;
-	int p = attr_mask & IB_QP_PORT ? attr->port_num : qp->port;
+	int ll;
 	mutex_lock(&qp->mutex);
 
 	cur_state = attr_mask & IB_QP_CUR_STATE ? attr->cur_qp_state : qp->state;
 	new_state = attr_mask & IB_QP_STATE ? attr->qp_state : cur_state;
 
-	if (cur_state == new_state && cur_state == IB_QPS_RESET)
-		p = IB_LINK_LAYER_UNSPECIFIED;
+	if (cur_state == new_state && cur_state == IB_QPS_RESET) {
+		ll = IB_LINK_LAYER_UNSPECIFIED;
+	} else {
+		int port = attr_mask & IB_QP_PORT ? attr->port_num : qp->port;
+		ll = rdma_port_get_link_layer(&dev->ib_dev, port);
+	}
 
 	if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type,
-				attr_mask,
-				rdma_port_get_link_layer(&dev->ib_dev, p))) {
+				attr_mask, ll)) {
 		pr_debug("qpn 0x%x: invalid attribute mask specified "
 			 "for transition %d to %d. qp_type %d,"
 			 " attr_mask 0x%x\n",
@@ -1789,8 +1844,10 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr,
 				return err;
 		}
 
-		vlan = rdma_get_vlan_id(&sgid);
-		is_vlan = vlan < 0x1000;
+		if (ah->av.eth.vlan != 0xffff) {
+			vlan = be16_to_cpu(ah->av.eth.vlan) & 0x0fff;
+			is_vlan = 1;
+		}
 	}
 	ib_ud_header_init(send_size, !is_eth, is_eth, is_vlan, is_grh, 0, &sqp->ud_header);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c b/drivers/net/ethernet/mellanox/mlx4/port.c
index 946e0af..fbe76f9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -123,6 +123,26 @@ static int mlx4_set_port_mac_table(struct mlx4_dev *dev, u8 port,
 	return err;
 }
 
+int mlx4_find_cached_mac(struct mlx4_dev *dev, u8 port, u64 mac, int *idx)
+{
+	struct mlx4_port_info *info = &mlx4_priv(dev)->port[port];
+	struct mlx4_mac_table *table = &info->mac_table;
+	int i;
+
+	for (i = 0; i < MLX4_MAX_MAC_NUM; i++) {
+		if (!table->refs[i])
+			continue;
+
+		if (mac == (MLX4_MAC_MASK & be64_to_cpu(table->entries[i]))) {
+			*idx = i;
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(mlx4_find_cached_mac);
+
 int __mlx4_register_mac(struct mlx4_dev *dev, u8 port, u64 mac)
 {
 	struct mlx4_port_info *info = &mlx4_priv(dev)->port[port];
diff --git a/include/linux/mlx4/cq.h b/include/linux/mlx4/cq.h
index 98fa492..e186299 100644
--- a/include/linux/mlx4/cq.h
+++ b/include/linux/mlx4/cq.h
@@ -34,6 +34,7 @@
 #define MLX4_CQ_H
 
 #include <linux/types.h>
+#include <uapi/linux/if_ether.h>
 
 #include <linux/mlx4/device.h>
 #include <linux/mlx4/doorbell.h>
@@ -43,10 +44,15 @@ struct mlx4_cqe {
 	__be32			immed_rss_invalid;
 	__be32			g_mlpath_rqpn;
 	__be16			sl_vid;
-	__be16			rlid;
-	__be16			status;
-	u8			ipv6_ext_mask;
-	u8			badfcs_enc;
+	union {
+		struct {
+			__be16	rlid;
+			__be16  status;
+			u8      ipv6_ext_mask;
+			u8      badfcs_enc;
+		};
+		u8  smac[ETH_ALEN];
+	};
 	__be32			byte_cnt;
 	__be16			wqe_index;
 	__be16			checksum;
@@ -83,6 +89,7 @@ struct mlx4_ts_cqe {
 enum {
 	MLX4_CQE_VLAN_PRESENT_MASK	= 1 << 29,
 	MLX4_CQE_QPN_MASK		= 0xffffff,
+	MLX4_CQE_VID_MASK		= 0xfff,
 };
 
 enum {
-- 
1.7.1


* [PATCH V4 5/9] IB/ocrdma: Populate GID table with IP based gids
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2013-09-10 14:41   ` [PATCH V4 4/9] IB/mlx4: Handle Ethernet L2 parameters for IP based GID addressing Or Gerlitz
@ 2013-09-10 14:41   ` Or Gerlitz
  2013-09-10 14:41   ` [PATCH V4 6/9] IB/ocrdma: Handle Ethernet L2 parameters for IP based GID addressing Or Gerlitz
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Naresh Gottumukkala, Or Gerlitz

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch is similar in spirit to the "IB/mlx4: Use RoCE IP based GIDs
in the port GID table" patch from this series.

Changes to the host's inet4 and inet6 addresses are monitored, and if an
address is associated with an ocrdma device, a gid is added to or deleted
from the device's gid table. The gid format is either an IPv4-to-IPv6
mapped address or the plain IPv6 address.
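
For reference, a short sketch (assuming the standard RFC 4291 v4-mapped
layout, which is what the ipv6_addr_set_v4mapped() call in this patch
produces) of how an IPv4 address lands in the 16-byte gid:

	/* sketch: 192.0.2.1 becomes the gid ::ffff:192.0.2.1 */
	static void sketch_v4_mapped_gid(__be32 ipv4, union ib_gid *gid)
	{
		memset(gid->raw, 0, 10);	 /* bytes 0..9  are zero */
		gid->raw[10] = 0xff;		 /* bytes 10..11 are ff  */
		gid->raw[11] = 0xff;
		memcpy(&gid->raw[12], &ipv4, 4); /* IPv4 in bytes 12..15 */
	}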

Cc: Naresh Gottumukkala <bgottumukkala-laKkSmNT4hbQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |  138 ++++++++-------------------
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |    2 +-
 2 files changed, 41 insertions(+), 99 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 56e0049..c3fca6e 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -67,46 +67,24 @@ void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid)
 	guid[7] = mac_addr[5];
 }
 
-static void ocrdma_build_sgid_mac(union ib_gid *sgid, unsigned char *mac_addr,
-				  bool is_vlan, u16 vlan_id)
-{
-	sgid->global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
-	sgid->raw[8] = mac_addr[0] ^ 2;
-	sgid->raw[9] = mac_addr[1];
-	sgid->raw[10] = mac_addr[2];
-	if (is_vlan) {
-		sgid->raw[11] = vlan_id >> 8;
-		sgid->raw[12] = vlan_id & 0xff;
-	} else {
-		sgid->raw[11] = 0xff;
-		sgid->raw[12] = 0xfe;
-	}
-	sgid->raw[13] = mac_addr[3];
-	sgid->raw[14] = mac_addr[4];
-	sgid->raw[15] = mac_addr[5];
-}
-
-static bool ocrdma_add_sgid(struct ocrdma_dev *dev, unsigned char *mac_addr,
-			    bool is_vlan, u16 vlan_id)
+static bool ocrdma_add_sgid(struct ocrdma_dev *dev, union ib_gid *new_sgid)
 {
 	int i;
-	union ib_gid new_sgid;
 	unsigned long flags;
 
 	memset(&ocrdma_zero_sgid, 0, sizeof(union ib_gid));
 
-	ocrdma_build_sgid_mac(&new_sgid, mac_addr, is_vlan, vlan_id);
 
 	spin_lock_irqsave(&dev->sgid_lock, flags);
 	for (i = 0; i < OCRDMA_MAX_SGID; i++) {
 		if (!memcmp(&dev->sgid_tbl[i], &ocrdma_zero_sgid,
 			    sizeof(union ib_gid))) {
 			/* found free entry */
-			memcpy(&dev->sgid_tbl[i], &new_sgid,
+			memcpy(&dev->sgid_tbl[i], new_sgid,
 			       sizeof(union ib_gid));
 			spin_unlock_irqrestore(&dev->sgid_lock, flags);
 			return true;
-		} else if (!memcmp(&dev->sgid_tbl[i], &new_sgid,
+		} else if (!memcmp(&dev->sgid_tbl[i], new_sgid,
 				   sizeof(union ib_gid))) {
 			/* entry already present, no addition is required. */
 			spin_unlock_irqrestore(&dev->sgid_lock, flags);
@@ -117,20 +95,17 @@ static bool ocrdma_add_sgid(struct ocrdma_dev *dev, unsigned char *mac_addr,
 	return false;
 }
 
-static bool ocrdma_del_sgid(struct ocrdma_dev *dev, unsigned char *mac_addr,
-			    bool is_vlan, u16 vlan_id)
+static bool ocrdma_del_sgid(struct ocrdma_dev *dev, union ib_gid *sgid)
 {
 	int found = false;
 	int i;
-	union ib_gid sgid;
 	unsigned long flags;
 
-	ocrdma_build_sgid_mac(&sgid, mac_addr, is_vlan, vlan_id);
 
 	spin_lock_irqsave(&dev->sgid_lock, flags);
 	/* first is default sgid, which cannot be deleted. */
 	for (i = 1; i < OCRDMA_MAX_SGID; i++) {
-		if (!memcmp(&dev->sgid_tbl[i], &sgid, sizeof(union ib_gid))) {
+		if (!memcmp(&dev->sgid_tbl[i], sgid, sizeof(union ib_gid))) {
 			/* found matching entry */
 			memset(&dev->sgid_tbl[i], 0, sizeof(union ib_gid));
 			found = true;
@@ -141,75 +116,18 @@ static bool ocrdma_del_sgid(struct ocrdma_dev *dev, unsigned char *mac_addr,
 	return found;
 }
 
-static void ocrdma_add_default_sgid(struct ocrdma_dev *dev)
-{
-	/* GID Index 0 - Invariant manufacturer-assigned EUI-64 */
-	union ib_gid *sgid = &dev->sgid_tbl[0];
-
-	sgid->global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
-	ocrdma_get_guid(dev, &sgid->raw[8]);
-}
-
-#if IS_ENABLED(CONFIG_VLAN_8021Q)
-static void ocrdma_add_vlan_sgids(struct ocrdma_dev *dev)
-{
-	struct net_device *netdev, *tmp;
-	u16 vlan_id;
-	bool is_vlan;
-
-	netdev = dev->nic_info.netdev;
-
-	rcu_read_lock();
-	for_each_netdev_rcu(&init_net, tmp) {
-		if (netdev == tmp || vlan_dev_real_dev(tmp) == netdev) {
-			if (!netif_running(tmp) || !netif_oper_up(tmp))
-				continue;
-			if (netdev != tmp) {
-				vlan_id = vlan_dev_vlan_id(tmp);
-				is_vlan = true;
-			} else {
-				is_vlan = false;
-				vlan_id = 0;
-				tmp = netdev;
-			}
-			ocrdma_add_sgid(dev, tmp->dev_addr, is_vlan, vlan_id);
-		}
-	}
-	rcu_read_unlock();
-}
-#else
-static void ocrdma_add_vlan_sgids(struct ocrdma_dev *dev)
-{
-
-}
-#endif /* VLAN */
-
-static int ocrdma_build_sgid_tbl(struct ocrdma_dev *dev)
+static int ocrdma_addr_event(unsigned long event, struct net_device *netdev,
+			     union ib_gid *gid)
 {
-	ocrdma_add_default_sgid(dev);
-	ocrdma_add_vlan_sgids(dev);
-	return 0;
-}
-
-#if IS_ENABLED(CONFIG_IPV6)
-
-static int ocrdma_inet6addr_event(struct notifier_block *notifier,
-				  unsigned long event, void *ptr)
-{
-	struct inet6_ifaddr *ifa = (struct inet6_ifaddr *)ptr;
-	struct net_device *netdev = ifa->idev->dev;
 	struct ib_event gid_event;
 	struct ocrdma_dev *dev;
 	bool found = false;
 	bool updated = false;
 	bool is_vlan = false;
-	u16 vid = 0;
 
 	is_vlan = netdev->priv_flags & IFF_802_1Q_VLAN;
-	if (is_vlan) {
-		vid = vlan_dev_vlan_id(netdev);
+	if (is_vlan)
 		netdev = vlan_dev_real_dev(netdev);
-	}
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(dev, &ocrdma_dev_list, entry) {
@@ -222,16 +140,14 @@ static int ocrdma_inet6addr_event(struct notifier_block *notifier,
 
 	if (!found)
 		return NOTIFY_DONE;
-	if (!rdma_link_local_addr((struct in6_addr *)&ifa->addr))
-		return NOTIFY_DONE;
 
 	mutex_lock(&dev->dev_lock);
 	switch (event) {
 	case NETDEV_UP:
-		updated = ocrdma_add_sgid(dev, netdev->dev_addr, is_vlan, vid);
+		updated = ocrdma_add_sgid(dev, gid);
 		break;
 	case NETDEV_DOWN:
-		updated = ocrdma_del_sgid(dev, netdev->dev_addr, is_vlan, vid);
+		updated = ocrdma_del_sgid(dev, gid);
 		break;
 	default:
 		break;
@@ -247,6 +163,32 @@ static int ocrdma_inet6addr_event(struct notifier_block *notifier,
 	return NOTIFY_OK;
 }
 
+static int ocrdma_inetaddr_event(struct notifier_block *notifier,
+				  unsigned long event, void *ptr)
+{
+	struct in_ifaddr *ifa = ptr;
+	union ib_gid gid;
+	struct net_device *netdev = ifa->ifa_dev->dev;
+
+	ipv6_addr_set_v4mapped(ifa->ifa_address, (struct in6_addr *)&gid);
+	return ocrdma_addr_event(event, netdev, &gid);
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+
+static int ocrdma_inet6addr_event(struct notifier_block *notifier,
+				  unsigned long event, void *ptr)
+{
+	struct inet6_ifaddr *ifa = (struct inet6_ifaddr *)ptr;
+	union  ib_gid *gid = (union ib_gid *)&ifa->addr;
+	struct net_device *netdev = ifa->idev->dev;
+	return ocrdma_addr_event(event, netdev, gid);
+}
+
+static struct notifier_block ocrdma_inetaddr_notifier = {
+	.notifier_call = ocrdma_inetaddr_event
+};
+
 static struct notifier_block ocrdma_inet6addr_notifier = {
 	.notifier_call = ocrdma_inet6addr_event
 };
@@ -423,10 +365,6 @@ static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info)
 	if (status)
 		goto alloc_err;
 
-	status = ocrdma_build_sgid_tbl(dev);
-	if (status)
-		goto alloc_err;
-
 	status = ocrdma_register_device(dev);
 	if (status)
 		goto alloc_err;
@@ -552,6 +490,10 @@ static int __init ocrdma_init_module(void)
 {
 	int status;
 
+	status = register_inetaddr_notifier(&ocrdma_inetaddr_notifier);
+	if (status)
+		return status;
+
 #if IS_ENABLED(CONFIG_IPV6)
 	status = register_inet6addr_notifier(&ocrdma_inet6addr_notifier);
 	if (status)
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 0607ca7..b95253e 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -1327,7 +1327,7 @@ int ocrdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	spin_unlock_irqrestore(&qp->q_lock, flags);
 
 	if (!ib_modify_qp_is_ok(old_qps, new_qps, ibqp->qp_type, attr_mask,
-				IB_LINK_LAYER_UNSPECIFIED)) {
+				IB_LINK_LAYER_ETHERNET)) {
 		pr_err("%s(%d) invalid attribute mask=0x%x specified for\n"
 		       "qpn=0x%x of type=0x%x old_qps=0x%x, new_qps=0x%x\n",
 		       __func__, dev->id, attr_mask, qp->id, ibqp->qp_type,
-- 
1.7.1


* [PATCH V4 6/9] IB/ocrdma: Handle Ethernet L2 parameters for IP based GID addressing
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2013-09-10 14:41   ` [PATCH V4 5/9] IB/ocrdma: Populate GID table with IP based gids Or Gerlitz
@ 2013-09-10 14:41   ` Or Gerlitz
  2013-09-10 14:41   ` [PATCH V4 7/9] IB/core: Add RoCE IP based addressing extensions for uverbs Or Gerlitz
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Naresh Gottumukkala, Or Gerlitz

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch is similar in spirit to the "IB/mlx4: Handle Ethernet L2 parameters
for IP based GID addressing" patch from this series. It handles the fact that
IP based RoCE gids don't store the Ethernet L2 parameters, MAC and VLAN.

When building an address handle, instead of parsing the dgid to
get the MAC and VLAN, take them from the address handle attributes.
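
For the multicast case the dmac is still derived from the gid itself. A
sketch of the mapping rdma_get_mcast_mac() performs for IPv6 multicast
group addresses (33:33 prefix followed by the low 32 bits of the group,
per RFC 2464):

	/* sketch: IPv6 multicast group address -> Ethernet multicast MAC */
	static void sketch_mcast_mac(const struct in6_addr *addr, u8 *mac)
	{
		mac[0] = 0x33;
		mac[1] = 0x33;
		memcpy(&mac[2], &addr->s6_addr[12], 4); /* low 32 bits */
	}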

Cc: Naresh Gottumukkala <bgottumukkala-laKkSmNT4hbQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/ocrdma/ocrdma.h    |   12 ++++++++++++
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |    5 +++--
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c |   21 ++-------------------
 drivers/infiniband/hw/ocrdma/ocrdma_hw.h |    1 -
 4 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h b/drivers/infiniband/hw/ocrdma/ocrdma.h
index adc11d1..d544292 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -422,5 +422,17 @@ static inline int is_cqe_wr_imm(struct ocrdma_cqe *cqe)
 		OCRDMA_CQE_WRITE_IMM) ? 1 : 0;
 }
 
+static inline int ocrdma_resolve_dmac(struct ocrdma_dev *dev,
+		struct ib_ah_attr *ah_attr, u8 *mac_addr)
+{
+	struct in6_addr in6;
+
+	memcpy(&in6, ah_attr->grh.dgid.raw, sizeof(in6));
+	if (rdma_is_multicast_addr(&in6))
+		rdma_get_mcast_mac(&in6, mac_addr);
+	else
+		memcpy(mac_addr, ah_attr->dmac, ETH_ALEN);
+	return 0;
+}
 
 #endif
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
index ee499d9..bbb7962 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
@@ -49,7 +49,7 @@ static inline int set_av_attr(struct ocrdma_dev *dev, struct ocrdma_ah *ah,
 
 	ah->sgid_index = attr->grh.sgid_index;
 
-	vlan_tag = rdma_get_vlan_id(&attr->grh.dgid);
+	vlan_tag = attr->vlan_id;
 	if (!vlan_tag || (vlan_tag > 0xFFF))
 		vlan_tag = dev->pvid;
 	if (vlan_tag && (vlan_tag < 0x1000)) {
@@ -64,7 +64,8 @@ static inline int set_av_attr(struct ocrdma_dev *dev, struct ocrdma_ah *ah,
 		eth_sz = sizeof(struct ocrdma_eth_basic);
 	}
 	memcpy(&eth.smac[0], &dev->nic_info.mac_addr[0], ETH_ALEN);
-	status = ocrdma_resolve_dgid(dev, &attr->grh.dgid, &eth.dmac[0]);
+	memcpy(&eth.dmac[0], attr->dmac, ETH_ALEN);
+	status = ocrdma_resolve_dmac(dev, attr, &eth.dmac[0]);
 	if (status)
 		return status;
 	status = ocrdma_query_gid(&dev->ibdev, 1, attr->grh.sgid_index,
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index 4ed8235..69c82b1 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -2076,23 +2076,6 @@ mbx_err:
 	return status;
 }
 
-int ocrdma_resolve_dgid(struct ocrdma_dev *dev, union ib_gid *dgid,
-			u8 *mac_addr)
-{
-	struct in6_addr in6;
-
-	memcpy(&in6, dgid, sizeof in6);
-	if (rdma_is_multicast_addr(&in6)) {
-		rdma_get_mcast_mac(&in6, mac_addr);
-	} else if (rdma_link_local_addr(&in6)) {
-		rdma_get_ll_mac(&in6, mac_addr);
-	} else {
-		pr_err("%s() fail to resolve mac_addr.\n", __func__);
-		return -EINVAL;
-	}
-	return 0;
-}
-
 static int ocrdma_set_av_params(struct ocrdma_qp *qp,
 				struct ocrdma_modify_qp *cmd,
 				struct ib_qp_attr *attrs)
@@ -2126,14 +2109,14 @@ static int ocrdma_set_av_params(struct ocrdma_qp *qp,
 
 	qp->sgid_idx = ah_attr->grh.sgid_index;
 	memcpy(&cmd->params.sgid[0], &sgid.raw[0], sizeof(cmd->params.sgid));
-	ocrdma_resolve_dgid(qp->dev, &ah_attr->grh.dgid, &mac_addr[0]);
+	ocrdma_resolve_dmac(qp->dev, ah_attr, &mac_addr[0]);
 	cmd->params.dmac_b0_to_b3 = mac_addr[0] | (mac_addr[1] << 8) |
 				(mac_addr[2] << 16) | (mac_addr[3] << 24);
 	/* convert them to LE format. */
 	ocrdma_cpu_to_le32(&cmd->params.dgid[0], sizeof(cmd->params.dgid));
 	ocrdma_cpu_to_le32(&cmd->params.sgid[0], sizeof(cmd->params.sgid));
 	cmd->params.vlan_dmac_b4_to_b5 = mac_addr[4] | (mac_addr[5] << 8);
-	vlan_id = rdma_get_vlan_id(&sgid);
+	vlan_id = ah_attr->vlan_id;
 	if (vlan_id && (vlan_id < 0x1000)) {
 		cmd->params.vlan_dmac_b4_to_b5 |=
 		    vlan_id << OCRDMA_QP_PARAMS_VLAN_SHIFT;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.h b/drivers/infiniband/hw/ocrdma/ocrdma_hw.h
index f2a89d4..82fe332 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.h
@@ -94,7 +94,6 @@ void ocrdma_ring_cq_db(struct ocrdma_dev *, u16 cq_id, bool armed,
 int ocrdma_mbx_get_link_speed(struct ocrdma_dev *dev, u8 *lnk_speed);
 int ocrdma_query_config(struct ocrdma_dev *,
 			struct ocrdma_mbx_query_config *config);
-int ocrdma_resolve_dgid(struct ocrdma_dev *, union ib_gid *dgid, u8 *mac_addr);
 
 int ocrdma_mbx_alloc_pd(struct ocrdma_dev *, struct ocrdma_pd *);
 int ocrdma_mbx_dealloc_pd(struct ocrdma_dev *, struct ocrdma_pd *);
-- 
1.7.1


* [PATCH V4 7/9] IB/core: Add RoCE IP based addressing extensions for uverbs
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (5 preceding siblings ...)
  2013-09-10 14:41   ` [PATCH V4 6/9] IB/ocrdma: Handle Ethernet L2 parameters for IP based GID addressing Or Gerlitz
@ 2013-09-10 14:41   ` Or Gerlitz
       [not found]     ` <1378824099-22150-8-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-09-10 14:41   ` [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm Or Gerlitz
  2013-09-10 14:41   ` [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX Or Gerlitz
  8 siblings, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add uverbs support for RoCE (IBoE) IP based addressing extensions
towards user space libraries.

Under IP based GID addressing, for RC QPs, the QP attributes should contain
the Ethernet L2 destination. Until now, indicating the GID was sufficient.
When IP addresses are encoded in gids, the QP attributes should contain an
extended destination that indicates the vlan and dmac as well. This is done
via a new struct ib_uverbs_qp_dest_ex. This new structure is embedded in a
new struct ib_uverbs_modify_qp_ex that is used by the MODIFY_QP_EX command.
To make those changes seamless, the extended structures were added at the
bottom of the current structures. The new command also gets
smac/alt_smac/vlan_id/alt_vlan_id; those parameters are fixed in the QP
context in order to enhance security. The extended dest is passed as a
pointer rather than as an inline fixed field for the sake of future
scalability.

Also, when the gid encodes an IP address, the AH attributes should contain
the dmac as well. Therefore, ib_uverbs_create_ah was extended to contain those fields.
When creating an AH, the user indicates the exact L2 ethernet destination
parameters. This is done by a new CREATE_AH_EX command that uses a new struct
ib_uverbs_create_ah_ex.

struct ib_user_path_rec was extended too, to contain source and destination
MAC and VLAN ID; this structure is used by the rdma_ucm driver.
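
The extension scheme follows the usual comp_mask convention: the extended
command keeps the legacy layout and adds a comp_mask whose bits state which
of the trailing fields user space actually filled in. A condensed sketch
(the struct, field and flag names here are illustrative, not the exact uapi):

	/* sketch: comp_mask gating of an optional extended attribute */
	#define SKETCH_EX_SMAC	(1 << 0)

	struct sketch_modify_qp_ex {
		/* ... the legacy ib_uverbs_modify_qp fields come first ... */
		__u32 comp_mask;	/* which extended fields are valid */
		__u8  smac[6];
	};

	static void sketch_apply_smac(const struct sketch_modify_qp_ex *cmd,
				      struct ib_qp_attr *attr)
	{
		if (cmd->comp_mask & SKETCH_EX_SMAC)
			memcpy(attr->smac, cmd->smac, sizeof(cmd->smac));
		else
			memset(attr->smac, 0, sizeof(attr->smac));
	}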

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/uverbs.h          |    2 +
 drivers/infiniband/core/uverbs_cmd.c      |  359 ++++++++++++++++++++++-------
 drivers/infiniband/core/uverbs_main.c     |    4 +-
 drivers/infiniband/core/uverbs_marshall.c |  128 ++++++++++-
 include/rdma/ib_marshall.h                |   12 +
 include/uapi/rdma/ib_user_sa.h            |   34 +++-
 include/uapi/rdma/ib_user_verbs.h         |  160 +++++++++++++-
 7 files changed, 608 insertions(+), 91 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index d040b87..b0fcb0b 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -202,11 +202,13 @@ IB_UVERBS_DECLARE_CMD(create_qp);
 IB_UVERBS_DECLARE_CMD(open_qp);
 IB_UVERBS_DECLARE_CMD(query_qp);
 IB_UVERBS_DECLARE_CMD(modify_qp);
+IB_UVERBS_DECLARE_CMD(modify_qp_ex);
 IB_UVERBS_DECLARE_CMD(destroy_qp);
 IB_UVERBS_DECLARE_CMD(post_send);
 IB_UVERBS_DECLARE_CMD(post_recv);
 IB_UVERBS_DECLARE_CMD(post_srq_recv);
 IB_UVERBS_DECLARE_CMD(create_ah);
+IB_UVERBS_DECLARE_CMD(create_ah_ex);
 IB_UVERBS_DECLARE_CMD(destroy_ah);
 IB_UVERBS_DECLARE_CMD(attach_mcast);
 IB_UVERBS_DECLARE_CMD(detach_mcast);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index f2b81b9..9a0c5d7 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1900,6 +1900,60 @@ static int modify_qp_mask(enum ib_qp_type qp_type, int mask)
 	}
 }
 
+static void ib_uverbs_modify_qp_assign(struct ib_uverbs_modify_qp *cmd,
+				       struct ib_qp_attr *attr,
+				       struct ib_uverbs_qp_dest *dest,
+				       struct ib_uverbs_qp_dest *alt_dest) {
+	attr->qp_state		  = cmd->qp_state;
+	attr->cur_qp_state	  = cmd->cur_qp_state;
+	attr->path_mtu		  = cmd->path_mtu;
+	attr->path_mig_state	  = cmd->path_mig_state;
+	attr->qkey		  = cmd->qkey;
+	attr->rq_psn		  = cmd->rq_psn;
+	attr->sq_psn		  = cmd->sq_psn;
+	attr->dest_qp_num	  = cmd->dest_qp_num;
+	attr->qp_access_flags	  = cmd->qp_access_flags;
+	attr->pkey_index	  = cmd->pkey_index;
+	attr->alt_pkey_index	  = cmd->alt_pkey_index;
+	attr->en_sqd_async_notify = cmd->en_sqd_async_notify;
+	attr->max_rd_atomic	  = cmd->max_rd_atomic;
+	attr->max_dest_rd_atomic  = cmd->max_dest_rd_atomic;
+	attr->min_rnr_timer	  = cmd->min_rnr_timer;
+	attr->port_num		  = cmd->port_num;
+	attr->timeout		  = cmd->timeout;
+	attr->retry_cnt		  = cmd->retry_cnt;
+	attr->rnr_retry		  = cmd->rnr_retry;
+	attr->alt_port_num	  = cmd->alt_port_num;
+	attr->alt_timeout	  = cmd->alt_timeout;
+
+	memcpy(attr->ah_attr.grh.dgid.raw, dest->dgid, 16);
+	attr->ah_attr.grh.flow_label        = dest->flow_label;
+	attr->ah_attr.grh.sgid_index        = dest->sgid_index;
+	attr->ah_attr.grh.hop_limit         = dest->hop_limit;
+	attr->ah_attr.grh.traffic_class     = dest->traffic_class;
+	attr->ah_attr.dlid		    = dest->dlid;
+	attr->ah_attr.sl		    = dest->sl;
+	attr->ah_attr.src_path_bits	    = dest->src_path_bits;
+	attr->ah_attr.static_rate	    = dest->static_rate;
+	attr->ah_attr.ah_flags		    = dest->is_global ?
+					      IB_AH_GRH : 0;
+	attr->ah_attr.port_num		    = dest->port_num;
+
+	memcpy(attr->alt_ah_attr.grh.dgid.raw, alt_dest->dgid, 16);
+	attr->alt_ah_attr.grh.flow_label    = alt_dest->flow_label;
+	attr->alt_ah_attr.grh.sgid_index    = alt_dest->sgid_index;
+	attr->alt_ah_attr.grh.hop_limit     = alt_dest->hop_limit;
+	attr->alt_ah_attr.grh.traffic_class = alt_dest->traffic_class;
+	attr->alt_ah_attr.dlid		    = alt_dest->dlid;
+	attr->alt_ah_attr.sl		    = alt_dest->sl;
+	attr->alt_ah_attr.src_path_bits     = alt_dest->src_path_bits;
+	attr->alt_ah_attr.static_rate       = alt_dest->static_rate;
+	attr->alt_ah_attr.ah_flags	    = alt_dest->is_global
+					      ? IB_AH_GRH : 0;
+	attr->alt_ah_attr.port_num	    = alt_dest->port_num;
+}
+
+
 ssize_t ib_uverbs_modify_qp(struct ib_uverbs_file *file,
 			    const char __user *buf, int in_len,
 			    int out_len)
@@ -1926,51 +1980,13 @@ ssize_t ib_uverbs_modify_qp(struct ib_uverbs_file *file,
 		goto out;
 	}
 
-	attr->qp_state 		  = cmd.qp_state;
-	attr->cur_qp_state 	  = cmd.cur_qp_state;
-	attr->path_mtu 		  = cmd.path_mtu;
-	attr->path_mig_state 	  = cmd.path_mig_state;
-	attr->qkey 		  = cmd.qkey;
-	attr->rq_psn 		  = cmd.rq_psn;
-	attr->sq_psn 		  = cmd.sq_psn;
-	attr->dest_qp_num 	  = cmd.dest_qp_num;
-	attr->qp_access_flags 	  = cmd.qp_access_flags;
-	attr->pkey_index 	  = cmd.pkey_index;
-	attr->alt_pkey_index 	  = cmd.alt_pkey_index;
-	attr->en_sqd_async_notify = cmd.en_sqd_async_notify;
-	attr->max_rd_atomic 	  = cmd.max_rd_atomic;
-	attr->max_dest_rd_atomic  = cmd.max_dest_rd_atomic;
-	attr->min_rnr_timer 	  = cmd.min_rnr_timer;
-	attr->port_num 		  = cmd.port_num;
-	attr->timeout 		  = cmd.timeout;
-	attr->retry_cnt 	  = cmd.retry_cnt;
-	attr->rnr_retry 	  = cmd.rnr_retry;
-	attr->alt_port_num 	  = cmd.alt_port_num;
-	attr->alt_timeout 	  = cmd.alt_timeout;
-
-	memcpy(attr->ah_attr.grh.dgid.raw, cmd.dest.dgid, 16);
-	attr->ah_attr.grh.flow_label        = cmd.dest.flow_label;
-	attr->ah_attr.grh.sgid_index        = cmd.dest.sgid_index;
-	attr->ah_attr.grh.hop_limit         = cmd.dest.hop_limit;
-	attr->ah_attr.grh.traffic_class     = cmd.dest.traffic_class;
-	attr->ah_attr.dlid 	    	    = cmd.dest.dlid;
-	attr->ah_attr.sl   	    	    = cmd.dest.sl;
-	attr->ah_attr.src_path_bits 	    = cmd.dest.src_path_bits;
-	attr->ah_attr.static_rate   	    = cmd.dest.static_rate;
-	attr->ah_attr.ah_flags 	    	    = cmd.dest.is_global ? IB_AH_GRH : 0;
-	attr->ah_attr.port_num 	    	    = cmd.dest.port_num;
-
-	memcpy(attr->alt_ah_attr.grh.dgid.raw, cmd.alt_dest.dgid, 16);
-	attr->alt_ah_attr.grh.flow_label    = cmd.alt_dest.flow_label;
-	attr->alt_ah_attr.grh.sgid_index    = cmd.alt_dest.sgid_index;
-	attr->alt_ah_attr.grh.hop_limit     = cmd.alt_dest.hop_limit;
-	attr->alt_ah_attr.grh.traffic_class = cmd.alt_dest.traffic_class;
-	attr->alt_ah_attr.dlid 	    	    = cmd.alt_dest.dlid;
-	attr->alt_ah_attr.sl   	    	    = cmd.alt_dest.sl;
-	attr->alt_ah_attr.src_path_bits     = cmd.alt_dest.src_path_bits;
-	attr->alt_ah_attr.static_rate       = cmd.alt_dest.static_rate;
-	attr->alt_ah_attr.ah_flags 	    = cmd.alt_dest.is_global ? IB_AH_GRH : 0;
-	attr->alt_ah_attr.port_num 	    = cmd.alt_dest.port_num;
+	ib_uverbs_modify_qp_assign(&cmd, attr, &cmd.dest, &cmd.alt_dest);
+	memset(attr->ah_attr.dmac, 0, sizeof(attr->ah_attr.dmac));
+	attr->vlan_id = 0xFFFF;
+	memset(attr->smac, 0, sizeof(attr->smac));
+	memset(attr->alt_ah_attr.dmac, 0, sizeof(attr->alt_ah_attr.dmac));
+	attr->alt_vlan_id = 0xFFFF;
+	memset(attr->alt_smac, 0, sizeof(attr->alt_smac));
 
 	if (qp->real_qp == qp) {
 		ret = qp->device->modify_qp(qp, attr,
@@ -1992,6 +2008,111 @@ out:
 	return ret;
 }
 
+ssize_t ib_uverbs_modify_qp_ex(struct ib_uverbs_file *file,
+			       const char __user *buf, int in_len,
+			       int out_len)
+{
+	struct ib_uverbs_modify_qp_ex cmd;
+	struct ib_uverbs_qp_dest_ex   dest_ex;
+	struct ib_uverbs_qp_dest_ex   alt_dest_ex;
+	struct ib_uverbs_qp_dest     *dest;
+	struct ib_uverbs_qp_dest     *alt_dest;
+	struct ib_udata               udata;
+	struct ib_qp                 *qp;
+	struct ib_qp_attr	     *attr;
+	int                           ret;
+
+	if (copy_from_user(&cmd, buf, sizeof(cmd)))
+		return -EFAULT;
+
+	INIT_UDATA(&udata, buf + sizeof(cmd), NULL, in_len - sizeof(cmd),
+		   out_len);
+
+	attr = kmalloc(sizeof(*attr), GFP_KERNEL);
+	if (!attr)
+		return -ENOMEM;
+
+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!(cmd.comp_mask & IB_UVERBS_MODIFY_QP_EX_DEST_EX) ||
+	    copy_from_user(&dest_ex, cmd.dest_ex, sizeof(dest_ex)))
+		dest = &cmd.dest;
+	else
+		dest = (struct ib_uverbs_qp_dest *)&dest_ex;
+
+
+	if (!(cmd.comp_mask & IB_UVERBS_MODIFY_QP_EX_ALT_DEST_EX) ||
+	    copy_from_user(&alt_dest_ex, cmd.alt_dest_ex, sizeof(alt_dest_ex)))
+		alt_dest = &cmd.alt_dest;
+	else
+		alt_dest = (struct ib_uverbs_qp_dest *)&alt_dest_ex;
+
+	ib_uverbs_modify_qp_assign((struct ib_uverbs_modify_qp *)((void *)&cmd +
+				   offsetof(
+				   struct ib_uverbs_modify_qp_ex, dest)), attr,
+				   dest,
+				   alt_dest);
+
+	if (cmd.comp_mask &  IB_UVERBS_MODIFY_QP_EX_DEST_EX &&
+	    dest_ex.comp_mask & IB_UVERBS_QP_DEST_ATTR_DMAC)
+		memcpy(attr->ah_attr.dmac, dest_ex.dmac,
+		       sizeof(attr->ah_attr.dmac));
+	else
+		memset(attr->ah_attr.dmac, 0,
+		       sizeof(attr->ah_attr.dmac));
+	if (cmd.comp_mask & IB_UVERBS_MODIFY_QP_EX_VID)
+		attr->vlan_id = cmd.vid;
+	else
+		attr->vlan_id = 0xFFFF;
+	if (cmd.comp_mask & IB_UVERBS_MODIFY_QP_EX_SMAC)
+		memcpy(attr->smac, cmd.smac,
+		       sizeof(attr->smac));
+	else
+		memset(attr->smac, 0,
+		       sizeof(attr->smac));
+	if (cmd.comp_mask &  IB_UVERBS_MODIFY_QP_EX_ALT_DEST_EX &&
+	    alt_dest_ex.comp_mask & IB_UVERBS_QP_DEST_ATTR_DMAC)
+		memcpy(attr->alt_ah_attr.dmac, alt_dest_ex.dmac,
+		       sizeof(attr->alt_ah_attr.dmac));
+	else
+		memset(attr->alt_ah_attr.dmac, 0,
+		       sizeof(attr->alt_ah_attr.dmac));
+	if (cmd.comp_mask & IB_UVERBS_MODIFY_QP_EX_ALT_VID)
+		attr->alt_vlan_id = cmd.alt_vid;
+	else
+		attr->alt_vlan_id = 0xFFFF;
+	if (cmd.comp_mask & IB_UVERBS_MODIFY_QP_EX_ALT_SMAC)
+		memcpy(attr->alt_smac, cmd.alt_smac,
+		       sizeof(attr->alt_smac));
+	else
+		memset(attr->alt_smac, 0,
+		       sizeof(attr->alt_smac));
+
+
+	if (qp->real_qp == qp) {
+		ret = qp->device->modify_qp(qp, attr,
+			modify_qp_mask(qp->qp_type, cmd.attr_mask), &udata);
+	} else {
+		ret = ib_modify_qp(qp, attr,
+				   modify_qp_mask(qp->qp_type, cmd.attr_mask));
+	}
+
+	put_qp_read(qp);
+
+	if (ret)
+		goto out;
+
+	ret = in_len;
+
+out:
+	kfree(attr);
+
+	return ret;
+}
+
 ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 			     const char __user *buf, int in_len,
 			     int out_len)
@@ -2389,48 +2510,46 @@ out:
 	return ret ? ret : in_len;
 }
 
-ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
-			    const char __user *buf, int in_len,
-			    int out_len)
+static struct ib_uobject *ib_uverbs_create_ah_assign(
+		struct ib_uverbs_create_ah_ex *cmd,
+		struct ib_uverbs_ah_attr_ex *src_attr,
+		struct ib_uverbs_file *file)
 {
-	struct ib_uverbs_create_ah	 cmd;
-	struct ib_uverbs_create_ah_resp	 resp;
-	struct ib_uobject		*uobj;
 	struct ib_pd			*pd;
 	struct ib_ah			*ah;
 	struct ib_ah_attr		attr;
-	int ret;
-
-	if (out_len < sizeof resp)
-		return -ENOSPC;
-
-	if (copy_from_user(&cmd, buf, sizeof cmd))
-		return -EFAULT;
+	struct ib_uobject		*uobj;
+	long				ret;
 
-	uobj = kmalloc(sizeof *uobj, GFP_KERNEL);
+	uobj = kmalloc(sizeof(*uobj), GFP_KERNEL);
 	if (!uobj)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
-	init_uobj(uobj, cmd.user_handle, file->ucontext, &ah_lock_class);
+	init_uobj(uobj, cmd->user_handle, file->ucontext, &ah_lock_class);
 	down_write(&uobj->mutex);
 
-	pd = idr_read_pd(cmd.pd_handle, file->ucontext);
+	pd = idr_read_pd(cmd->pd_handle, file->ucontext);
 	if (!pd) {
 		ret = -EINVAL;
 		goto err;
 	}
 
-	attr.dlid 	       = cmd.attr.dlid;
-	attr.sl 	       = cmd.attr.sl;
-	attr.src_path_bits     = cmd.attr.src_path_bits;
-	attr.static_rate       = cmd.attr.static_rate;
-	attr.ah_flags          = cmd.attr.is_global ? IB_AH_GRH : 0;
-	attr.port_num 	       = cmd.attr.port_num;
-	attr.grh.flow_label    = cmd.attr.grh.flow_label;
-	attr.grh.sgid_index    = cmd.attr.grh.sgid_index;
-	attr.grh.hop_limit     = cmd.attr.grh.hop_limit;
-	attr.grh.traffic_class = cmd.attr.grh.traffic_class;
-	memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16);
+	attr.dlid	       = src_attr->dlid;
+	attr.sl		       = src_attr->sl;
+	attr.src_path_bits     = src_attr->src_path_bits;
+	attr.static_rate       = src_attr->static_rate;
+	attr.ah_flags          = src_attr->is_global ? IB_AH_GRH : 0;
+	attr.port_num	       = src_attr->port_num;
+	attr.grh.flow_label    = src_attr->grh.flow_label;
+	attr.grh.sgid_index    = src_attr->grh.sgid_index;
+	attr.grh.hop_limit     = src_attr->grh.hop_limit;
+	attr.grh.traffic_class = src_attr->grh.traffic_class;
+	memcpy(attr.grh.dgid.raw, src_attr->grh.dgid, 16);
+
+	if (src_attr->comp_mask & IB_UVERBS_AH_ATTR_DMAC)
+		memcpy(attr.dmac, src_attr->dmac, sizeof(attr.dmac));
+	else
+		memset(attr.dmac, 0, sizeof(attr.dmac));
 
 	ah = ib_create_ah(pd, &attr);
 	if (IS_ERR(ah)) {
@@ -2439,22 +2558,61 @@ ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
 	}
 
 	ah->uobject  = uobj;
+
 	uobj->object = ah;
 
 	ret = idr_add_uobj(&ib_uverbs_ah_idr, uobj);
 	if (ret)
 		goto err_destroy;
 
+	put_pd_read(pd);
+
+	return uobj;
+
+err_destroy:
+	ib_destroy_ah(ah);
+err_put:
+	put_pd_read(pd);
+err:
+	put_uobj_write(uobj);
+	return ERR_PTR(ret);
+}
+
+ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
+			    const char __user *buf, int in_len,
+			    int out_len)
+{
+	struct ib_uverbs_create_ah_ex	 cmd_ex;
+	struct ib_uverbs_create_ah	*cmd = (struct ib_uverbs_create_ah *)
+					       ((void *)&cmd_ex +
+						sizeof(cmd_ex.comp_mask));
+	struct ib_uverbs_ah_attr_ex	 attr_ex;
+	struct ib_uverbs_create_ah_resp	 resp;
+	struct ib_uobject		*uobj;
+	int ret;
+
+	if (out_len < sizeof(resp))
+		return -ENOSPC;
+
+	cmd_ex.comp_mask = 0;
+	if (copy_from_user(cmd, buf, sizeof(*cmd)))
+		return -EFAULT;
+
+	attr_ex.comp_mask = 0;
+	memcpy(&attr_ex, &cmd->attr, sizeof(cmd->attr));
+
+	uobj = ib_uverbs_create_ah_assign(&cmd_ex, &attr_ex, file);
+	if (IS_ERR(uobj))
+		return PTR_ERR(uobj);
+
 	resp.ah_handle = uobj->id;
 
-	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+	if (copy_to_user((void __user *)(unsigned long) cmd->response,
 			 &resp, sizeof resp)) {
 		ret = -EFAULT;
 		goto err_copy;
 	}
 
-	put_pd_read(pd);
-
 	mutex_lock(&file->mutex);
 	list_add_tail(&uobj->list, &file->ucontext->ah_list);
 	mutex_unlock(&file->mutex);
@@ -2467,15 +2625,54 @@ ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
 
 err_copy:
 	idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
+	ib_destroy_ah(uobj->object);
+	put_uobj_write(uobj);
 
-err_destroy:
-	ib_destroy_ah(ah);
+	return ret;
+}
 
-err_put:
-	put_pd_read(pd);
+ssize_t ib_uverbs_create_ah_ex(struct ib_uverbs_file *file,
+			       const char __user *buf, int in_len,
+			       int out_len)
+{
+	struct ib_uverbs_create_ah_ex	 cmd_ex;
+	struct ib_uverbs_create_ah_resp	 resp;
+	struct ib_uobject		*uobj;
+	int ret;
 
-err:
+	if (out_len < sizeof(resp))
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd_ex, buf, sizeof(cmd_ex)))
+		return -EFAULT;
+
+	uobj = ib_uverbs_create_ah_assign(&cmd_ex, &cmd_ex.attr, file);
+	if (IS_ERR(uobj))
+		return PTR_ERR(uobj);
+
+	resp.ah_handle = uobj->id;
+
+	if (copy_to_user((void __user *)(unsigned long)cmd_ex.response,
+			 &resp, sizeof(resp))) {
+		ret = -EFAULT;
+		goto err_copy;
+	}
+
+	mutex_lock(&file->mutex);
+	list_add_tail(&uobj->list, &file->ucontext->ah_list);
+	mutex_unlock(&file->mutex);
+
+	uobj->live = 1;
+
+	up_write(&uobj->mutex);
+
+	return in_len;
+
+err_copy:
+	idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
+	ib_destroy_ah(uobj->object);
 	put_uobj_write(uobj);
+
 	return ret;
 }
 
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 75ad86c..f1cc3f0 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -116,7 +116,9 @@ static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file,
 	[IB_USER_VERBS_CMD_CREATE_XSRQ]		= ib_uverbs_create_xsrq,
 	[IB_USER_VERBS_CMD_OPEN_QP]		= ib_uverbs_open_qp,
 	[IB_USER_VERBS_CMD_CREATE_FLOW]		= ib_uverbs_create_flow,
-	[IB_USER_VERBS_CMD_DESTROY_FLOW]	= ib_uverbs_destroy_flow
+	[IB_USER_VERBS_CMD_DESTROY_FLOW]	= ib_uverbs_destroy_flow,
+	[IB_USER_VERBS_CMD_MODIFY_QP_EX]	= ib_uverbs_modify_qp_ex,
+	[IB_USER_VERBS_CMD_CREATE_AH_EX]	= ib_uverbs_create_ah_ex
 };
 
 static void ib_uverbs_add_one(struct ib_device *device);
diff --git a/drivers/infiniband/core/uverbs_marshall.c b/drivers/infiniband/core/uverbs_marshall.c
index e7bee46..7f7a7e2 100644
--- a/drivers/infiniband/core/uverbs_marshall.c
+++ b/drivers/infiniband/core/uverbs_marshall.c
@@ -31,6 +31,7 @@
  */
 
 #include <linux/export.h>
+#include <linux/etherdevice.h>
 #include <rdma/ib_marshall.h>
 
 void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst,
@@ -52,6 +53,17 @@ void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst,
 }
 EXPORT_SYMBOL(ib_copy_ah_attr_to_user);
 
+void ib_copy_ah_attr_to_user_ex(struct ib_uverbs_ah_attr_ex *dst,
+				struct ib_ah_attr *src)
+{
+	dst->comp_mask = 0;
+	ib_copy_ah_attr_to_user((struct ib_uverbs_ah_attr *)
+				dst, src);
+	dst->comp_mask |= IB_UVERBS_AH_ATTR_DMAC;
+	memcpy(dst->dmac, src->dmac, sizeof(dst->dmac));
+}
+EXPORT_SYMBOL(ib_copy_ah_attr_to_user_ex);
+
 void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst,
 			     struct ib_qp_attr *src)
 {
@@ -65,15 +77,15 @@ void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst,
 	dst->dest_qp_num	= src->dest_qp_num;
 	dst->qp_access_flags	= src->qp_access_flags;
 
+	ib_copy_ah_attr_to_user(&dst->ah_attr, &src->ah_attr);
+	ib_copy_ah_attr_to_user(&dst->alt_ah_attr, &src->alt_ah_attr);
+
 	dst->max_send_wr	= src->cap.max_send_wr;
 	dst->max_recv_wr	= src->cap.max_recv_wr;
 	dst->max_send_sge	= src->cap.max_send_sge;
 	dst->max_recv_sge	= src->cap.max_recv_sge;
 	dst->max_inline_data	= src->cap.max_inline_data;
 
-	ib_copy_ah_attr_to_user(&dst->ah_attr, &src->ah_attr);
-	ib_copy_ah_attr_to_user(&dst->alt_ah_attr, &src->alt_ah_attr);
-
 	dst->pkey_index		= src->pkey_index;
 	dst->alt_pkey_index	= src->alt_pkey_index;
 	dst->en_sqd_async_notify = src->en_sqd_async_notify;
@@ -91,6 +103,63 @@ void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst,
 }
 EXPORT_SYMBOL(ib_copy_qp_attr_to_user);
 
+void ib_copy_qp_attr_to_user_ex(struct ib_uverbs_qp_attr_ex *dst,
+				struct ib_qp_attr *src)
+{
+	struct ib_uverbs_ah_attr_ex ah_attr_ex;
+	struct ib_uverbs_ah_attr_ex alt_ah_attr_ex;
+
+	ib_copy_qp_attr_to_user((struct ib_uverbs_qp_attr *)
+				dst, src);
+	if (dst->comp_mask & IB_UVERBS_QP_ATTR_AH_EX &&
+	    !is_zero_ether_addr(src->ah_attr.dmac)) {
+		ib_copy_ah_attr_to_user_ex(&ah_attr_ex, &src->ah_attr);
+		if (dst->ah_attr_ex == NULL ||
+		    copy_to_user(dst->ah_attr_ex, &ah_attr_ex,
+				 sizeof(ah_attr_ex)))
+			dst->comp_mask &= ~IB_UVERBS_QP_ATTR_AH_EX;
+	} else {
+		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_AH_EX;
+	}
+	if (dst->comp_mask & IB_UVERBS_QP_ATTR_ALT_AH_EX &&
+	    !is_zero_ether_addr(src->alt_ah_attr.dmac)) {
+		ib_copy_ah_attr_to_user_ex(&alt_ah_attr_ex, &src->alt_ah_attr);
+		if (dst->alt_ah_attr_ex == NULL ||
+		    copy_to_user(dst->alt_ah_attr_ex, &alt_ah_attr_ex,
+				 sizeof(alt_ah_attr_ex)))
+			dst->comp_mask &= ~IB_UVERBS_QP_ATTR_ALT_AH_EX;
+	} else {
+		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_ALT_AH_EX;
+	}
+	if (dst->comp_mask & IB_UVERBS_QP_ATTR_SMAC &&
+	    !is_zero_ether_addr(src->smac))
+		memcpy(dst->smac, src->smac, sizeof(dst->smac));
+	else
+		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_SMAC;
+	if (dst->comp_mask & IB_UVERBS_QP_ATTR_ALT_SMAC &&
+	    !is_zero_ether_addr(src->alt_smac))
+		memcpy(dst->alt_smac, src->alt_smac, sizeof(dst->alt_smac));
+	else
+		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_ALT_SMAC;
+
+	if (dst->comp_mask & IB_UVERBS_QP_ATTR_VID &&
+	    src->vlan_id != 0xFFFF)
+		dst->vlan_id = src->vlan_id;
+	else
+		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_VID;
+	if (dst->comp_mask & IB_UVERBS_QP_ATTR_ALT_VID &&
+	    src->alt_vlan_id != 0xFFFF)
+		dst->alt_vlan_id = src->alt_vlan_id;
+	else
+		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_ALT_VID;
+}
+EXPORT_SYMBOL(ib_copy_qp_attr_to_user_ex);
+
 void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst,
 			      struct ib_sa_path_rec *src)
 {
@@ -117,11 +186,26 @@ void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst,
 }
 EXPORT_SYMBOL(ib_copy_path_rec_to_user);
 
-void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst,
-				struct ib_user_path_rec *src)
+void ib_copy_path_rec_to_user_ex(struct ib_user_path_rec_ex *dst,
+				 struct ib_sa_path_rec *src)
 {
-	memcpy(dst->dgid.raw, src->dgid, sizeof dst->dgid);
-	memcpy(dst->sgid.raw, src->sgid, sizeof dst->sgid);
+	ib_copy_path_rec_to_user((struct ib_user_path_rec *)dst, src);
+
+	dst->comp_mask = IB_USER_PATH_REC_ATTR_DMAC |
+			 IB_USER_PATH_REC_ATTR_SMAC |
+			 IB_USER_PATH_REC_ATTR_VID;
+
+	memcpy(dst->dmac, src->dmac, sizeof(dst->dmac));
+	memcpy(dst->smac, src->smac, sizeof(dst->smac));
+	dst->vlan_id = src->vlan_id;
+}
+EXPORT_SYMBOL(ib_copy_path_rec_to_user_ex);
+
+static void ib_copy_path_rec_from_user_assign(struct ib_sa_path_rec *dst,
+					      struct ib_user_path_rec *src)
+{
+	memcpy(dst->dgid.raw, src->dgid, sizeof(dst->dgid));
+	memcpy(dst->sgid.raw, src->sgid, sizeof(dst->sgid));
 
 	dst->dlid		= src->dlid;
 	dst->slid		= src->slid;
@@ -141,4 +225,34 @@ void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst,
 	dst->preference		= src->preference;
 	dst->packet_life_time_selector = src->packet_life_time_selector;
 }
+
+void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst,
+				struct ib_user_path_rec *src)
+{
+	memset(dst->dmac, 0, sizeof(dst->dmac));
+	memset(dst->smac, 0, sizeof(dst->smac));
+	dst->vlan_id = 0xFFFF;
+
+	ib_copy_path_rec_from_user_assign(dst, src);
+}
 EXPORT_SYMBOL(ib_copy_path_rec_from_user);
+
+void ib_copy_path_rec_from_user_ex(struct ib_sa_path_rec *dst,
+				   struct ib_user_path_rec_ex *src)
+{
+	if (src->comp_mask & IB_USER_PATH_REC_ATTR_DMAC)
+		memcpy(dst->dmac, src->dmac, sizeof(dst->dmac));
+	else
+		memset(dst->dmac, 0, sizeof(dst->dmac));
+
+	if (src->comp_mask & IB_USER_PATH_REC_ATTR_SMAC)
+		memcpy(dst->smac, src->smac, sizeof(dst->smac));
+	else
+		memset(dst->smac, 0, sizeof(dst->smac));
+
+	if (src->comp_mask & IB_USER_PATH_REC_ATTR_VID)
+		dst->vlan_id = src->vlan_id;
+	else
+		dst->vlan_id = 0xFFFF;
+
+	ib_copy_path_rec_from_user_assign(dst, (struct ib_user_path_rec *)src);
+}
+EXPORT_SYMBOL(ib_copy_path_rec_from_user_ex);
diff --git a/include/rdma/ib_marshall.h b/include/rdma/ib_marshall.h
index db03720..11ab3a8 100644
--- a/include/rdma/ib_marshall.h
+++ b/include/rdma/ib_marshall.h
@@ -41,13 +41,25 @@
 void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst,
 			     struct ib_qp_attr *src);
 
+void ib_copy_qp_attr_to_user_ex(struct ib_uverbs_qp_attr_ex *dst,
+				struct ib_qp_attr *src);
+
 void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst,
 			     struct ib_ah_attr *src);
 
+void ib_copy_ah_attr_to_user_ex(struct ib_uverbs_ah_attr_ex *dst,
+				struct ib_ah_attr *src);
+
 void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst,
 			      struct ib_sa_path_rec *src);
 
+void ib_copy_path_rec_to_user_ex(struct ib_user_path_rec_ex *dst,
+				 struct ib_sa_path_rec *src);
+
 void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst,
 				struct ib_user_path_rec *src);
 
+void ib_copy_path_rec_from_user_ex(struct ib_sa_path_rec *dst,
+				   struct ib_user_path_rec_ex *src);
+
 #endif /* IB_USER_MARSHALL_H */
diff --git a/include/uapi/rdma/ib_user_sa.h b/include/uapi/rdma/ib_user_sa.h
index cfc7c9b..8a24e10 100644
--- a/include/uapi/rdma/ib_user_sa.h
+++ b/include/uapi/rdma/ib_user_sa.h
@@ -48,7 +48,13 @@ enum {
 struct ib_path_rec_data {
 	__u32	flags;
 	__u32	reserved;
-	__u32	path_rec[16];
+	__u32	path_rec[20];
+};
+
+enum ib_user_path_rec_attr_mask {
+	IB_USER_PATH_REC_ATTR_DMAC = 1ULL << 0,
+	IB_USER_PATH_REC_ATTR_SMAC = 1ULL << 1,
+	IB_USER_PATH_REC_ATTR_VID  = 1ULL << 2
 };
 
 struct ib_user_path_rec {
@@ -73,4 +79,30 @@ struct ib_user_path_rec {
 	__u8	preference;
 };
 
+struct ib_user_path_rec_ex {
+	__u8	dgid[16];
+	__u8	sgid[16];
+	__be16	dlid;
+	__be16	slid;
+	__u32	raw_traffic;
+	__be32	flow_label;
+	__u32	reversible;
+	__u32	mtu;
+	__be16	pkey;
+	__u8	hop_limit;
+	__u8	traffic_class;
+	__u8	numb_path;
+	__u8	sl;
+	__u8	mtu_selector;
+	__u8	rate_selector;
+	__u8	rate;
+	__u8	packet_life_time_selector;
+	__u8	packet_life_time;
+	__u8	preference;
+	__u32   comp_mask;
+	__u8	smac[ETH_ALEN];
+	__u8	dmac[ETH_ALEN];
+	__u16	vlan_id;
+};
+
 #endif /* IB_USER_SA_H */
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 0b233c5..f1939cf 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -37,6 +37,7 @@
 #define IB_USER_VERBS_H
 
 #include <linux/types.h>
+#include <linux/if_ether.h>
 
 /*
  * Increment this value if any changes that break userspace ABI
@@ -88,7 +89,9 @@ enum {
 	IB_USER_VERBS_CMD_CREATE_XSRQ,
 	IB_USER_VERBS_CMD_OPEN_QP,
 	IB_USER_VERBS_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_THRESHOLD,
-	IB_USER_VERBS_CMD_DESTROY_FLOW
+	IB_USER_VERBS_CMD_DESTROY_FLOW,
+	IB_USER_VERBS_CMD_MODIFY_QP_EX,
+	IB_USER_VERBS_CMD_CREATE_AH_EX
 };
 
 /*
@@ -394,6 +397,24 @@ struct ib_uverbs_ah_attr {
 	__u8  reserved;
 };
 
+enum ib_uverbs_ah_attr_mask {
+	IB_UVERBS_AH_ATTR_DMAC = 1 << 0,
+	IB_UVERBS_AH_ATTR_VID  = 1 << 1
+};
+
+struct ib_uverbs_ah_attr_ex {
+	struct ib_uverbs_global_route grh;
+	__u16 dlid;
+	__u8  sl;
+	__u8  src_path_bits;
+	__u8  static_rate;
+	__u8  is_global;
+	__u8  port_num;
+	__u8  reserved;
+	__u32 comp_mask;
+	__u8  dmac[ETH_ALEN];
+};
+
 struct ib_uverbs_qp_attr {
 	__u32	qp_attr_mask;
 	__u32	qp_state;
@@ -432,6 +453,65 @@ struct ib_uverbs_qp_attr {
 	__u8	reserved[5];
 };
 
+enum ib_uverbs_qp_attr_mask {
+	IB_UVERBS_QP_ATTR_AH_EX		= 1 << 0,
+	IB_UVERBS_QP_ATTR_ALT_AH_EX	= 1 << 1,
+	IB_UVERBS_QP_ATTR_SMAC		= 1 << 2,
+	IB_UVERBS_QP_ATTR_ALT_SMAC	= 1 << 3,
+	IB_UVERBS_QP_ATTR_VID		= 1 << 4,
+	IB_UVERBS_QP_ATTR_ALT_VID	= 1 << 5,
+};
+
+struct ib_uverbs_qp_attr_ex {
+	__u32	qp_attr_mask;
+	__u32	qp_state;
+	__u32	cur_qp_state;
+	__u32	path_mtu;
+	__u32	path_mig_state;
+	__u32	qkey;
+	__u32	rq_psn;
+	__u32	sq_psn;
+	__u32	dest_qp_num;
+	__u32	qp_access_flags;
+
+	/* deprecated for extension */
+	struct ib_uverbs_ah_attr ah_attr;
+	/* deprecated for extension */
+	struct ib_uverbs_ah_attr alt_ah_attr;
+
+	/* ib_qp_cap */
+	__u32	max_send_wr;
+	__u32	max_recv_wr;
+	__u32	max_send_sge;
+	__u32	max_recv_sge;
+	__u32	max_inline_data;
+
+	__u16	pkey_index;
+	__u16	alt_pkey_index;
+	__u8	en_sqd_async_notify;
+	__u8	sq_draining;
+	__u8	max_rd_atomic;
+	__u8	max_dest_rd_atomic;
+	__u8	min_rnr_timer;
+	__u8	port_num;
+	__u8	timeout;
+	__u8	retry_cnt;
+	__u8	rnr_retry;
+	__u8	alt_port_num;
+	__u8	alt_timeout;
+	__u8	reserved[5];
+
+	__u32   comp_mask;
+	/* represents: struct ib_uverbs_ah_attr_ex * __user */
+	void __user *ah_attr_ex;
+	/* represents: struct ib_uverbs_ah_attr_ex * __user */
+	void __user *alt_ah_attr_ex;
+	__u8	smac[ETH_ALEN];
+	__u8	alt_smac[ETH_ALEN];
+	__u16	vlan_id;
+	__u16	alt_vlan_id;
+};
+
 struct ib_uverbs_create_qp {
 	__u64 response;
 	__u64 user_handle;
@@ -492,6 +572,27 @@ struct ib_uverbs_qp_dest {
 	__u8  port_num;
 };
 
+enum ib_uverbs_qp_dest_attr_mask {
+	IB_UVERBS_QP_DEST_ATTR_DMAC = 1ULL << 0
+};
+
+struct ib_uverbs_qp_dest_ex {
+	__u8  dgid[16];
+	__u32 flow_label;
+	__u16 dlid;
+	__u16 reserved;
+	__u8  sgid_index;
+	__u8  hop_limit;
+	__u8  traffic_class;
+	__u8  sl;
+	__u8  src_path_bits;
+	__u8  static_rate;
+	__u8  is_global;
+	__u8  port_num;
+	__u32 comp_mask;
+	__u8  dmac[ETH_ALEN];
+};
+
 struct ib_uverbs_query_qp {
 	__u64 response;
 	__u32 qp_handle;
@@ -563,6 +664,54 @@ struct ib_uverbs_modify_qp {
 	__u64 driver_data[0];
 };
 
+enum ib_uverbs_modify_qp_ex_comp_mask {
+	IB_UVERBS_MODIFY_QP_EX_SMAC		      = (1ULL << 0),
+	IB_UVERBS_MODIFY_QP_EX_ALT_SMAC		      = (1ULL << 1),
+	IB_UVERBS_MODIFY_QP_EX_VID		      = (1ULL << 2),
+	IB_UVERBS_MODIFY_QP_EX_ALT_VID		      = (1ULL << 3),
+	IB_UVERBS_MODIFY_QP_EX_DEST_EX		      = (1ULL << 4),
+	IB_UVERBS_MODIFY_QP_EX_ALT_DEST_EX	      = (1ULL << 5)
+};
+
+struct ib_uverbs_modify_qp_ex {
+	__u32 comp_mask;
+	struct ib_uverbs_qp_dest dest;
+	struct ib_uverbs_qp_dest alt_dest;
+	__u32 qp_handle;
+	__u32 attr_mask;
+	__u32 qkey;
+	__u32 rq_psn;
+	__u32 sq_psn;
+	__u32 dest_qp_num;
+	__u32 qp_access_flags;
+	__u16 pkey_index;
+	__u16 alt_pkey_index;
+	__u8  qp_state;
+	__u8  cur_qp_state;
+	__u8  path_mtu;
+	__u8  path_mig_state;
+	__u8  en_sqd_async_notify;
+	__u8  max_rd_atomic;
+	__u8  max_dest_rd_atomic;
+	__u8  min_rnr_timer;
+	__u8  port_num;
+	__u8  timeout;
+	__u8  retry_cnt;
+	__u8  rnr_retry;
+	__u8  alt_port_num;
+	__u8  alt_timeout;
+	__u8  reserved[2];
+	__u8  smac[ETH_ALEN];
+	__u8  alt_smac[ETH_ALEN];
+	__u16 vid;
+	__u16 alt_vid;
+	/* represents: struct ib_uverbs_qp_dest_ex * __user */
+	void __user *dest_ex;
+	/* represents: struct ib_uverbs_qp_dest_ex * __user */
+	void __user *alt_dest_ex;
+	__u64 driver_data[0];
+};
+
 struct ib_uverbs_modify_qp_resp {
 };
 
@@ -672,6 +821,15 @@ struct ib_uverbs_create_ah {
 	struct ib_uverbs_ah_attr attr;
 };
 
+struct ib_uverbs_create_ah_ex {
+	__u32 comp_mask;
+	__u64 response;
+	__u64 user_handle;
+	__u32 pd_handle;
+	__u32 reserved;
+	struct ib_uverbs_ah_attr_ex attr;
+};
+
 struct ib_uverbs_create_ah_resp {
 	__u32 ah_handle;
 };
-- 
1.7.1


* [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (6 preceding siblings ...)
  2013-09-10 14:41   ` [PATCH V4 7/9] IB/core: Add RoCE IP based addressing extensions for uverbs Or Gerlitz
@ 2013-09-10 14:41   ` Or Gerlitz
       [not found]     ` <1378824099-22150-9-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-09-10 14:41   ` [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX Or Gerlitz
  8 siblings, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add rdma_ucm support for the RoCE (IBoE) IP based addressing extensions
towards librdmacm.

Extend the INIT_QP_ATTR and QUERY_ROUTE ucma commands:

INIT_QP_ATTR_EX uses struct ib_uverbs_qp_attr_ex.

QUERY_ROUTE_EX uses struct rdma_ucm_query_route_resp_ex, which in turn
uses struct ib_user_path_rec_ex.
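
A loose sketch of how a userspace library might drive the new command
over the rdma_cm device follows; cm_fd (the rdma_cm file descriptor),
ctx_id (the ucma context handle) and the use of IBV_QPS_RTR from
libibverbs are illustrative assumptions, and error handling is elided:

	struct {
		struct rdma_ucm_cmd_hdr hdr;	/* __u32 cmd; __u16 in; __u16 out; */
		struct rdma_ucm_init_qp_attr_ex cmd;
	} msg;
	struct ib_uverbs_qp_attr_ex resp;

	memset(&msg, 0, sizeof(msg));
	memset(&resp, 0, sizeof(resp));
	msg.hdr.cmd      = RDMA_USER_CM_CMD_INIT_QP_ATTR_EX;
	msg.hdr.in       = sizeof(msg.cmd);
	msg.hdr.out      = sizeof(resp);
	msg.cmd.response = (uintptr_t)&resp;
	msg.cmd.id       = ctx_id;
	msg.cmd.qp_state = IBV_QPS_RTR;

	if (write(cm_fd, &msg, sizeof(msg)) != sizeof(msg))
		perror("INIT_QP_ATTR_EX");
	/* on success, resp.comp_mask reports which extended fields are valid */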

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/ucma.c   |  175 ++++++++++++++++++++++++++++++++++++-
 include/uapi/rdma/rdma_user_cm.h |   29 ++++++-
 2 files changed, 198 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 7e7da86..4d59e88 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -652,6 +652,35 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
 	}
 }
 
+static void ucma_copy_ib_route_ex(struct rdma_ucm_query_route_resp_ex *resp,
+				  struct rdma_route *route)
+{
+	struct rdma_dev_addr *dev_addr;
+
+	resp->num_paths = route->num_paths;
+	switch (route->num_paths) {
+	case 0:
+		dev_addr = &route->addr.dev_addr;
+		rdma_addr_get_dgid(dev_addr,
+				   (union ib_gid *)&resp->ib_route[0].dgid);
+		rdma_addr_get_sgid(dev_addr,
+				   (union ib_gid *)&resp->ib_route[0].sgid);
+		resp->ib_route[0].pkey =
+			cpu_to_be16(ib_addr_get_pkey(dev_addr));
+		break;
+	case 2:
+		ib_copy_path_rec_to_user_ex(&resp->ib_route[1],
+					    &route->path_rec[1]);
+		/* fall through */
+	case 1:
+		ib_copy_path_rec_to_user_ex(&resp->ib_route[0],
+					    &route->path_rec[0]);
+		break;
+	default:
+		break;
+	}
+}
+
 static void ucma_copy_iboe_route(struct rdma_ucm_query_route_resp *resp,
 				 struct rdma_route *route)
 {
@@ -678,14 +707,39 @@ static void ucma_copy_iboe_route(struct rdma_ucm_query_route_resp *resp,
 	}
 }
 
-static void ucma_copy_iw_route(struct rdma_ucm_query_route_resp *resp,
+static void ucma_copy_iboe_route_ex(struct rdma_ucm_query_route_resp_ex *resp,
+				    struct rdma_route *route)
+{
+	resp->num_paths = route->num_paths;
+	switch (route->num_paths) {
+	case 0:
+		rdma_ip2gid((struct sockaddr *)&route->addr.dst_addr,
+			    (union ib_gid *)&resp->ib_route[0].dgid);
+		rdma_ip2gid((struct sockaddr *)&route->addr.src_addr,
+			    (union ib_gid *)&resp->ib_route[0].sgid);
+		resp->ib_route[0].pkey = cpu_to_be16(0xffff);
+		break;
+	case 2:
+		ib_copy_path_rec_to_user_ex(&resp->ib_route[1],
+					    &route->path_rec[1]);
+		/* fall through */
+	case 1:
+		ib_copy_path_rec_to_user_ex(&resp->ib_route[0],
+					    &route->path_rec[0]);
+		break;
+	default:
+		break;
+	}
+}
+
+static void ucma_copy_iw_route(struct ib_user_path_rec *resp_path,
 			       struct rdma_route *route)
 {
 	struct rdma_dev_addr *dev_addr;
 
 	dev_addr = &route->addr.dev_addr;
-	rdma_addr_get_dgid(dev_addr, (union ib_gid *) &resp->ib_route[0].dgid);
-	rdma_addr_get_sgid(dev_addr, (union ib_gid *) &resp->ib_route[0].sgid);
+	rdma_addr_get_dgid(dev_addr, (union ib_gid *)&resp_path->dgid);
+	rdma_addr_get_sgid(dev_addr, (union ib_gid *)&resp_path->sgid);
 }
 
 static ssize_t ucma_query_route(struct ucma_file *file,
@@ -737,7 +791,74 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 		}
 		break;
 	case RDMA_TRANSPORT_IWARP:
-		ucma_copy_iw_route(&resp, &ctx->cm_id->route);
+		ucma_copy_iw_route(&resp.ib_route[0], &ctx->cm_id->route);
+		break;
+	default:
+		break;
+	}
+
+out:
+	if (copy_to_user((void __user *)(unsigned long)cmd.response,
+			 &resp, sizeof(resp)))
+		ret = -EFAULT;
+
+	ucma_put_ctx(ctx);
+	return ret;
+}
+
+static ssize_t ucma_query_route_ex(struct ucma_file *file,
+				   const char __user *inbuf,
+				   int in_len, int out_len)
+{
+	struct rdma_ucm_query_route_ex cmd;
+	struct rdma_ucm_query_route_resp_ex resp;
+	struct ucma_context *ctx;
+	struct sockaddr *addr;
+	int ret = 0;
+
+	if (out_len < sizeof(resp))
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	ctx = ucma_get_ctx(file, cmd.id);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	memset(&resp, 0, sizeof(resp));
+	addr = (struct sockaddr *)&ctx->cm_id->route.addr.src_addr;
+	memcpy(&resp.src_addr, addr, addr->sa_family == AF_INET ?
+				     sizeof(struct sockaddr_in) :
+				     sizeof(struct sockaddr_in6));
+	addr = (struct sockaddr *)&ctx->cm_id->route.addr.dst_addr;
+	memcpy(&resp.dst_addr, addr, addr->sa_family == AF_INET ?
+				     sizeof(struct sockaddr_in) :
+				     sizeof(struct sockaddr_in6));
+	if (!ctx->cm_id->device)
+		goto out;
+
+	resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
+	resp.port_num = ctx->cm_id->port_num;
+	switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) {
+	case RDMA_TRANSPORT_IB:
+		switch (rdma_port_get_link_layer(ctx->cm_id->device,
+			ctx->cm_id->port_num)) {
+		case IB_LINK_LAYER_INFINIBAND:
+			ucma_copy_ib_route_ex(&resp, &ctx->cm_id->route);
+			break;
+		case IB_LINK_LAYER_ETHERNET:
+			ucma_copy_iboe_route_ex(&resp, &ctx->cm_id->route);
+			break;
+		default:
+			break;
+		}
+		break;
+	case RDMA_TRANSPORT_IWARP:
+		ucma_copy_iw_route((struct ib_user_path_rec *)
+				   &resp.ib_route[0],
+				   &ctx->cm_id->route);
 		break;
 	default:
 		break;
@@ -1071,6 +1192,48 @@ out:
 	return ret;
 }
 
+static ssize_t ucma_init_qp_attr_ex(struct ucma_file *file,
+				    const char __user *inbuf,
+				    int in_len, int out_len)
+{
+	struct rdma_ucm_init_qp_attr_ex cmd;
+	struct ib_uverbs_qp_attr_ex resp;
+	struct ucma_context *ctx;
+	struct ib_qp_attr qp_attr;
+	int ret;
+
+	if (out_len < sizeof(resp))
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	if (copy_from_user(&resp, (void __user *)(unsigned long)cmd.response,
+			   sizeof(resp)))
+		return -EFAULT;
+
+	ctx = ucma_get_ctx(file, cmd.id);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	resp.comp_mask = cmd.comp_mask;
+	resp.qp_attr_mask = 0;
+	memset(&qp_attr, 0, sizeof(qp_attr));
+	qp_attr.qp_state = cmd.qp_state;
+	ret = rdma_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask);
+	if (ret)
+		goto out;
+
+	ib_copy_qp_attr_to_user_ex(&resp, &qp_attr);
+	if (copy_to_user((void __user *)(unsigned long)cmd.response,
+			 &resp, sizeof(resp)))
+		ret = -EFAULT;
+
+out:
+	ucma_put_ctx(ctx);
+	return ret;
+}
+
 static int ucma_set_option_id(struct ucma_context *ctx, int optname,
 			      void *optval, size_t optlen)
 {
@@ -1474,7 +1637,9 @@ static ssize_t (*ucma_cmd_table[])(struct ucma_file *file,
 	[RDMA_USER_CM_CMD_QUERY]	 = ucma_query,
 	[RDMA_USER_CM_CMD_BIND]		 = ucma_bind,
 	[RDMA_USER_CM_CMD_RESOLVE_ADDR]	 = ucma_resolve_addr,
-	[RDMA_USER_CM_CMD_JOIN_MCAST]	 = ucma_join_multicast
+	[RDMA_USER_CM_CMD_JOIN_MCAST]	 = ucma_join_multicast,
+	[RDMA_USER_CM_CMD_QUERY_ROUTE_EX] = ucma_query_route_ex,
+	[RDMA_USER_CM_CMD_INIT_QP_ATTR_EX] = ucma_init_qp_attr_ex
 };
 
 static ssize_t ucma_write(struct file *filp, const char __user *buf,
diff --git a/include/uapi/rdma/rdma_user_cm.h b/include/uapi/rdma/rdma_user_cm.h
index 99b80ab..ceb7f1f 100644
--- a/include/uapi/rdma/rdma_user_cm.h
+++ b/include/uapi/rdma/rdma_user_cm.h
@@ -65,7 +65,9 @@ enum {
 	RDMA_USER_CM_CMD_QUERY,
 	RDMA_USER_CM_CMD_BIND,
 	RDMA_USER_CM_CMD_RESOLVE_ADDR,
-	RDMA_USER_CM_CMD_JOIN_MCAST
+	RDMA_USER_CM_CMD_JOIN_MCAST,
+	RDMA_USER_CM_CMD_QUERY_ROUTE_EX,
+	RDMA_USER_CM_CMD_INIT_QP_ATTR_EX
 };
 
 /*
@@ -146,6 +148,13 @@ struct rdma_ucm_query {
 	__u32 option;
 };
 
+struct rdma_ucm_query_route_ex {
+	__u32 comp_mask;
+	__u64 response;
+	__u32 id;
+	__u32 reserved;
+};
+
 struct rdma_ucm_query_route_resp {
 	__u64 node_guid;
 	struct ib_user_path_rec ib_route[2];
@@ -173,6 +182,16 @@ struct rdma_ucm_query_path_resp {
 	struct ib_path_rec_data path_data[0];
 };
 
+struct rdma_ucm_query_route_resp_ex {
+	__u64 node_guid;
+	struct ib_user_path_rec_ex ib_route[2];
+	struct sockaddr_in6 src_addr;
+	struct sockaddr_in6 dst_addr;
+	__u32 num_paths;
+	__u8 port_num;
+	__u8 reserved[3];
+};
+
 struct rdma_ucm_conn_param {
 	__u32 qp_num;
 	__u32 qkey;
@@ -231,6 +250,14 @@ struct rdma_ucm_init_qp_attr {
 	__u32 qp_state;
 };
 
+struct rdma_ucm_init_qp_attr_ex {
+	__u32 comp_mask;
+	__u32 reserved;
+	__u64 response;
+	__u32 id;
+	__u32 qp_state;
+};
+
 struct rdma_ucm_notify {
 	__u32 id;
 	__u32 event;
-- 
1.7.1


* [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (7 preceding siblings ...)
  2013-09-10 14:41   ` [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm Or Gerlitz
@ 2013-09-10 14:41   ` Or Gerlitz
       [not found]     ` <1378824099-22150-10-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  8 siblings, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-10 14:41 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The mlx4_ib driver should indicate that it supports the
MODIFY_QP_EX extended user verbs command.
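
For context, ib_uverbs_write() gates each command on this bitmap with a
check roughly like the following (sketched from the dispatcher of this
era, not part of this patch):

	if (hdr.command >= ARRAY_SIZE(uverbs_cmd_table) ||
	    !uverbs_cmd_table[hdr.command])
		return -EINVAL;

	if (!(file->device->ib_dev->uverbs_cmd_mask & (1ull << hdr.command)))
		return -ENOSYS;

so without this bit set, MODIFY_QP_EX requests would fail with ENOSYS.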

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 7a29ad5..77c87d0 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1755,7 +1755,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 		(1ull << IB_USER_VERBS_CMD_QUERY_SRQ)		|
 		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ)		|
 		(1ull << IB_USER_VERBS_CMD_CREATE_XSRQ)		|
-		(1ull << IB_USER_VERBS_CMD_OPEN_QP);
+		(1ull << IB_USER_VERBS_CMD_OPEN_QP)		|
+		(1ull << IB_USER_VERBS_CMD_MODIFY_QP_EX);
 
 	ibdev->ib_dev.query_device	= mlx4_ib_query_device;
 	ibdev->ib_dev.query_port	= mlx4_ib_query_port;
-- 
1.7.1


* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]     ` <1378824099-22150-9-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-11  9:52       ` Yann Droneaud
       [not found]         ` <26c47667e463e65dd79caaa4bddc437b-zgzEX58YAwA@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Yann Droneaud @ 2013-09-11  9:52 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

On 10.09.2013 16:41, Or Gerlitz wrote:
> From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> Add rdma_ucm support for RoCE (IBoE) IP based addressing extensions
> towards librdmacm
> 
> Extend INIT_QP_ATTR and QUERY_ROUTE ucma commands.
> 
> INIT_QP_ATTR_EX uses struct ib_uverbs_qp_attr_ex
> 
> QUERY_ROUTE_EX uses struct rdma_ucm_query_route_resp_ex which in turn
> uses ib_user_path_rec_ex
> 
> Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/infiniband/core/ucma.c   |  175 
> ++++++++++++++++++++++++++++++++++++-
>  include/uapi/rdma/rdma_user_cm.h |   29 ++++++-
>  2 files changed, 198 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/infiniband/core/ucma.c 
> b/drivers/infiniband/core/ucma.c
> index 7e7da86..4d59e88 100644
> --- a/drivers/infiniband/core/ucma.c
> +++ b/drivers/infiniband/core/ucma.c
> 
> +static ssize_t ucma_init_qp_attr_ex(struct ucma_file *file,
> +				    const char __user *inbuf,
> +				    int in_len, int out_len)
> +{
> +	struct rdma_ucm_init_qp_attr_ex cmd;
> +	struct ib_uverbs_qp_attr_ex resp;
> +	struct ucma_context *ctx;
> +	struct ib_qp_attr qp_attr;
> +	int ret;
> +
> +	if (out_len < sizeof(resp))
> +		return -ENOSPC;
> +
> +	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
> +		return -EFAULT;
> +
> +	if (copy_from_user(&resp, (void __user *)(unsigned long)cmd.response,
> +			   sizeof(resp)))
> +		return -EFAULT;
> +


Reading from the response buffer? I haven't seen that in IB/core before.

It seems to me that no other part of the user verbs/RDMA API uses the
output buffer as an input buffer.

> +	ctx = ucma_get_ctx(file, cmd.id);
> +	if (IS_ERR(ctx))
> +		return PTR_ERR(ctx);
> +
> +	resp.comp_mask = cmd.comp_mask;
> +	resp.qp_attr_mask = 0;
> +	memset(&qp_attr, 0, sizeof(qp_attr));
> +	qp_attr.qp_state = cmd.qp_state;
> +	ret = rdma_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask);
> +	if (ret)
> +		goto out;
> +
> +	ib_copy_qp_attr_to_user_ex(&resp, &qp_attr);
> +	if (copy_to_user((void __user *)(unsigned long)cmd.response,
> +			 &resp, sizeof(resp)))
> +		ret = -EFAULT;
> +
> +out:
> +	ucma_put_ctx(ctx);
> +	return ret;
> +}
> +

-- 
Yann Droneaud
OPTEYA


* Re: [PATCH V4 7/9] IB/core: Add RoCE IP based addressing extensions for uverbs
       [not found]     ` <1378824099-22150-8-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-11 10:06       ` meuh-zgzEX58YAwA
       [not found]         ` <6d494aa8d403e0c50b16f09fbd2c3ab6-zgzEX58YAwA@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: meuh-zgzEX58YAwA @ 2013-09-11 10:06 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

On 10.09.2013 16:41, Or Gerlitz wrote:
> From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> Add uverbs support for RoCE (IBoE) IP based addressing extensions
> towards user space libraries.
> 
> Under IP based GID addressing, for RC QPs, the QP attributes should
> contain the Ethernet L2 destination. Until now, indicating the GID was
> sufficient. When using IP addresses encoded in GIDs, the QP attributes
> should contain an extended destination, indicating the VLAN and dmac as
> well. This is done via a new struct ib_uverbs_qp_dest_ex.
> This new structure is contained in a new struct ib_uverbs_modify_qp_ex
> that is used by the MODIFY_QP_EX command. In order to make those changes
> seamless, the extended structures were added at the bottom of the
> current structures.
> The new command also gets smac/alt_smac/vlan_id/alt_vlan_id. Those
> parameters are fixed in the QP context in order to enhance security.
> The extended dest is passed as a pointer rather than as an inline fixed
> field, for the sake of future scalability.
> 
> Also, when the GID encodes an IP address, the AH attributes should also
> contain the dmac. Therefore, ib_uverbs_create_ah was extended to contain
> those fields. When creating an AH, the user indicates the exact L2
> Ethernet destination parameters. This is done by a new CREATE_AH_EX
> command that uses a new struct ib_uverbs_create_ah_ex.
> 
> struct ib_user_path_rec was extended too, to contain the source and
> destination MAC addresses and the VLAN ID; this structure is used by the
> rdma_ucm driver.
> Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/infiniband/core/uverbs.h          |    2 +
>  drivers/infiniband/core/uverbs_cmd.c      |  359 
> ++++++++++++++++++++++-------
>  drivers/infiniband/core/uverbs_main.c     |    4 +-
>  drivers/infiniband/core/uverbs_marshall.c |  128 ++++++++++-
>  include/rdma/ib_marshall.h                |   12 +
>  include/uapi/rdma/ib_user_sa.h            |   34 +++-
>  include/uapi/rdma/ib_user_verbs.h         |  160 +++++++++++++-
>  7 files changed, 608 insertions(+), 91 deletions(-)
> 
> diff --git a/drivers/infiniband/core/uverbs.h 
> b/drivers/infiniband/core/uverbs.h
> index d040b87..b0fcb0b 100644
> --- a/drivers/infiniband/core/uverbs.h
> +++ b/drivers/infiniband/core/uverbs.h
> @@ -202,11 +202,13 @@ IB_UVERBS_DECLARE_CMD(create_qp);
>  IB_UVERBS_DECLARE_CMD(open_qp);
>  IB_UVERBS_DECLARE_CMD(query_qp);
>  IB_UVERBS_DECLARE_CMD(modify_qp);
> +IB_UVERBS_DECLARE_CMD(modify_qp_ex);
>  IB_UVERBS_DECLARE_CMD(destroy_qp);
>  IB_UVERBS_DECLARE_CMD(post_send);
>  IB_UVERBS_DECLARE_CMD(post_recv);
>  IB_UVERBS_DECLARE_CMD(post_srq_recv);
>  IB_UVERBS_DECLARE_CMD(create_ah);
> +IB_UVERBS_DECLARE_CMD(create_ah_ex);
>  IB_UVERBS_DECLARE_CMD(destroy_ah);
>  IB_UVERBS_DECLARE_CMD(attach_mcast);
>  IB_UVERBS_DECLARE_CMD(detach_mcast);
> diff --git a/drivers/infiniband/core/uverbs_cmd.c
> b/drivers/infiniband/core/uverbs_cmd.c
> index f2b81b9..9a0c5d7 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -1900,6 +1900,60 @@ static int modify_qp_mask(enum ib_qp_type
> qp_type, int mask)
>  	}
>  }
> 
> +static void ib_uverbs_modify_qp_assign(struct ib_uverbs_modify_qp 
> *cmd,
> +				       struct ib_qp_attr *attr,
> +				       struct ib_uverbs_qp_dest *dest,
> +				       struct ib_uverbs_qp_dest *alt_dest) {
> +	attr->qp_state		  = cmd->qp_state;
> +	attr->cur_qp_state	  = cmd->cur_qp_state;
> +	attr->path_mtu		  = cmd->path_mtu;
> +	attr->path_mig_state	  = cmd->path_mig_state;
> +	attr->qkey		  = cmd->qkey;
> +	attr->rq_psn		  = cmd->rq_psn;
> +	attr->sq_psn		  = cmd->sq_psn;
> +	attr->dest_qp_num	  = cmd->dest_qp_num;
> +	attr->qp_access_flags	  = cmd->qp_access_flags;
> +	attr->pkey_index	  = cmd->pkey_index;
> +	attr->alt_pkey_index	  = cmd->alt_pkey_index;
> +	attr->en_sqd_async_notify = cmd->en_sqd_async_notify;
> +	attr->max_rd_atomic	  = cmd->max_rd_atomic;
> +	attr->max_dest_rd_atomic  = cmd->max_dest_rd_atomic;
> +	attr->min_rnr_timer	  = cmd->min_rnr_timer;
> +	attr->port_num		  = cmd->port_num;
> +	attr->timeout		  = cmd->timeout;
> +	attr->retry_cnt		  = cmd->retry_cnt;
> +	attr->rnr_retry		  = cmd->rnr_retry;
> +	attr->alt_port_num	  = cmd->alt_port_num;
> +	attr->alt_timeout	  = cmd->alt_timeout;
> +
> +	memcpy(attr->ah_attr.grh.dgid.raw, dest->dgid, 16);
> +	attr->ah_attr.grh.flow_label        = dest->flow_label;
> +	attr->ah_attr.grh.sgid_index        = dest->sgid_index;
> +	attr->ah_attr.grh.hop_limit         = dest->hop_limit;
> +	attr->ah_attr.grh.traffic_class     = dest->traffic_class;
> +	attr->ah_attr.dlid		    = dest->dlid;
> +	attr->ah_attr.sl		    = dest->sl;
> +	attr->ah_attr.src_path_bits	    = dest->src_path_bits;
> +	attr->ah_attr.static_rate	    = dest->static_rate;
> +	attr->ah_attr.ah_flags		    = dest->is_global ?
> +					      IB_AH_GRH : 0;
> +	attr->ah_attr.port_num		    = dest->port_num;
> +
> +	memcpy(attr->alt_ah_attr.grh.dgid.raw, alt_dest->dgid, 16);
> +	attr->alt_ah_attr.grh.flow_label    = alt_dest->flow_label;
> +	attr->alt_ah_attr.grh.sgid_index    = alt_dest->sgid_index;
> +	attr->alt_ah_attr.grh.hop_limit     = alt_dest->hop_limit;
> +	attr->alt_ah_attr.grh.traffic_class = alt_dest->traffic_class;
> +	attr->alt_ah_attr.dlid		    = alt_dest->dlid;
> +	attr->alt_ah_attr.sl		    = alt_dest->sl;
> +	attr->alt_ah_attr.src_path_bits     = alt_dest->src_path_bits;
> +	attr->alt_ah_attr.static_rate       = alt_dest->static_rate;
> +	attr->alt_ah_attr.ah_flags	    = alt_dest->is_global
> +					      ? IB_AH_GRH : 0;
> +	attr->alt_ah_attr.port_num	    = alt_dest->port_num;
> +}
> +
> +
>  ssize_t ib_uverbs_modify_qp(struct ib_uverbs_file *file,
>  			    const char __user *buf, int in_len,
>  			    int out_len)
> @@ -1926,51 +1980,13 @@ ssize_t ib_uverbs_modify_qp(struct 
> ib_uverbs_file *file,
>  		goto out;
>  	}
> 
> -	attr->qp_state 		  = cmd.qp_state;
> -	attr->cur_qp_state 	  = cmd.cur_qp_state;
> -	attr->path_mtu 		  = cmd.path_mtu;
> -	attr->path_mig_state 	  = cmd.path_mig_state;
> -	attr->qkey 		  = cmd.qkey;
> -	attr->rq_psn 		  = cmd.rq_psn;
> -	attr->sq_psn 		  = cmd.sq_psn;
> -	attr->dest_qp_num 	  = cmd.dest_qp_num;
> -	attr->qp_access_flags 	  = cmd.qp_access_flags;
> -	attr->pkey_index 	  = cmd.pkey_index;
> -	attr->alt_pkey_index 	  = cmd.alt_pkey_index;
> -	attr->en_sqd_async_notify = cmd.en_sqd_async_notify;
> -	attr->max_rd_atomic 	  = cmd.max_rd_atomic;
> -	attr->max_dest_rd_atomic  = cmd.max_dest_rd_atomic;
> -	attr->min_rnr_timer 	  = cmd.min_rnr_timer;
> -	attr->port_num 		  = cmd.port_num;
> -	attr->timeout 		  = cmd.timeout;
> -	attr->retry_cnt 	  = cmd.retry_cnt;
> -	attr->rnr_retry 	  = cmd.rnr_retry;
> -	attr->alt_port_num 	  = cmd.alt_port_num;
> -	attr->alt_timeout 	  = cmd.alt_timeout;
> -
> -	memcpy(attr->ah_attr.grh.dgid.raw, cmd.dest.dgid, 16);
> -	attr->ah_attr.grh.flow_label        = cmd.dest.flow_label;
> -	attr->ah_attr.grh.sgid_index        = cmd.dest.sgid_index;
> -	attr->ah_attr.grh.hop_limit         = cmd.dest.hop_limit;
> -	attr->ah_attr.grh.traffic_class     = cmd.dest.traffic_class;
> -	attr->ah_attr.dlid 	    	    = cmd.dest.dlid;
> -	attr->ah_attr.sl   	    	    = cmd.dest.sl;
> -	attr->ah_attr.src_path_bits 	    = cmd.dest.src_path_bits;
> -	attr->ah_attr.static_rate   	    = cmd.dest.static_rate;
> -	attr->ah_attr.ah_flags 	    	    = cmd.dest.is_global ? IB_AH_GRH : 
> 0;
> -	attr->ah_attr.port_num 	    	    = cmd.dest.port_num;
> -
> -	memcpy(attr->alt_ah_attr.grh.dgid.raw, cmd.alt_dest.dgid, 16);
> -	attr->alt_ah_attr.grh.flow_label    = cmd.alt_dest.flow_label;
> -	attr->alt_ah_attr.grh.sgid_index    = cmd.alt_dest.sgid_index;
> -	attr->alt_ah_attr.grh.hop_limit     = cmd.alt_dest.hop_limit;
> -	attr->alt_ah_attr.grh.traffic_class = cmd.alt_dest.traffic_class;
> -	attr->alt_ah_attr.dlid 	    	    = cmd.alt_dest.dlid;
> -	attr->alt_ah_attr.sl   	    	    = cmd.alt_dest.sl;
> -	attr->alt_ah_attr.src_path_bits     = cmd.alt_dest.src_path_bits;
> -	attr->alt_ah_attr.static_rate       = cmd.alt_dest.static_rate;
> -	attr->alt_ah_attr.ah_flags 	    = cmd.alt_dest.is_global ? IB_AH_GRH 
> : 0;
> -	attr->alt_ah_attr.port_num 	    = cmd.alt_dest.port_num;
> +	ib_uverbs_modify_qp_assign(&cmd, attr, &cmd.dest, &cmd.alt_dest);
> +	memset(attr->ah_attr.dmac, 0, sizeof(attr->ah_attr.dmac));
> +	attr->vlan_id = 0xFFFF;
> +	memset(attr->smac, 0, sizeof(attr->smac));
> +	memset(attr->alt_ah_attr.dmac, 0, sizeof(attr->alt_ah_attr.dmac));
> +	attr->alt_vlan_id = 0xFFFF;
> +	memset(attr->alt_smac, 0, sizeof(attr->alt_smac));
> 
>  	if (qp->real_qp == qp) {
>  		ret = qp->device->modify_qp(qp, attr,
> @@ -1992,6 +2008,111 @@ out:
>  	return ret;
>  }
> 
> +ssize_t ib_uverbs_modify_qp_ex(struct ib_uverbs_file *file,
> +			       const char __user *buf, int in_len,
> +			       int out_len)
> +{
> +	struct ib_uverbs_modify_qp_ex cmd;
> +	struct ib_uverbs_qp_dest_ex   dest_ex;
> +	struct ib_uverbs_qp_dest_ex   alt_dest_ex;
> +	struct ib_uverbs_qp_dest     *dest;
> +	struct ib_uverbs_qp_dest     *alt_dest;
> +	struct ib_udata               udata;
> +	struct ib_qp                 *qp;
> +	struct ib_qp_attr	     *attr;
> +	int                           ret;
> +
> +	if (copy_from_user(&cmd, buf, sizeof(cmd)))
> +		return -EFAULT;
> +
> +	INIT_UDATA(&udata, buf + sizeof(cmd), NULL, in_len - sizeof(cmd),
> +		   out_len);
> +
> +	attr = kmalloc(sizeof(*attr), GFP_KERNEL);
> +	if (!attr)
> +		return -ENOMEM;
> +
> +	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
> +	if (!qp) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (!(cmd.comp_mask & IB_UVERBS_MODIFY_QP_EX_DEST_EX) ||
> +	    copy_from_user(&dest_ex, cmd.dest_ex, sizeof(dest_ex)))
> +		dest = &cmd.dest;
> +	else
> +		dest = (struct ib_uverbs_qp_dest *)&dest_ex;
> +

Errors from copy_from_user() must be returned to the user as -EFAULT.
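
A sketch of the shape being asked for; release_qp is a hypothetical
unwind label that would do put_qp_read(qp) before the existing out:
cleanup:

	if (cmd.comp_mask & IB_UVERBS_MODIFY_QP_EX_DEST_EX) {
		if (copy_from_user(&dest_ex, cmd.dest_ex, sizeof(dest_ex))) {
			ret = -EFAULT;
			goto release_qp;
		}
		dest = (struct ib_uverbs_qp_dest *)&dest_ex;
	} else {
		dest = &cmd.dest;
	}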




> +ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
> +			    const char __user *buf, int in_len,
> +			    int out_len)
> +{
> +	struct ib_uverbs_create_ah_ex	 cmd_ex;
> +	struct ib_uverbs_create_ah	*cmd = (struct ib_uverbs_create_ah *)
> +					       ((void *)&cmd_ex +
> +						sizeof(cmd_ex.comp_mask));

What?!

It's not aligned.
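
Concretely: on an LP64 ABI the __u64 response member forces 8-byte
alignment, so the compiler pads struct ib_uverbs_create_ah_ex after
comp_mask and the legacy fields start at offset 8, while the cast above
assumes they start at offset 4. A compile-time assertion along these
lines would have caught it:

	BUILD_BUG_ON(offsetof(struct ib_uverbs_create_ah_ex, response) !=
		     sizeof(((struct ib_uverbs_create_ah_ex *)0)->comp_mask));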

> 
>  static void ib_uverbs_add_one(struct ib_device *device);
> diff --git a/drivers/infiniband/core/uverbs_marshall.c
> b/drivers/infiniband/core/uverbs_marshall.c
> index e7bee46..7f7a7e2 100644
> --- a/drivers/infiniband/core/uverbs_marshall.c
> +++ b/drivers/infiniband/core/uverbs_marshall.c
> @@ -31,6 +31,7 @@
>   */
> 
>  #include <linux/export.h>
> +#include <linux/etherdevice.h>
>  #include <rdma/ib_marshall.h>
> 
>  void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst,
> @@ -52,6 +53,17 @@ void ib_copy_ah_attr_to_user(struct 
> ib_uverbs_ah_attr *dst,
>  }
>  EXPORT_SYMBOL(ib_copy_ah_attr_to_user);
> 
> +void ib_copy_ah_attr_to_user_ex(struct ib_uverbs_ah_attr_ex *dst,
> +				struct ib_ah_attr *src)
> +{
> +	dst->comp_mask = 0;
> +	ib_copy_ah_attr_to_user((struct ib_uverbs_ah_attr *)
> +				dst, src);
> +	dst->comp_mask |= IB_UVERBS_AH_ATTR_DMAC;
> +	memcpy(dst->dmac, src->dmac, sizeof(dst->dmac));
> +}
> +EXPORT_SYMBOL(ib_copy_ah_attr_to_user_ex);
> +
>  void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst,
>  			     struct ib_qp_attr *src)
>  {
> @@ -65,15 +77,15 @@ void ib_copy_qp_attr_to_user(struct 
> ib_uverbs_qp_attr *dst,
>  	dst->dest_qp_num	= src->dest_qp_num;
>  	dst->qp_access_flags	= src->qp_access_flags;
> 
> +	ib_copy_ah_attr_to_user(&dst->ah_attr, &src->ah_attr);
> +	ib_copy_ah_attr_to_user(&dst->alt_ah_attr, &src->alt_ah_attr);
> +
>  	dst->max_send_wr	= src->cap.max_send_wr;
>  	dst->max_recv_wr	= src->cap.max_recv_wr;
>  	dst->max_send_sge	= src->cap.max_send_sge;
>  	dst->max_recv_sge	= src->cap.max_recv_sge;
>  	dst->max_inline_data	= src->cap.max_inline_data;
> 
> -	ib_copy_ah_attr_to_user(&dst->ah_attr, &src->ah_attr);
> -	ib_copy_ah_attr_to_user(&dst->alt_ah_attr, &src->alt_ah_attr);
> -
>  	dst->pkey_index		= src->pkey_index;
>  	dst->alt_pkey_index	= src->alt_pkey_index;
>  	dst->en_sqd_async_notify = src->en_sqd_async_notify;
> @@ -91,6 +103,63 @@ void ib_copy_qp_attr_to_user(struct 
> ib_uverbs_qp_attr *dst,
>  }
>  EXPORT_SYMBOL(ib_copy_qp_attr_to_user);
> 
> +void ib_copy_qp_attr_to_user_ex(struct ib_uverbs_qp_attr_ex *dst,
> +				struct ib_qp_attr *src)
> +{
> +	struct ib_uverbs_ah_attr_ex ah_attr_ex;
> +	struct ib_uverbs_ah_attr_ex alt_ah_attr_ex;
> +
> +	ib_copy_qp_attr_to_user((struct ib_uverbs_qp_attr *)
> +				dst, src);
> +	if (dst->comp_mask & IB_UVERBS_QP_ATTR_AH_EX &&
> +	    !is_zero_ether_addr(src->ah_attr.dmac)) {
> +		ib_copy_ah_attr_to_user_ex(&ah_attr_ex, &src->ah_attr);
> +		if (dst->ah_attr_ex == NULL ||
> +		    copy_to_user(dst->ah_attr_ex, &ah_attr_ex,
> +				 sizeof(ah_attr_ex)))

copy_to_user() error should be returned to user.

> +			dst->comp_mask &= ~IB_UVERBS_QP_ATTR_AH_EX;
> +	} else {
> +		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_AH_EX;
> +	}
> +	if (dst->comp_mask & IB_UVERBS_QP_ATTR_ALT_AH_EX &&
> +	    !is_zero_ether_addr(src->alt_ah_attr.dmac)) {
> +		ib_copy_ah_attr_to_user_ex(&alt_ah_attr_ex, &src->alt_ah_attr);
> +		if (dst->alt_ah_attr_ex == NULL ||
> +		    copy_to_user(dst->alt_ah_attr_ex, &alt_ah_attr_ex,
> +				 sizeof(alt_ah_attr_ex)))

copy_to_user() error should be returned to user.

> +			dst->comp_mask &= ~IB_UVERBS_QP_ATTR_ALT_AH_EX;
> +	} else {
> +		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_ALT_AH_EX;
> +	}
> +	if (dst->comp_mask & IB_UVERBS_QP_ATTR_SMAC &&
> +	    !is_zero_ether_addr(src->smac))
> +		memcpy(dst->smac, src->smac, sizeof(dst->smac));
> +	else
> +		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_SMAC;
> +	if (dst->comp_mask & IB_UVERBS_QP_ATTR_ALT_SMAC &&
> +	    !is_zero_ether_addr(src->alt_smac))
> +		memcpy(dst->alt_smac, src->alt_smac, sizeof(dst->alt_smac));
> +	else
> +		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_ALT_SMAC;
> +
> +	if (dst->comp_mask & IB_UVERBS_QP_ATTR_VID &&
> +	    src->vlan_id != 0xFFFF)
> +		dst->vlan_id = src->vlan_id;
> +	else
> +		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_VID;
> +	if (dst->comp_mask & IB_UVERBS_QP_ATTR_ALT_VID &&
> +	    src->alt_vlan_id != 0xFFFF)
> +		dst->alt_vlan_id = src->alt_vlan_id;
> +	else
> +		dst->comp_mask &= ~IB_UVERBS_QP_ATTR_ALT_VID;
> +}
> +EXPORT_SYMBOL(ib_copy_qp_attr_to_user_ex);
> +
>  void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst,
>  			      struct ib_sa_path_rec *src)
>  {


> diff --git a/include/uapi/rdma/ib_user_sa.h 
> b/include/uapi/rdma/ib_user_sa.h
> index cfc7c9b..8a24e10 100644
> --- a/include/uapi/rdma/ib_user_sa.h
> +++ b/include/uapi/rdma/ib_user_sa.h
> @@ -48,7 +48,13 @@ enum {
>  struct ib_path_rec_data {
>  	__u32	flags;
>  	__u32	reserved;
> -	__u32	path_rec[16];
> +	__u32	path_rec[20];
> +};
> +
> +enum ib_user_path_rec_attr_mask {
> +	IB_USER_PATH_REC_ATTR_DMAC = 1ULL << 0,
> +	IB_USER_PATH_REC_ATTR_SMAC = 1ULL << 1,
> +	IB_USER_PATH_REC_ATTR_VID  = 1ULL << 2
>  };
> 
>  struct ib_user_path_rec {
> @@ -73,4 +79,30 @@ struct ib_user_path_rec {
>  	__u8	preference;
>  };
> 
> +struct ib_user_path_rec_ex {
> +	__u8	dgid[16];
> +	__u8	sgid[16];
> +	__be16	dlid;
> +	__be16	slid;
> +	__u32	raw_traffic;
> +	__be32	flow_label;
> +	__u32	reversible;
> +	__u32	mtu;
> +	__be16	pkey;
> +	__u8	hop_limit;
> +	__u8	traffic_class;
> +	__u8	numb_path;
> +	__u8	sl;
> +	__u8	mtu_selector;
> +	__u8	rate_selector;
> +	__u8	rate;
> +	__u8	packet_life_time_selector;
> +	__u8	packet_life_time;
> +	__u8	preference;
> +	__u32   comp_mask;
> +	__u8	smac[ETH_ALEN];
> +	__u8	dmac[ETH_ALEN];
> +	__u16	vlan_id;
> +};
> +
>  #endif /* IB_USER_SA_H */
> diff --git a/include/uapi/rdma/ib_user_verbs.h
> b/include/uapi/rdma/ib_user_verbs.h
> index 0b233c5..f1939cf 100644
> --- a/include/uapi/rdma/ib_user_verbs.h
> +++ b/include/uapi/rdma/ib_user_verbs.h
> @@ -37,6 +37,7 @@
>  #define IB_USER_VERBS_H
> 
>  #include <linux/types.h>
> +#include <linux/if_ether.h>
> 
>  /*
>   * Increment this value if any changes that break userspace ABI
> @@ -88,7 +89,9 @@ enum {
>  	IB_USER_VERBS_CMD_CREATE_XSRQ,
>  	IB_USER_VERBS_CMD_OPEN_QP,
>  	IB_USER_VERBS_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_THRESHOLD,
> -	IB_USER_VERBS_CMD_DESTROY_FLOW
> +	IB_USER_VERBS_CMD_DESTROY_FLOW,
> +	IB_USER_VERBS_CMD_MODIFY_QP_EX,
> +	IB_USER_VERBS_CMD_CREATE_AH_EX
>  };
> 
>  /*
> @@ -394,6 +397,24 @@ struct ib_uverbs_ah_attr {
>  	__u8  reserved;
>  };
> 
> +enum ib_uverbs_ah_attr_mask {
> +	IB_UVERBS_AH_ATTR_DMAC = 1 << 0,
> +	IB_UVERBS_AH_ATTR_VID  = 1 << 1
> +};
> +
> +struct ib_uverbs_ah_attr_ex {
> +	struct ib_uverbs_global_route grh;
> +	__u16 dlid;
> +	__u8  sl;
> +	__u8  src_path_bits;
> +	__u8  static_rate;
> +	__u8  is_global;
> +	__u8  port_num;
> +	__u8  reserved;
> +	__u32 comp_mask;
> +	__u8  dmac[ETH_ALEN];
> +};
> +
>  struct ib_uverbs_qp_attr {
>  	__u32	qp_attr_mask;
>  	__u32	qp_state;
> @@ -432,6 +453,65 @@ struct ib_uverbs_qp_attr {
>  	__u8	reserved[5];
>  };
> 
> +enum ib_uverbs_qp_attr_mask {
> +	IB_UVERBS_QP_ATTR_AH_EX		= 1 << 0,
> +	IB_UVERBS_QP_ATTR_ALT_AH_EX	= 1 << 1,
> +	IB_UVERBS_QP_ATTR_SMAC		= 1 << 2,
> +	IB_UVERBS_QP_ATTR_ALT_SMAC	= 1 << 3,
> +	IB_UVERBS_QP_ATTR_VID		= 1 << 4,
> +	IB_UVERBS_QP_ATTR_ALT_VID	= 1 << 5,
> +};
> +
> +struct ib_uverbs_qp_attr_ex {
> +	__u32	qp_attr_mask;
> +	__u32	qp_state;
> +	__u32	cur_qp_state;
> +	__u32	path_mtu;
> +	__u32	path_mig_state;
> +	__u32	qkey;
> +	__u32	rq_psn;
> +	__u32	sq_psn;
> +	__u32	dest_qp_num;
> +	__u32	qp_access_flags;
> +
> +	/* deprecated for extension */
> +	struct ib_uverbs_ah_attr ah_attr;
> +	/* deprecated for extension */
> +	struct ib_uverbs_ah_attr alt_ah_attr;
> +
> +	/* ib_qp_cap */
> +	__u32	max_send_wr;
> +	__u32	max_recv_wr;
> +	__u32	max_send_sge;
> +	__u32	max_recv_sge;
> +	__u32	max_inline_data;
> +
> +	__u16	pkey_index;
> +	__u16	alt_pkey_index;
> +	__u8	en_sqd_async_notify;
> +	__u8	sq_draining;
> +	__u8	max_rd_atomic;
> +	__u8	max_dest_rd_atomic;
> +	__u8	min_rnr_timer;
> +	__u8	port_num;
> +	__u8	timeout;
> +	__u8	retry_cnt;
> +	__u8	rnr_retry;
> +	__u8	alt_port_num;
> +	__u8	alt_timeout;
> +	__u8	reserved[5];
> +
> +	__u32   comp_mask;
> +	/* represents: struct ib_uverbs_ah_attr_ex * __user */
> +	void __user *ah_attr_ex;
> +	/* represents: struct ib_uverbs_ah_attr_ex * __user */
> +	void __user *alt_ah_attr_ex;
> +	__u8	smac[ETH_ALEN];
> +	__u8	alt_smac[ETH_ALEN];
> +	__u16	vlan_id;
> +	__u16	alt_vlan_id;
> +};
> +
>  struct ib_uverbs_create_qp {
>  	__u64 response;
>  	__u64 user_handle;
> @@ -492,6 +572,27 @@ struct ib_uverbs_qp_dest {
>  	__u8  port_num;
>  };
> 
> +enum ib_uverbs_qp_dest_attr_mask {
> +	IB_UVERBS_QP_DEST_ATTR_DMAC = 1ULL << 0
> +};
> +
> +struct ib_uverbs_qp_dest_ex {
> +	__u8  dgid[16];
> +	__u32 flow_label;
> +	__u16 dlid;
> +	__u16 reserved;
> +	__u8  sgid_index;
> +	__u8  hop_limit;
> +	__u8  traffic_class;
> +	__u8  sl;
> +	__u8  src_path_bits;
> +	__u8  static_rate;
> +	__u8  is_global;
> +	__u8  port_num;
> +	__u32 comp_mask;
> +	__u8  dmac[ETH_ALEN];
> +};
> +
>  struct ib_uverbs_query_qp {
>  	__u64 response;
>  	__u32 qp_handle;
> @@ -563,6 +664,54 @@ struct ib_uverbs_modify_qp {
>  	__u64 driver_data[0];
>  };
> 
> +enum ib_uverbs_modify_qp_ex_comp_mask {
> +	IB_UVERBS_MODIFY_QP_EX_SMAC		      = (1ULL << 0),
> +	IB_UVERBS_MODIFY_QP_EX_ALT_SMAC		      = (1ULL << 1),
> +	IB_UVERBS_MODIFY_QP_EX_VID		      = (1ULL << 2),
> +	IB_UVERBS_MODIFY_QP_EX_ALT_VID		      = (1ULL << 3),
> +	IB_UVERBS_MODIFY_QP_EX_DEST_EX		      = (1ULL << 4),
> +	IB_UVERBS_MODIFY_QP_EX_ALT_DEST_EX	      = (1ULL << 5)
> +};
> +
> +struct ib_uverbs_modify_qp_ex {
> +	__u32 comp_mask;
> +	struct ib_uverbs_qp_dest dest;
> +	struct ib_uverbs_qp_dest alt_dest;
> +	__u32 qp_handle;
> +	__u32 attr_mask;
> +	__u32 qkey;
> +	__u32 rq_psn;
> +	__u32 sq_psn;
> +	__u32 dest_qp_num;
> +	__u32 qp_access_flags;
> +	__u16 pkey_index;
> +	__u16 alt_pkey_index;
> +	__u8  qp_state;
> +	__u8  cur_qp_state;
> +	__u8  path_mtu;
> +	__u8  path_mig_state;
> +	__u8  en_sqd_async_notify;
> +	__u8  max_rd_atomic;
> +	__u8  max_dest_rd_atomic;
> +	__u8  min_rnr_timer;
> +	__u8  port_num;
> +	__u8  timeout;
> +	__u8  retry_cnt;
> +	__u8  rnr_retry;
> +	__u8  alt_port_num;
> +	__u8  alt_timeout;
> +	__u8  reserved[2];
> +	__u8  smac[ETH_ALEN];
> +	__u8  alt_smac[ETH_ALEN];
> +	__u16 vid;
> +	__u16 alt_vid;
> +	/* represents: struct ib_uverbs_qp_dest_ex * __user */
> +	void __user *dest_ex;
> +	/* represents: struct ib_uverbs_qp_dest_ex * __user */
> +	void __user *alt_dest_ex;
> +	__u64 driver_data[0];
> +};
> +
>  struct ib_uverbs_modify_qp_resp {
>  };
> 
> @@ -672,6 +821,15 @@ struct ib_uverbs_create_ah {
>  	struct ib_uverbs_ah_attr attr;
>  };
> 
> +struct ib_uverbs_create_ah_ex {
> +	__u32 comp_mask;


This creates a hole or introduces unaligned access, depending on the
64-bit ABI.

The way this structure is used to map a struct ib_uverbs_create_ah seems 
wrong.

> +	__u64 response;
> +	__u64 user_handle;
> +	__u32 pd_handle;
> +	__u32 reserved;
> +	struct ib_uverbs_ah_attr_ex attr;
> +};
> +
>  struct ib_uverbs_create_ah_resp {
>  	__u32 ah_handle;
>  };


I haven't dug too deeply into the other new structures, but they
certainly deserve a thorough review.
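
One possible hole-free layout, sketched here rather than taken from the
patch, keeps the 64-bit members first and folds comp_mask into the old
reserved slot, so the structure packs identically on 32-bit and 64-bit
ABIs:

	struct ib_uverbs_create_ah_ex {
		__u64 response;
		__u64 user_handle;
		__u32 pd_handle;
		__u32 comp_mask;	/* takes the place of 'reserved' */
		struct ib_uverbs_ah_attr_ex attr;
	};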

Regards.

-- 
Yann Droneaud
OPTEYA



* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]         ` <26c47667e463e65dd79caaa4bddc437b-zgzEX58YAwA@public.gmane.org>
@ 2013-09-11 11:32           ` Or Gerlitz
       [not found]             ` <523054BA.2040608-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-11 11:32 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

On 11/09/2013 12:52, Yann Droneaud wrote:
> On 10.09.2013 16:41, Or Gerlitz wrote:
>> +static ssize_t ucma_init_qp_attr_ex(struct ucma_file *file,
>> +                    const char __user *inbuf,
>> +                    int in_len, int out_len)
>> +{
>> +    struct rdma_ucm_init_qp_attr_ex cmd;
>> +    struct ib_uverbs_qp_attr_ex resp;
>> +    struct ucma_context *ctx;
>> +    struct ib_qp_attr qp_attr;
>> +    int ret;
>> +
>> +    if (out_len < sizeof(resp))
>> +        return -ENOSPC;
>> +
>> +    if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
>> +        return -EFAULT;
>> +
>> +    if (copy_from_user(&resp, (void __user *)(unsigned 
>> long)cmd.response,
>> +               sizeof(resp)))
>> +        return -EFAULT;
>> +
>
>
> Reading from the response buffer ? I haven't seen that before in 
> IB/core before.

The intent here is to use copy_from_user just to make sure the user space
provided buffer has enough room to hold the kernel response structure.
Since this command may be extended in the future without bumping the
overall uverbs ABI version, we wanted to add this extra protection.
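
A more explicit variant of the same idea, purely as a sketch, would have
userspace declare its buffer size instead of relying on a read probe;
resp_size is a hypothetical field that does not exist in the posted ABI:

	if (cmd.resp_size < sizeof(resp))
		return -EINVAL;		/* buffer declared too small */

	/* ... build resp as the patch does ... */

	if (copy_to_user((void __user *)(unsigned long)cmd.response,
			 &resp, sizeof(resp)))
		ret = -EFAULT;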


* Re: [PATCH V4 7/9] IB/core: Add RoCE IP based addressing extensions for uverbs
       [not found]         ` <6d494aa8d403e0c50b16f09fbd2c3ab6-zgzEX58YAwA@public.gmane.org>
@ 2013-09-11 11:38           ` Or Gerlitz
       [not found]             ` <52305632.1030604-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-11 11:38 UTC (permalink / raw)
  To: meuh-zgzEX58YAwA
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

On 11/09/2013 13:06, meuh-zgzEX58YAwA@public.gmane.org wrote:
>> @@ -672,6 +821,15 @@ struct ib_uverbs_create_ah {
>>      struct ib_uverbs_ah_attr attr;
>>  };
>>
>> +struct ib_uverbs_create_ah_ex {
>> +    __u32 comp_mask;
>
>
> This creates a hole or introduces unaligned access, depending on the
> 64-bit ABI.

Matan is OOO for a couple of days, but this seems to me like a good catch.

Anyway, note that the current upstream providers of RoCE (ocrdma, mlx4)
don't support IB_USER_VERBS_CMD_CREATE_AH. This means their user space
libraries need not access the kernel for AH creation, so we will drop the
extended version of this command from the patch series altogether to
reduce the volume of kernel changes and ease the review. Thanks for the
feedback so far!

Or.


>
> The way this structure is used to map a struct ib_uverbs_create_ah 
> seems wrong.
>
>> +    __u64 response;
>> +    __u64 user_handle;
>> +    __u32 pd_handle;
>> +    __u32 reserved;
>> +    struct ib_uverbs_ah_attr_ex attr;
>> +};
>> +
>>  struct ib_uverbs_create_ah_resp {
>>      __u32 ah_handle;
>>  }; 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]             ` <523054BA.2040608-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-11 12:36               ` Yann Droneaud
       [not found]                 ` <97104d76028c356b458509ce95b08c92-zgzEX58YAwA@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Yann Droneaud @ 2013-09-11 12:36 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

Le 11.09.2013 13:32, Or Gerlitz a écrit :
> On 11/09/2013 12:52, Yann Droneaud wrote:
>> Le 10.09.2013 16:41, Or Gerlitz a écrit :
>>> +static ssize_t ucma_init_qp_attr_ex(struct ucma_file *file,
>>> +                    const char __user *inbuf,
>>> +                    int in_len, int out_len)
>>> +{
>>> +    struct rdma_ucm_init_qp_attr_ex cmd;
>>> +    struct ib_uverbs_qp_attr_ex resp;
>>> +    struct ucma_context *ctx;
>>> +    struct ib_qp_attr qp_attr;
>>> +    int ret;
>>> +
>>> +    if (out_len < sizeof(resp))
>>> +        return -ENOSPC;
>>> +
>>> +    if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
>>> +        return -EFAULT;
>>> +
>>> +    if (copy_from_user(&resp, (void __user *)(unsigned 
>>> long)cmd.response,
>>> +               sizeof(resp)))
>>> +        return -EFAULT;
>>> +
>> 
>> 
>> Reading from the response buffer ? I haven't seen that before in 
>> IB/core before.
> 
> The intent here is to use copy_from_user just to make sure the user space
> provided buffer has enough room to hold the kernel response structure.
> Since this command may be extended in the future without bumping the
> overall uverbs ABI version, we wanted to add this extra protection.

It's checking nothing ... you should not assume that the user's buffer
happens to end on a page boundary, such that the kernel would detect the
end of the buffer when trying to read from it.

BTW, out_len is already checked against the resp size ... so I don't
understand yet.

Regards.

-- 
Yann Droneaud
OPTEYA


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 7/9] IB/core: Add RoCE IP based addressing extensions for uverbs
       [not found]             ` <52305632.1030604-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-11 12:42               ` Yann Droneaud
  0 siblings, 0 replies; 38+ messages in thread
From: Yann Droneaud @ 2013-09-11 12:42 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

Le 11.09.2013 13:38, Or Gerlitz a écrit :
> On 11/09/2013 13:06, meuh-zgzEX58YAwA@public.gmane.org wrote:
>>> @@ -672,6 +821,15 @@ struct ib_uverbs_create_ah {
>>>      struct ib_uverbs_ah_attr attr;
>>>  };
>>> 
>>> +struct ib_uverbs_create_ah_ex {
>>> +    __u32 comp_mask;
>> 
>> 
>> This creates a hole or introduces unaligned access, depending on the
>> 64-bit ABI.
> 
> Matan is OOO for a couple of days, but this seems to me like a good catch.
> 

The way struct ib_uverbs_create_ah_ex encompasses struct
ib_uverbs_create_ah should be clearer (explicitly having struct
ib_uverbs_create_ah as part of struct ib_uverbs_create_ah_ex?).

Specifically, I don't like the way pointer math is used to access the
struct ib_uverbs_create_ah from a struct ib_uverbs_create_ah_ex pointer.

> Anyway, note that the current upstream providers of RoCE (ocrdma, mlx4)
> don't support IB_USER_VERBS_CMD_CREATE_AH. This means their user space
> libraries need not access the kernel for AH creation, so we will drop
> the extended version of this command from the patch series altogether
> to reduce the volume of kernel changes and ease the review. Thanks for
> the feedback so far!
> 

Thanks for taking time to address my concerns.

Regards.

-- 
Yann Droneaud
OPTEYA


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]     ` <1378824099-22150-10-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-12  5:26       ` Devesh Sharma
       [not found]         ` <CAGgPuS1tAiyA3TZ5_fpua3ue6JrZ9ruS+O+QU-7t28i0dZ7cUw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Devesh Sharma @ 2013-09-12  5:26 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

Hi Or,

I don't see any patches in the librdmacm/libibverbs git trees that call
the _EX versions of the uverbs commands. The patches you pointed to in
the v4 0/9 cover letter still seem to be incomplete. Are these broken?

On Tue, Sep 10, 2013 at 8:11 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> mlx4_ib driver should indicate that it supports
> MODIFY_QP_EX user verbs extended command.
>
> Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/infiniband/hw/mlx4/main.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
> index 7a29ad5..77c87d0 100644
> --- a/drivers/infiniband/hw/mlx4/main.c
> +++ b/drivers/infiniband/hw/mlx4/main.c
> @@ -1755,7 +1755,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
>                 (1ull << IB_USER_VERBS_CMD_QUERY_SRQ)           |
>                 (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ)         |
>                 (1ull << IB_USER_VERBS_CMD_CREATE_XSRQ)         |
> -               (1ull << IB_USER_VERBS_CMD_OPEN_QP);
> +               (1ull << IB_USER_VERBS_CMD_OPEN_QP)             |
> +               (1ull << IB_USER_VERBS_CMD_MODIFY_QP_EX);
>
>         ibdev->ib_dev.query_device      = mlx4_ib_query_device;
>         ibdev->ib_dev.query_port        = mlx4_ib_query_port;
> --
> 1.7.1
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]         ` <CAGgPuS1tAiyA3TZ5_fpua3ue6JrZ9ruS+O+QU-7t28i0dZ7cUw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-12 10:45           ` Or Gerlitz
       [not found]             ` <52319B38.5070807-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-12 10:45 UTC (permalink / raw)
  To: Devesh Sharma
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

On 12/09/2013 08:26, Devesh Sharma wrote:
> I don't see any patches to librdmacm/libibverbs git to call _EX version of uverbs commands.

We've posted the kernel patches; that should be enough for the review.
If you have any specific questions regarding the user space aspects of
this series, feel free to send them now.

> The patch you have pointed in v4-0 patch still seems to be incomplete. Are these broken?

I don't understand the question. During the review of V4 we were pointed
to a part missing in patch 5/9, and this will be fixed in V5, sure.

Or.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]             ` <52319B38.5070807-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-12 11:31               ` Devesh Sharma
  2013-09-12 12:24                 ` Or Gerlitz
  2013-09-12 11:46               ` Devesh Sharma
  1 sibling, 1 reply; 38+ messages in thread
From: Devesh Sharma @ 2013-09-12 11:31 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

Inline Below:

On Thu, Sep 12, 2013 at 4:15 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 12/09/2013 08:26, Devesh Sharma wrote:
>>
>> I don't see any patches to librdmacm/libibverbs git to call _EX version of
>> uverbs commands.
>
>
> We've posted the kernel patches, that should be enough for the review. If
> you have any specific questions re user
> space aspects of this series, feel free to send them now.

Yes! For kernel space I see the above set of patches will work fine
without any issues. On the other hand, if from user space some
application tries to establish a connection using RDMACM, the driver
will receive the dmac and vlan_id fields as zeros, because
libibverbs/librdmacm still do not call the _EX versions of the UVERBS/UCM
commands introduced in this set of patches (7/9, 8/9). So, for example,
if I try to run ib_send_bw with -R, traffic will not run!!

So what are the plans to add these changes to the libibverbs/librdmacm
libraries?
                 OR
is there some flaw in my understanding that librdmacm/libibverbs need
changes in order to use the newly proposed scheme? Please clarify.

-Regards
 Devesh
>
>
>> The patch you have pointed in v4-0 patch still seems to be incomplete. Are
>> these broken?
>
>
> I don't understand the question. During the review of V4 we were pointed to
> a part missing in patch 5/9
> and this will be fixed in V5, sure.
>
> Or.
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]             ` <52319B38.5070807-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-09-12 11:31               ` Devesh Sharma
@ 2013-09-12 11:46               ` Devesh Sharma
  1 sibling, 0 replies; 38+ messages in thread
From: Devesh Sharma @ 2013-09-12 11:46 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

Inline below on the second question:

On Thu, Sep 12, 2013 at 4:15 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 12/09/2013 08:26, Devesh Sharma wrote:
>>
>> I don't see any patches to librdmacm/libibverbs git to call _EX version of
>> uverbs commands.
>
>
> We've posted the kernel patches, that should be enough for the review. If
> you have any specific questions re user
> space aspects of this series, feel free to send them now.
>
>
>> The patch you have pointed in v4-0 patch still seems to be incomplete. Are
>> these broken?
>
>
> I don't understand the question. During the review of V4 we were pointed to
> a part missing in patch 5/9
> and this will be fixed in V5, sure.
Yes, 5/9 misses a part related to populating the gid table during load
time. Well, I was mostly concerned about the user space apps; with the
current git of libibverbs/librdmacm, user apps will fail to perform data
transfer operations.
After digging more into the linux-rdma git I found the completed set of
patches for flow steering, which introduces extension commands in kernel
space. I am still looking for the corresponding patches for
libibverbs/librdmacm.

-Regards
 Devesh Sharma

>
> Or.
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
  2013-09-12 11:31               ` Devesh Sharma
@ 2013-09-12 12:24                 ` Or Gerlitz
       [not found]                   ` <5231B28E.4090605-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-12 12:24 UTC (permalink / raw)
  To: Devesh Sharma
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w

On 12/09/2013 14:31, Devesh Sharma wrote:
> On Thu, Sep 12, 2013 at 4:15 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> We've posted the kernel patches, that should be enough for the review. If you have any specific questions re user space aspects of this series, feel free to send them now.
> Yes! For kernel space I see the above set of patches will work fine
> without any issues. On the other hand, if from user space some
> application tries to establish a connection using RDMACM, the driver
> will receive the dmac and vlan_id fields as zeros, because
> libibverbs/librdmacm still do not call the _EX versions of the UVERBS/UCM
> commands introduced in this set of patches (7/9, 8/9). So, for example,
> if I try to run ib_send_bw with -R, traffic will not run!!
>
> So what are the plans to add these changes to the libibverbs/librdmacm
> libraries?
>                   OR
> is there some flaw in my understanding that librdmacm/libibverbs need
> changes in order to use the newly proposed scheme? Please clarify.
>

Let me clarify this. The idea is that current RoCE applications will run
as is after they update "their" librdmacm, since it's this library that
works with the new uverbs entries.

Note that the RoCE stack assumes the existence of an Ethernet device for
the specific vendor alongside their IB driver, and this series takes
another tiny step in assuming this device has an IP address configured;
hence RoCE applications are expected to use librdmacm.

If the underlying device/port runs IB, it's business as usual; the
patches / extended commands need not come into play, and the related
non-extended uverbs commands all keep working as they did before the
patches. This is per port, that is, if port1 is IB and port2 is Eth, QPs
on port1 can be worked with the older uverbs commands.

As for your question, ib_send_bw -R uses the rdma-cm and hence will not 
need to change, since the rdma-cm
is doing the qp modifications.

Or.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]                   ` <5231B28E.4090605-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-12 17:22                     ` Jason Gunthorpe
       [not found]                       ` <20130912172252.GA4611-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2013-09-12 17:22 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Devesh Sharma, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w

On Thu, Sep 12, 2013 at 03:24:46PM +0300, Or Gerlitz wrote:

> Let me clarify this. The idea is that current RoCE applications will
> run as is after they update "their" librdmacm, since it's this
> library that works with the new uverbs entries.

Or, we are not supposed to break userspace. You can't insist that a
user space library be updated in-sync with the kernel.

Can you please think of a way to retain compatibility?

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]                 ` <97104d76028c356b458509ce95b08c92-zgzEX58YAwA@public.gmane.org>
@ 2013-09-17 10:02                   ` Matan Barak
       [not found]                     ` <5238289D.40608-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Matan Barak @ 2013-09-17 10:02 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w

On 11/9/2013 3:36 PM, Yann Droneaud wrote:
> Le 11.09.2013 13:32, Or Gerlitz a écrit :
>> On 11/09/2013 12:52, Yann Droneaud wrote:
>>> Le 10.09.2013 16:41, Or Gerlitz a écrit :
>>>> +static ssize_t ucma_init_qp_attr_ex(struct ucma_file *file,
>>>> +                    const char __user *inbuf,
>>>> +                    int in_len, int out_len)
>>>> +{
>>>> +    struct rdma_ucm_init_qp_attr_ex cmd;
>>>> +    struct ib_uverbs_qp_attr_ex resp;
>>>> +    struct ucma_context *ctx;
>>>> +    struct ib_qp_attr qp_attr;
>>>> +    int ret;
>>>> +
>>>> +    if (out_len < sizeof(resp))
>>>> +        return -ENOSPC;
>>>> +
>>>> +    if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
>>>> +        return -EFAULT;
>>>> +
>>>> +    if (copy_from_user(&resp, (void __user *)(unsigned
>>>> long)cmd.response,
>>>> +               sizeof(resp)))
>>>> +        return -EFAULT;
>>>> +
>>>
>>>
>>> Reading from the response buffer ? I haven't seen that before in
>>> IB/core before.
>>
>> The intent here is to use copy_from_user just to make sure the user space
>> provided buffer has enough room to hold the kernel response structure.
>> Since this command may be extended in the future without bumping the
>> overall uverbs ABI version, we wanted to add this extra protection.
>
> It's checking nothing ... you should not assume that the user's buffer
> happens to end on a page boundary, such that the kernel would detect
> the end of the buffer when trying to read from it.
>
> BTW, out_len is already checked against the resp size ... so I don't
> understand yet.

That's right - we're not checking anything here.
struct ib_uverbs_qp_attr_ex contains 2 user pointers:
+       /* represents: struct ib_uverbs_ah_attr_ex * __user */
+       void __user *ah_attr_ex;
+       /* represents: struct ib_uverbs_ah_attr_ex * __user */
+       void __user *alt_ah_attr_ex;

Those pointers should be given to rdma_init_qp_attr in order to 
initialize them.

We are using pointers here in order to maintain future extendability of 
the address handle structures.
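
To make the data flow concrete, a minimal sketch of how such an output
pointer would be serviced on the kernel side (an illustration only;
fill_ah_attr_ex() is a hypothetical helper, copy_to_user is the real
kernel API):

	struct ib_uverbs_ah_attr_ex kattr;
	void __user *dst = (void __user *)(unsigned long)resp.ah_attr_ex;

	/* fill a kernel-local copy, then write it back into the buffer
	 * the user allocated and whose address it passed down */
	fill_ah_attr_ex(&kattr);	/* hypothetical */
	if (copy_to_user(dst, &kattr, sizeof(kattr)))
		return -EFAULT;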

>
> Regards.
>
Regards

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]                     ` <5238289D.40608-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-17 10:25                       ` Yann Droneaud
       [not found]                         ` <bcec9d3a9a72ed1d612a4dd49b670800-zgzEX58YAwA@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Yann Droneaud @ 2013-09-17 10:25 UTC (permalink / raw)
  To: Matan Barak
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w

Hi,

Le 17.09.2013 12:02, Matan Barak a écrit :
> 
> That's right - we're not checking anything here.
> struct ib_uverbs_qp_attr_ex contains 2 user pointers:
> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
> +       void __user *ah_attr_ex;
> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
> +       void __user *alt_ah_attr_ex;
> 
> Those pointers should be given to rdma_init_qp_attr in order to 
> initialize them.
> 
> We are using pointers here in order to maintain future extendability
> of the address handle structures.
> 

First: you can't put pointers as-is in a public data structure. Look at
all the other "command" structures declared in
include/uapi/rdma/ib_user_verbs.h.

Second: if you're duplicating struct ib_uverbs_qp_attr, why not include
it in struct ib_uverbs_qp_attr_ex? It would reduce maintenance burden,
clutter, etc. Or drop the unused fields in the _ex version while taking
care to be at least equal in size to the original version.

The same applies to struct ib_uverbs_modify_qp_ex, but struct
ib_uverbs_modify_qp_ex has the comp_mask as its first field (introducing
a hole).

Regards.

-- 
Yann Droneaud
OPTEYA


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]                         ` <bcec9d3a9a72ed1d612a4dd49b670800-zgzEX58YAwA@public.gmane.org>
@ 2013-09-17 15:13                           ` Matan Barak
       [not found]                             ` <523871A2.8010109-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Matan Barak @ 2013-09-17 15:13 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w

On 17/9/2013 1:25 PM, Yann Droneaud wrote:
> Hi,
>
> Le 17.09.2013 12:02, Matan Barak a écrit :
>>
>> That's right - we're not checking anything here.
>> struct ib_uverbs_qp_attr_ex contains 2 user pointers:
>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>> +       void __user *ah_attr_ex;
>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>> +       void __user *alt_ah_attr_ex;
>>
>> Those pointers should be given to rdma_init_qp_attr in order to
>> initialize them.
>>
>> We are using pointers here in order to maintain future extendability
>> of the address handle structures.
>>
>
> First: you can't put pointers as-is in a public data structure. Look at
> all the other "command" structures declared in
> include/uapi/rdma/ib_user_verbs.h.

Thanks for the review. Looking at other commands, I see that pointers
(such as the response) are passed as __u64 in the command structure. Is
that what you mean? I think it's a bit odd to pass those pointers as part
of the command, as they are output-only attributes. Still, I'll change
the code to use __u64 instead of the actual __user pointers.
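
For reference, the __u64 idiom in question, as used by the existing
commands in ib_user_verbs.h (a sketch with placeholder names):

	/* uapi: a user pointer travels as __u64 so that 32-bit and
	 * 64-bit userspace share a single command layout */
	struct some_cmd {		/* placeholder name */
		__u64 response;		/* really a user pointer */
	};

	/* userspace side */
	cmd.response = (uintptr_t)&resp;

	/* kernel side: cast back before use */
	if (copy_to_user((void __user *)(unsigned long)cmd.response,
			 &resp, sizeof(resp)))
		return -EFAULT;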

>
> Second: if you're duplicating struct ib_uverbs_qp_attr, why not include
> it in struct ib_uverbs_qp_attr_ex? It would reduce maintenance burden,
> clutter, etc. Or drop the unused fields in the _ex version while taking
> care to be at least equal in size to the original version.

The extension verbs approach should be an evolution. Instead of using a
cumbersome extended.basic.field notation, extended.field is more compact
and readable. Using this notation will allow us to deprecate the basic
structures when they are no longer in use.
Since the basic structures shouldn't change anymore, as user applications
rely on them, I don't see a maintainability burden in doing it this way.

>
> The same applies to struct ib_uverbs_modify_qp_ex, but struct
> ib_uverbs_modify_qp_ex has the comp_mask as its first field (introducing
> a hole).

We'll add a reserved field to fix this hole. Thanks for that catch!

>
> Regards.
>
Regards

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]                             ` <523871A2.8010109-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-17 15:43                               ` Yann Droneaud
       [not found]                                 ` <8bb85d86eca247afa5786b7c7e4c737a-zgzEX58YAwA@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Yann Droneaud @ 2013-09-17 15:43 UTC (permalink / raw)
  To: Matan Barak
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w

Le 17.09.2013 17:13, Matan Barak a écrit :
> On 17/9/2013 1:25 PM, Yann Droneaud wrote:
>> Hi,
>> 
>> Le 17.09.2013 12:02, Matan Barak a écrit :
>>> 
>>> That's right - we're not checking anything here.
>>> struct ib_uverbs_qp_attr_ex contains 2 user pointers:
>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>> +       void __user *ah_attr_ex;
>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>> +       void __user *alt_ah_attr_ex;
>>> 
>>> Those pointers should be given to rdma_init_qp_attr in order to
>>> initialize them.
>>> 
>>> We are using pointers here in order to maintain future extendability
>>> of the address handle structures.
>>> 
>> 
>> First: you can't put pointers as-is in a public data structure. Look at
>> all the other "command" structures declared in
>> include/uapi/rdma/ib_user_verbs.h.
> 
> Thanks for the review. Looking at other commands, I see that pointers
> (such as the response) are passed as __u64 in the command structure.

Indeed. That way 32-bit and 64-bit binaries use the same layout.

> Is that what you mean? I think it's a bit odd to pass those pointers as
> part of the command, as they are output-only attributes. Still, I'll
> change the code to use __u64 instead of the actual __user pointers.
> 

How can those pointers be output parameters? Does the kernel allocate
some pages and put their addresses in the ah_attr_ex and alt_ah_attr_ex
pointers?

I don't buy the "maintain future extendability" argument for such
specific cruft. It's not generic enough to fall into the extensible
pattern.

>> 
>> Second: if you're duplicating struct ib_uverbs_qp_attr, why not 
>> include
>> it in struct ib_uverbs_qp_attr_ex
>> it will reduce maintenance burden, clutter, etc.
>> Or drop the unused fields in the _ex version while taking care of 
>> being
>> at least equal size than
>> the original version.
> 
> The extension verbs approach should be an evolution. Instead of using
> a cumbersome extended.basic.field notation, extended.field is more
> compact and readable. Using this notation will allow us to deprecate
> the basic structures when they are no longer in use.
> Since the basic structures shouldn't change anymore, as user
> applications rely on them, I don't see a maintainability burden in
> doing it this way.
> 

Oh ... I'm waiting to see qp_attr being deprecated</sarcasm>

>> 
>> The same applies to struct ib_uverbs_modify_qp_ex, but struct
>> ib_uverbs_modify_qp_ex has the comp_mask as its first field
>> (introducing a hole).
> 
> We'll add a reserved field to fix this hole. Thanks for that catch!
> 

Why not put that field after the struct ib_uverbs_modify_qp field?
(Even though I don't like the comp_mask things, I'm waiting for a
real-world example.)

Regards.

-- 
Yann Droneaud
OPTEYA


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]                       ` <20130912172252.GA4611-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2013-09-17 17:50                         ` Roland Dreier
  0 siblings, 0 replies; 38+ messages in thread
From: Roland Dreier @ 2013-09-17 17:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, Devesh Sharma, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	monis, matanb

On Thu, Sep 12, 2013 at 10:22 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> On Thu, Sep 12, 2013 at 03:24:46PM +0300, Or Gerlitz wrote:
> > Let me clarify this. The idea is that current RoCE applications will
> > run as is after they update "their" librdmacm, since it's this
> > library that works with the new uverbs entries.

> Or, we are not supposed to break userspace. You can't insist that a
> user space library be updated in-sync with the kernel.

Agree.  This "IP based addressing" for RoCE looks like a big problem
at the moment.  Let me reiterate my understanding, and you guys can
correct me if I get something wrong:

 - the current addressing scheme is broken for virtualization use cases,
because VMs may not know what VLANs are in use (also, there are issues
around bonding modes that use different Ethernet addresses)

 - proposed change requires:

   * all systems must update kernel at the same time, because old and
new kernels cannot talk to each other

   * all systems must update librdmacm when they update the kernel,
because old librdmacm does not work with new kernel

I understand that we want to fix the issue around VLAN tagged traffic
from VMs, but I don't see how we can break the whole stack to
accomplish that.  Isn't there some incremental way forward?

 - R.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]                                 ` <8bb85d86eca247afa5786b7c7e4c737a-zgzEX58YAwA@public.gmane.org>
@ 2013-09-18  8:40                                   ` Matan Barak
       [not found]                                     ` <52396719.4050809-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Matan Barak @ 2013-09-18  8:40 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w

On 17/9/2013 6:43 PM, Yann Droneaud wrote:
> Le 17.09.2013 17:13, Matan Barak a écrit :
>> On 17/9/2013 1:25 PM, Yann Droneaud wrote:
>>> Hi,
>>>
>>> Le 17.09.2013 12:02, Matan Barak a écrit :
>>>>
>>>> That's right - we're not checking anything here.
>>>> struct ib_uverbs_qp_attr_ex contains 2 user pointers:
>>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>>> +       void __user *ah_attr_ex;
>>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>>> +       void __user *alt_ah_attr_ex;
>>>>
>>>> Those pointers should be given to rdma_init_qp_attr in order to
>>>> initialize them.
>>>>
>>>> We are using pointers here in order to maintain future extendability
>>>> of the address handle structures.
>>>>
>>>
>>> First: you can't put pointer asis in public data structure. Look to all
>>> others "command" structures
>>> declared in include/uapi/rdma/ib_user_verbs.h
>>
>> Thanks for the review. Looking at other commands, I see that pointers
>> (such a the response) are passed as __u64 at the command structure.
>
> Indeed. That way 32-bit and 64-bit binaries use the same layout.
>
>> Is that what you mean ? I think it's a bit odd to pass those pointers as
>> a part of the command, as they are output only attributes. Though,
>> I'll change the code to use __u64 instead of the actual __user
>> pointers.
>>
>
> How can those pointers be output parameters? Does the kernel allocate
> some pages and put their addresses in the ah_attr_ex and
> alt_ah_attr_ex pointers?

No, the user allocates them and passes the addresses to the kernel. The 
kernel fills the user-allocated space. It works the exact same way as 
the response structure.

>
> I don't buy the "maintain future extendability" argument for such
> specific cruft.
> It's not generic enough to fall into the extensible pattern.

I don't see why.
struct inner1 {
	... some fields ...
};
struct inner2 {
	... some fields ...
};
struct outer {
	... some fields ...
	struct inner1 in1;
	struct inner2 in2;
	... some fields ...
};

Now let's say we need to extend inner1 with a new field at its bottom.
If inner1 were inlined, outer's memory layout would change, and we'd get
problems between a new user and an old kernel. That's a general problem
that can be solved easily by using:
struct outer {
	... some fields ...
	struct inner1 *in1;
	struct inner2 *in2;
	... some fields ...
};

Since we're using __u64 to pass pointers, instead of struct inner[1-2]*
we could use __u64. That's ok.
In our case the only usage of in[1-2] is as output parameters; the user
allocates space only to let the kernel fill it for him.
That's why I think putting it as part of the response is more appropriate.

>
>>>
>>> Second: if you're duplicating struct ib_uverbs_qp_attr, why not include
>>> it in struct ib_uverbs_qp_attr_ex
>>> it will reduce maintenance burden, clutter, etc.
>>> Or drop the unused fields in the _ex version while taking care of being
>>> at least equal size than
>>> the original version.
>>
>> The extension verbs approach should be an evolution. Instead of using
>> a cumbersome extended.basic.field notation, extended.field is more
>> compact and readable. Using this notation will allow us to deprecate
>> the basic structures when they won't be in use anymore.
>> Since the basic structures shouldn't change anymore as user
>> applications relay on them, I don't see a burden in maintainability
>> doing it this way.
>>
>
> Oh ... I'm waiting to see qp_attr being deprecated</sarcasm>

Wasn't 386 arch deprecated after ~20 years?

>
>>>
>>> Same apply to struct ib_uverbs_modify_qp_ex, but struct
>>> ib_uverbs_modify_qp_ex has the comp_mask as first field (introducing a
>>> hole).
>>
>> We'll add a reserved field to fix this hole. Thanks for that catch!
>>
>
> Why not put that field after the struct ib_uverbs_modify_qp field?
> (Even though I don't like the comp_mask things, I'm waiting for a
> real-world example.)

Since every struct starts with a u32 comp_mask, I suggest putting the
required reserved field right after this comp_mask. I've reviewed this
case again, and I don't think a reserved field is needed here. The
largest field in struct ib_uverbs_qp_dest is a u32; hence, the struct
will be 32-bit aligned. Since struct ib_uverbs_qp_dest shouldn't change
anymore, as that would change the memory layout, there's no reason to
force struct ib_uverbs_qp_dest to be 64-bit aligned.
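
A small sketch of the alignment reasoning above (illustrative stand-in
structs, not the uapi definitions):

	struct dest_like {		/* widest member is 32 bits ... */
		__u32 flow_label;
		__u8  dgid[16];
		__u16 dlid;
	};				/* ... so its alignment is 4 */

	struct modify_like {
		__u32 comp_mask;
		struct dest_like dest;	/* no hole after a lone __u32 */
	};

	struct with_u64 {
		__u32 comp_mask;	/* 4-byte hole follows ... */
		__u64 response;		/* ... __u64 wants 8-byte alignment */
	};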

>
> Regards.
>
Regards

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]                                     ` <52396719.4050809-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-09-18 10:07                                       ` Yann Droneaud
       [not found]                                         ` <698ad99050d7ece7bac8a591e4318f45-zgzEX58YAwA@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Yann Droneaud @ 2013-09-18 10:07 UTC (permalink / raw)
  To: Matan Barak
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w

Hi,

Le 18.09.2013 10:40, Matan Barak a écrit :
> On 17/9/2013 6:43 PM, Yann Droneaud wrote:
>> Le 17.09.2013 17:13, Matan Barak a écrit :
>>> On 17/9/2013 1:25 PM, Yann Droneaud wrote:
>>>> Hi,
>>>> 
>>>> Le 17.09.2013 12:02, Matan Barak a écrit :
>>>>> 
>>>>> That's right - we're not checking anything here.
>>>>> struct ib_uverbs_qp_attr_ex contains 2 user pointers:
>>>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>>>> +       void __user *ah_attr_ex;
>>>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>>>> +       void __user *alt_ah_attr_ex;
>>>>> 
>>>>> Those pointers should be given to rdma_init_qp_attr in order to
>>>>> initialize them.
>>>>> 
>>>>> We are using pointers here in order to maintain future 
>>>>> extendability
>>>>> of the address handle structures.
>>>>> 
>>>> 
>>>> First: you can't put pointer asis in public data structure. Look to 
>>>> all
>>>> others "command" structures
>>>> declared in include/uapi/rdma/ib_user_verbs.h
>>> 
>>> Thanks for the review. Looking at other commands, I see that pointers
>>> (such a the response) are passed as __u64 at the command structure.
>> 
>> Indeed. So that 32bits and 64bits binaries use the same layout.
>> 
>>> Is that what you mean ? I think it's a bit odd to pass those pointers 
>>> as
>>> a part of the command, as they are output only attributes. Though,
>>> I'll change the code to use __u64 instead of the actual __user
>>> pointers.
>>> 
>> 
>> How can those pointers be output parameters ? Does kernel allocate 
>> some
>> pages
>> and putting the addresses of those in the ah_attr_ex and 
>> alt_ah_attr_ex
>> pointers ?
> 
> No, the user allocates them and passes the addresses to the kernel.
> The kernel fills the user-allocated space. It works the exact same way
> as the response structure.
> 

So it's not "odd to pass those pointers as part of the command" :)
The command structure holds all the information needed to process it.

I've seen that the proposed command has pointers to other user memory
buffers that are not part of the write() operation ... just like some
ucma commands :(

This makes the kernel do userspace pointer chasing ... it's not good
from a maintainability point of view: it's adding complexity.

>> 
>> I don't buy the "maintain future extendability" argument for such
>> specific cruft.
>> It's not enough generic to fall in the extensible pattern.
> 
> I don't see why.
> struct inner1 {
> ... some fields ...
> };
> struct inner2 {
> ... some fields ...
> };
> struct outer {
> ... some fields ...
> struct inner1 in1;
> struct inner2 in2;
> ... some fields ...
> };
> 
> Now lets say we need to extend inner1 with a new field in its bottom.
> If inner1 was inlined, outer memory layout would be changed and we'll
> get problems with new user <-> old kernel. That's a general problem
> that can be solved easily by using:
> struct outer {
> ... some fields ...
> struct inner1 *in1;
> struct inner2 *in2;
> ... some fields ...
> };
> 
> Since we're using __u64 to pass pointers, instead of struct
> inner[1-2]* we could use __u64. That's ok.
> In our case the only usage of in[1-2] is as output parameters. The
> user allocates space only to let the kernel fills it for him.
> That's why I think putting it as a part of the response is more 
> appropriate.
> 

[ Currently "response" is write only. It's cleaner that way ]

The scheme presented, as I read it here, is starting to look like a
forest of pointers, a recursive octopus... and I don't like those.

I think of something like NETLINK and IOVEC as a better scheme (quite
like a TLV thing). But using such a scheme would introduce a new ABI ...
and come with its own set of complexity problems.
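
For reference, the TLV framing alluded to here is what netlink already
does; struct nlattr below is the real uapi header from
include/uapi/linux/netlink.h, while the parsing loop and handle_attr()
are only a sketch:

	struct nlattr {			/* netlink attribute header */
		__u16 nla_len;		/* length, including this header */
		__u16 nla_type;
	};

	/* sketch: walk a flat, self-describing buffer of attributes
	 * instead of chasing user pointers */
	size_t pos = 0;
	while (pos + sizeof(struct nlattr) <= total_len) {
		struct nlattr *a = (struct nlattr *)(buf + pos);

		if (a->nla_len < sizeof(*a))
			break;				/* malformed */
		handle_attr(a->nla_type, a + 1,		/* hypothetical */
			    a->nla_len - sizeof(*a));
		pos += (a->nla_len + 3) & ~3;		/* NLA_ALIGN(len) */
	}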

>>>> 
>>>> Same apply to struct ib_uverbs_modify_qp_ex, but struct
>>>> ib_uverbs_modify_qp_ex has the comp_mask as first field (introducing 
>>>> a
>>>> hole).
>>> 
>>> We'll add a reserved field to fix this hole. Thanks for that catch!
>>> 
>> 
>> Why not putting that field after the struct ib_uverbs_modify_qp field 
>> ?
>> (even if I don't like the comp_mask things, I waiting for a real world
>> example).
> 
> Since every struct starts with a u32 comp_mask, I suggest putting the
> required reserved field right after this comp_mask. I've reviewed this
> case again, and I don't think a reserved field is needed here. The
> largest field in struct ib_uverbs_qp_dest is a u32; hence, the struct
> will be 32-bit aligned. Since struct ib_uverbs_qp_dest shouldn't change
> anymore, as that would change the memory layout, there's no reason
> to force struct ib_uverbs_qp_dest to be 64-bit aligned.

Yes you're right.

Perhaps I was thinking of ib_uverbs_create_ah_ex?

Regards.

-- 
Yann Droneaud
OPTEYA


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]                                         ` <698ad99050d7ece7bac8a591e4318f45-zgzEX58YAwA@public.gmane.org>
@ 2013-09-22  7:32                                           ` Matan Barak
       [not found]                                             ` <523E9D06.8050804-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Matan Barak @ 2013-09-22  7:32 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w

On 18/9/2013 1:07 PM, Yann Droneaud wrote:
> Hi,
>
> Le 18.09.2013 10:40, Matan Barak a écrit :
>> On 17/9/2013 6:43 PM, Yann Droneaud wrote:
>>> Le 17.09.2013 17:13, Matan Barak a écrit :
>>>> On 17/9/2013 1:25 PM, Yann Droneaud wrote:
>>>>> Hi,
>>>>>
>>>>> Le 17.09.2013 12:02, Matan Barak a écrit :
>>>>>>
>>>>>> That's right - we're not checking anything here.
>>>>>> struct ib_uverbs_qp_attr_ex contains 2 user pointers:
>>>>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>>>>> +       void __user *ah_attr_ex;
>>>>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>>>>> +       void __user *alt_ah_attr_ex;
>>>>>>
>>>>>> Those pointers should be given to rdma_init_qp_attr in order to
>>>>>> initialize them.
>>>>>>
>>>>>> We are using pointers here in order to maintain future extendability
>>>>>> of the address handle structures.
>>>>>>
>>>>>
>>>>> First: you can't put pointer asis in public data structure. Look to
>>>>> all
>>>>> others "command" structures
>>>>> declared in include/uapi/rdma/ib_user_verbs.h
>>>>
>>>> Thanks for the review. Looking at other commands, I see that pointers
>>>> (such a the response) are passed as __u64 at the command structure.
>>>
>>> Indeed. So that 32bits and 64bits binaries use the same layout.
>>>
>>>> Is that what you mean ? I think it's a bit odd to pass those
>>>> pointers as
>>>> a part of the command, as they are output only attributes. Though,
>>>> I'll change the code to use __u64 instead of the actual __user
>>>> pointers.
>>>>
>>>
>>> How can those pointers be output parameters ? Does kernel allocate some
>>> pages
>>> and putting the addresses of those in the ah_attr_ex and alt_ah_attr_ex
>>> pointers ?
>>
>> No, the user allocates them and passes the addresses to the kernel.
>> The kernel fills the user-allocated space. It works the exact same way
>> as the response structure.
>>
>
> So it's not "odd to pass those pointers as part of the command" :)
> The command structure holds all the information needed to process it.
>
> I've seen that the proposed command has pointers to other user memory
> buffers that are not part of the write() operation ... just like some
> ucma commands :(
>
> This makes the kernel do userspace pointer chasing ... it's not good
> from a maintainability point of view: it's adding complexity.
>

The response/cmd buffers are already userspace pointers, so the kernel
is already doing some userspace pointer chasing. Still, it might be
better to use "iovec like" behavior by putting several output buffers in
the command. What do you think about moving ah_attr_ex/alt_ah_attr_ex
into the command?
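
A sketch of the flattened variant being floated here (a single level of
indirection; the exact field set is an assumption, not the posted layout):

	struct rdma_ucm_init_qp_attr_ex {
		__u32 comp_mask;
		__u32 qp_state;
		/* every user buffer is named up front in the command
		 * itself, each carried as a __u64 user pointer */
		__u64 response;		/* struct ib_uverbs_qp_attr_ex */
		__u64 ah_attr_ex;	/* struct ib_uverbs_ah_attr_ex */
		__u64 alt_ah_attr_ex;	/* struct ib_uverbs_ah_attr_ex */
	};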

>>>
>>> I don't buy the "maintain future extendability" argument for such
>>> specific cruft.
>>> It's not enough generic to fall in the extensible pattern.
>>
>> I don't see why.
>> struct inner1 {
>> ... some fields ...
>> };
>> struct inner2 {
>> ... some fields ...
>> };
>> struct outer {
>> ... some fields ...
>> struct inner1 in1;
>> struct inner2 in2;
>> ... some fields ...
>> };
>>
>> Now lets say we need to extend inner1 with a new field in its bottom.
>> If inner1 was inlined, outer memory layout would be changed and we'll
>> get problems with new user <-> old kernel. That's a general problem
>> that can be solved easily by using:
>> struct outer {
>> ... some fields ...
>> struct inner1 *in1;
>> struct inner2 *in2;
>> ... some fields ...
>> };
>>
>> Since we're using __u64 to pass pointers, instead of struct
>> inner[1-2]* we could use __u64. That's ok.
>> In our case the only usage of in[1-2] is as output parameters. The
>> user allocates space only to let the kernel fills it for him.
>> That's why I think putting it as a part of the response is more
>> appropriate.
>>
>
> [ Currently "response" is write only. It's cleaner that way ]
>
> The scheme presented, as I read it here, is starting to look like a
> forest of pointers, a recursive octopus... and I don't like those.
>
> I think of something like NETLINK and IOVEC as a better scheme (quite
> like a TLV thing). But using such a scheme would introduce a new ABI ...
> and come with its own set of complexity problems.
>

Currently, we only have 2 levels of indirection. I don't think other
commands will require the use of pointers. Anyway, moving those pointers
into the command buffer will flatten this tree into one level.
I don't think that using other methods, which currently aren't used in
uverbs, is something we should introduce without carefully examining the
possible implications.

>>>>>
>>>>> Same apply to struct ib_uverbs_modify_qp_ex, but struct
>>>>> ib_uverbs_modify_qp_ex has the comp_mask as first field (introducing a
>>>>> hole).
>>>>
>>>> We'll add a reserved field to fix this hole. Thanks for that catch!
>>>>
>>>
>>> Why not putting that field after the struct ib_uverbs_modify_qp field ?
>>> (even if I don't like the comp_mask things, I waiting for a real world
>>> example).
>>
>> Since every struct starts with u32 comp_mask, I suggest putting
>> required reserved field right after this comp_mask. I've reviewed this
>> case again and I don't think a reserved field is needed here. The
>> largest field in struct ib_uverbs_qp_dest is u32. Hence, the struct
>> will be 32bit aligned. Since struct ib_uverbs_qp_dest shouldn't be
>> changed anymore, as it'll change the memory layout, there's no reason
>> to force struct ib_uverbs_qp_dest to be 64bit aligned.
>
> Yes you're right.
>
> Perhaps I was thinking of ib_uverbs_create_ah_ex?
>

create_ah_ex will be removed in the next version of the IP based
addressing patchset :-)

> Regards.
>
Regards


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm
       [not found]                                             ` <523E9D06.8050804-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-10-27 15:29                                               ` Tzahi Oved
  0 siblings, 0 replies; 38+ messages in thread
From: Tzahi Oved @ 2013-10-27 15:29 UTC (permalink / raw)
  To: Matan Barak, Yann Droneaud
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w

On 22/09/2013 10:32, Matan Barak wrote:
> On 18/9/2013 1:07 PM, Yann Droneaud wrote:
>> Hi,
>>
>> Le 18.09.2013 10:40, Matan Barak a écrit :
>>> On 17/9/2013 6:43 PM, Yann Droneaud wrote:
>>>> Le 17.09.2013 17:13, Matan Barak a écrit :
>>>>> On 17/9/2013 1:25 PM, Yann Droneaud wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Le 17.09.2013 12:02, Matan Barak a écrit :
>>>>>>>
>>>>>>> That's right - we're not checking anything here.
>>>>>>> struct ib_uverbs_qp_attr_ex contains 2 user pointers:
>>>>>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>>>>>> +       void __user *ah_attr_ex;
>>>>>>> +       /* represents: struct ib_uverbs_ah_attr_ex * __user */
>>>>>>> +       void __user *alt_ah_attr_ex;
>>>>>>>
>>>>>>> Those pointers should be given to rdma_init_qp_attr in order to
>>>>>>> initialize them.
>>>>>>>
>>>>>>> We are using pointers here in order to maintain future 
>>>>>>> extendability
>>>>>>> of the address handle structures.
>>>>>>>
>>>>>>
>>>>>> First: you can't put pointer asis in public data structure. Look to
>>>>>> all
>>>>>> others "command" structures
>>>>>> declared in include/uapi/rdma/ib_user_verbs.h
>>>>>
>>>>> Thanks for the review. Looking at other commands, I see that pointers
>>>>> (such a the response) are passed as __u64 at the command structure.
>>>>
>>>> Indeed. So that 32bits and 64bits binaries use the same layout.
>>>>
>>>>> Is that what you mean ? I think it's a bit odd to pass those
>>>>> pointers as
>>>>> a part of the command, as they are output only attributes. Though,
>>>>> I'll change the code to use __u64 instead of the actual __user
>>>>> pointers.
>>>>>
>>>>
>>>> How can those pointers be output parameters ? Does kernel allocate 
>>>> some
>>>> pages
>>>> and putting the addresses of those in the ah_attr_ex and 
>>>> alt_ah_attr_ex
>>>> pointers ?
>>>
>>> No, the user allocates them and passes the addresses to the kernel.
>>> The kernel fills the user-allocated space. It works the exact same way
>>> as the response structure.
>>>
>>
>> So it's not "odd to pass those pointers as part of the command" :)
>> The commands structure hold all the information needed to process it.
>>
>> I've seen that the proposed command has pointers to others user memory
>> buffers
>> not part of the write() operations ... just like some ucma commands :(
>>
>> This make the kernel do userspace pointer chasing ... it's not good
>> from maintainability point of view: it's adding complexity.
>>
>
> The response/cmd buffers are already userspace pointers, so the kernel
> is already doing some userspace pointer chasing. Still, it might be
> better to use "iovec like" behavior by putting several output buffers
> in the command. What do you think about moving
> ah_attr_ex/alt_ah_attr_ex into the command?
>

We'd like to move forward here with submitting the extended kernel
commands scheme, and would like to refrain from over-complicating it. In
this singular case we need one extra level of indirection in the
commands, and we would hate to build a whole new scheme for
nested-pointer commands. As Matan suggested, future pointer usage can be
flattened into the main command struct itself, and thus further nesting
can be avoided.
The use of pointers in commands is dealt with on a per-command basis and
doesn't impact the infrastructure itself, and as you said, this is not
the first time it is used.

>>>>
>>>> I don't buy the "maintain future extendability" argument for such
>>>> specific cruft.
>>>> It's not enough generic to fall in the extensible pattern.
>>>
>>> I don't see why.
>>> struct inner1 {
>>> ... some fields ...
>>> };
>>> struct inner2 {
>>> ... some fields ...
>>> };
>>> struct outer {
>>> ... some fields ...
>>> struct inner1 in1;
>>> struct inner2 in2;
>>> ... some fields ...
>>> };
>>>
>>> Now lets say we need to extend inner1 with a new field in its bottom.
>>> If inner1 was inlined, outer memory layout would be changed and we'll
>>> get problems with new user <-> old kernel. That's a general problem
>>> that can be solved easily by using:
>>> struct outer {
>>> ... some fields ...
>>> struct inner1 *in1;
>>> struct inner2 *in2;
>>> ... some fields ...
>>> };
>>>
>>> Since we're using __u64 to pass pointers, instead of struct
>>> inner[1-2]* we could use __u64. That's ok.
>>> In our case the only usage of in[1-2] is as output parameters. The
>>> user allocates space only to let the kernel fills it for him.
>>> That's why I think putting it as a part of the response is more
>>> appropriate.
>>>
>>
>> [ Currently "response" is write only. It's cleaner that way ]
>>
>> The scheme presented, as I read here, is starting to look like a forest
>> of pointers,
>> a recursive octopus... and I don't like those.
>>
>> I think of something like NETLINK and IOVEC as a better scheme (quite
>> like a TLV things)
>> But using such scheme would introduce a new ABI ... and came with its
>> own set of complexity problems.
>>
>
> Currently, we only have 2 levels of indirection. I don't think other
> commands will require the use of pointers. Anyway, moving those
> pointers into the command buffer will flatten this tree into one level.
> I don't think that using other methods, which currently aren't used in
> uverbs, is something we should introduce without carefully examining
> the possible implications.
>
>>>>>>
>>>>>> Same apply to struct ib_uverbs_modify_qp_ex, but struct
>>>>>> ib_uverbs_modify_qp_ex has the comp_mask as first field 
>>>>>> (introducing a
>>>>>> hole).
>>>>>
>>>>> We'll add a reserved field to fix this hole. Thanks for that catch!
>>>>>
>>>>
>>>> Why not putting that field after the struct ib_uverbs_modify_qp 
>>>> field ?
>>>> (even if I don't like the comp_mask things, I waiting for a real world
>>>> example).
>>>
>>> Since every struct starts with u32 comp_mask, I suggest putting
>>> required reserved field right after this comp_mask. I've reviewed this
>>> case again and I don't think a reserved field is needed here. The
>>> largest field in struct ib_uverbs_qp_dest is u32. Hence, the struct
>>> will be 32bit aligned. Since struct ib_uverbs_qp_dest shouldn't be
>>> changed anymore, as it'll change the memory layout, there's no reason
>>> to force struct ib_uverbs_qp_dest to be 64bit aligned.
>>
>> Yes you're right.
>>
>> Perhaps I was thinkig of ib_uverbs_create_ah_ex ?
>>
>
> create_ah_ex will be removed in the next version of the IP based
> addressing patchset :-)
>
>> Regards.
>>
> Regards
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]     ` <52480568.8000801-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-10-02 15:09       ` Devesh Sharma
@ 2013-10-10 21:26       ` Or Gerlitz
  1 sibling, 0 replies; 38+ messages in thread
From: Or Gerlitz @ 2013-10-10 21:26 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Jason Gunthorpe, Devesh Sharma,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Moni Shoua, Matan Barak

On Sun, Sep 29, 2013 at 1:48 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 17/09/2013 23:49, Or Gerlitz wrote:
>> On Tue, Sep 17, 2013 at 8:50 PM, Roland Dreier wrote:
>>> On Thu, Sep 12, 2013 at 10:22 AM, Jason Gunthorpe wrote:
>>>> On Thu, Sep 12, 2013 at 03:24:46PM +0300, Or Gerlitz wrote:

>>>>> Let me clarify this. The idea is that current RoCE applications will
>>>>> run as is after they update "their" librdmacm, since it's this
>>>>> library that works with the new uverbs entries.

>>>> Or, we are not supposed to break userspace. You can't insist that a
>>>> user space library be updated in-sync with the kernel.

>>> Agree.  This "IP based addressing" for RoCE looks like a big problem
>>> at the moment.  Let me reiterate my understanding, and you guys can
>>> correct me if I get something wrong:

>>>   - current addressing scheme is broken for virtualization use cases,
>>> because VMs may not know which VLANs are in use.  (also there are
>>> issues around bonding modes that use different Ethernet addresses)
>>
>> The current addressing is actually broken for vlan use cases, both
>> native and virtualized: for the virt case, because of the argument you
>> mentioned; for the native case, when one node is connected to an
>> Ethernet edge switch acting in access mode (that is, the switch does
>> vlan insertion/stripping) and the other node handles vlans by itself.
>> Each one will form a different GID for the other party.
>>
>>>   - proposed change requires:
>>>     * all systems must update the kernel at the same time, because
>>> old and new kernels cannot talk to each other
>>>     * all systems must update librdmacm when they update the kernel,
>>> because old librdmacm does not work with the new kernel
>>>
>>> I understand that we want to fix the issue around VLAN tagged traffic
>>> from VMs, but I don't see how we can break the whole stack to
>>> accomplish that.  Isn't there some incremental way forward?
>>
>> To begin with, we don't break the whole stack -- using the current
>> patch set, for ports whose link is IB, all is business as usual, and
>> this holds at per-port resolution; that is, if for a given device one
>> port is IB and one port Eth, existing librdmacm keeps working on the
>> IB port.
>>
>> Another fact to throw into the mix is that SRIOV VMs don't have RoCE
>> now (not supported upstream). Actually we're holding off on submitting
>> the SRIOV RoCE patches b/c of the breakage with the current scheme -->
>> no need for backward compatibility here either. The vast majority, if
>> not all, of the cloud use cases we are aware of that would use RoCE
>> need VST and need it to work right.
>>
>> With vlans being broken already, I would say we need first and
>> foremost to fix that, and only/maybe later worry about backward
>> compatibility for the few native mode use cases that somehow manage to
>> work around the buggy GID format when they use vlans.
>>
>> As for those who don't use vlans, which is also rare, since RoCE works
>> best over a lossless channel, typically achieved using PFC over a
>> vlan... we can use the fact that the IP based addressing patches
>> configure both the interface IPv4 and IPv6 addresses into the GID
>> table.
>>
>> Now, the IPv6 link-local address is actually also plugged into the GID
>> table by nodes running the old code, since this is how the non-vlan
>> MAC based GID is constructed. Using this fact, we can allow:
>>
>> 1. the patched kernel to work with non-updated user space, as long as
>> it uses the GID which relates to an IPv6 link-local address
>>
>> 2. a node running the "old" code to talk with a "new" node over what
>> the old node sees as a non-vlan MAC based GID and the new node sees as
>> an IPv6 link-local GID.
>>
>> Sounds better?
>>
>>
>
> Hi Roland, ping, I have written a detailed reply to your concerns and heard
> no word from you except on the "begin with" part, can you comment? Or.

Roland, it's been almost a month since I replied to your concerns and
so far not a word from you on my core arguments; 3.12 is almost at rc5
and another kernel cycle can be easily lost here.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]         ` <CAGgPuS2791OXo9JrZ030qSn_4Yi777Vw5f8LP1-u2npNKppoKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-10-02 20:01           ` Or Gerlitz
  0 siblings, 0 replies; 38+ messages in thread
From: Or Gerlitz @ 2013-10-02 20:01 UTC (permalink / raw)
  To: Devesh Sharma
  Cc: Or Gerlitz, Roland Dreier, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Moni Shoua, Matan Barak

On Wed, Oct 2, 2013 at 6:09 PM, Devesh Sharma <desh.t2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> One more point I have is: since current applications like
> perftest/qperf/rping/krping do not have code to receive an IPv6
> address, do you have plans to modify these?

rping supports IPv6; as for krping, yes, it needs to be enhanced to
support that too.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]     ` <52480568.8000801-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-10-02 15:09       ` Devesh Sharma
       [not found]         ` <CAGgPuS2791OXo9JrZ030qSn_4Yi777Vw5f8LP1-u2npNKppoKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-10-10 21:26       ` Or Gerlitz
  1 sibling, 1 reply; 38+ messages in thread
From: Devesh Sharma @ 2013-10-02 15:09 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Moni Shoua, Matan Barak

Hi Or,

One more point I have is: since current applications like
perftest/qperf/rping/krping do not have code to receive an IPv6
address, do you have plans to modify these?

On Sun, Sep 29, 2013 at 4:18 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 17/09/2013 23:49, Or Gerlitz wrote:
>>
>> On Tue, Sep 17, 2013 at 8:50 PM, Roland Dreier wrote:
>>>
>>> On Thu, Sep 12, 2013 at 10:22 AM, Jason Gunthorpe wrote:
>>>>
>>>> On Thu, Sep 12, 2013 at 03:24:46PM +0300, Or Gerlitz wrote:
>>>>>
>>>>> Let me clarify this. The idea is that current RoCE applications will
>>>>> run as is after they update "their" librdmacm, since it's this
>>>>> library that works with the new uverbs entries.
>>>>
>>>> Or, we are not supposed to break userspace. You can't insist that a
>>>> user space library be updated in-sync with the kernel.
>>>
>>> Agree.  This "IP based addressing" for RoCE looks like a big problem
>>> at the moment.  Let me reiterate my understanding, and you guys can
>>> correct me if I get something wrong:
>>>
>>>   - current addressing scheme is broken for virtualization use cases,
>>> because VMs may not know which VLANs are in use.  (also there are
>>> issues around bonding modes that use different Ethernet addresses)
>>
>> The current addressing is actually broken for vlan use cases, both
>> native and virtualized: for the virt case, because of the argument you
>> mentioned; for the native case, when one node is connected to an
>> Ethernet edge switch acting in access mode (that is, the switch does
>> vlan insertion/stripping) and the other node handles vlans by itself.
>> Each one will form a different GID for the other party.
>>
>>>   - proposed change requires:
>>>     * all systems must update the kernel at the same time, because
>>> old and new kernels cannot talk to each other
>>>     * all systems must update librdmacm when they update the kernel,
>>> because old librdmacm does not work with the new kernel
>>>
>>> I understand that we want to fix the issue around VLAN tagged traffic
>>> from VMs, but I don't see how we can break the whole stack to
>>> accomplish that.  Isn't there some incremental way forward?
>>
>> To begin with, we don't break the whole stack -- using the current
>> patch set, for ports whose link is IB, all is business as usual, and
>> this holds at per-port resolution; that is, if for a given device one
>> port is IB and one port Eth, existing librdmacm keeps working on the
>> IB port.
>>
>> Another fact to throw into the mix is that SRIOV VMs don't have RoCE
>> now (not supported upstream). Actually we're holding off on submitting
>> the SRIOV RoCE patches b/c of the breakage with the current scheme -->
>> no need for backward compatibility here either. The vast majority, if
>> not all, of the cloud use cases we are aware of that would use RoCE
>> need VST and need it to work right.
>>
>> With vlans being broken already, I would say we need first and
>> foremost to fix that, and only/maybe later worry about backward
>> compatibility for the few native mode use cases that somehow manage to
>> work around the buggy GID format when they use vlans.
>>
>> As for those who don't use vlans, which is also rare, since RoCE works
>> best over a lossless channel, typically achieved using PFC over a
>> vlan... we can use the fact that the IP based addressing patches
>> configure both the interface IPv4 and IPv6 addresses into the GID
>> table.
>>
>> Now, the IPv6 link-local address is actually also plugged into the GID
>> table by nodes running the old code, since this is how the non-vlan
>> MAC based GID is constructed. Using this fact, we can allow:
>>
>> 1. the patched kernel to work with non-updated user space, as long as
>> it uses the GID which relates to an IPv6 link-local address
>>
>> 2. a node running the "old" code to talk with a "new" node over what
>> the old node sees as a non-vlan MAC based GID and the new node sees as
>> an IPv6 link-local GID.
>>
>> Sounds better?
>>
>>
>
> Hi Roland, ping, I have written a detailed reply to your concerns and heard
> no word from you except on the "begin with" part, can you comment? Or.
>
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found] ` <CAJZOPZJ_F06xORoQyt-6_SK5P5Y7LXekQuNKHHYSt+oJ8sV1GA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-09-17 23:10   ` Roland Dreier
@ 2013-09-29 10:48   ` Or Gerlitz
       [not found]     ` <52480568.8000801-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-29 10:48 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Jason Gunthorpe, Devesh Sharma,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Moni Shoua, Matan Barak

On 17/09/2013 23:49, Or Gerlitz wrote:
> On Tue, Sep 17, 2013 at 8:50 PM, Roland Dreier wrote:
>> On Thu, Sep 12, 2013 at 10:22 AM, Jason Gunthorpe wrote:
>>> On Thu, Sep 12, 2013 at 03:24:46PM +0300, Or Gerlitz wrote:
>>>> Let me clarify this. The idea is that current RoCE applications will
>>>> run as is after they update "their" librdmacm, since it's this
>>>> library that works with the new uverbs entries.
>>> Or, we are not supposed to break userspace. You can't insist that a
>>> user space library be updated in-sync with the kernel.
>> Agree.  This "IP based addressing" for RoCE looks like a big problem
>> at the moment.  Let me reiterate my understanding, and you guys can
>> correct me if I get something wrong:
>>
>>   - current addressing scheme is broken for virtualization use cases,
>> because VMs may not know which VLANs are in use.  (also there are
>> issues around bonding modes that use different Ethernet addresses)
> The current addressing is actually broken for vlan use cases, both
> native and virtualized: for the virt case, because of the argument you
> mentioned; for the native case, when one node is connected to an
> Ethernet edge switch acting in access mode (that is, the switch does
> vlan insertion/stripping) and the other node handles vlans by itself.
> Each one will form a different GID for the other party.
>
>>   - proposed change requires:
>>     * all systems must update the kernel at the same time, because
>> old and new kernels cannot talk to each other
>>     * all systems must update librdmacm when they update the kernel,
>> because old librdmacm does not work with the new kernel
>>
>> I understand that we want to fix the issue around VLAN tagged traffic
>> from VMs, but I don't see how we can break the whole stack to
>> accomplish that.  Isn't there some incremental way forward?
> To begin with, we don't break the whole stack -- using the current
> patch set, for ports whose link is IB, all is business as usual, and
> this holds at per-port resolution; that is, if for a given device one
> port is IB and one port Eth, existing librdmacm keeps working on the
> IB port.
>
> Another fact to throw into the mix is that SRIOV VMs don't have RoCE
> now (not supported upstream). Actually we're holding off on submitting
> the SRIOV RoCE patches b/c of the breakage with the current scheme -->
> no need for backward compatibility here either. The vast majority, if
> not all, of the cloud use cases we are aware of that would use RoCE
> need VST and need it to work right.
>
> With vlans being broken already, I would say we need first and
> foremost to fix that, and only/maybe later worry about backward
> compatibility for the few native mode use cases that somehow manage to
> work around the buggy GID format when they use vlans.
>
> As for those who don't use vlans, which is also rare, since RoCE works
> best over a lossless channel, typically achieved using PFC over a
> vlan... we can use the fact that the IP based addressing patches
> configure both the interface IPv4 and IPv6 addresses into the GID
> table.
>
> Now, the IPv6 link-local address is actually also plugged into the GID
> table by nodes running the old code, since this is how the non-vlan
> MAC based GID is constructed. Using this fact, we can allow:
>
> 1. the patched kernel to work with non-updated user space, as long as
> it uses the GID which relates to an IPv6 link-local address
>
> 2. a node running the "old" code to talk with a "new" node over what
> the old node sees as a non-vlan MAC based GID and the new node sees as
> an IPv6 link-local GID.
>
> Sounds better?
>
>

Hi Roland, ping, I have written a detailed reply to your concerns and
heard no word from you except on the "begin with" part, can you
comment? Or.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found]     ` <CAG4TOxOtsy+vtmtYciREk0bOC=o9-ME1T=cqvt46CNssCU57zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-18  4:31       ` Or Gerlitz
  0 siblings, 0 replies; 38+ messages in thread
From: Or Gerlitz @ 2013-09-18  4:31 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Jason Gunthorpe, Or Gerlitz, Devesh Sharma,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis, matanb

On Wed, Sep 18, 2013 at 2:10 AM, Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Tue, Sep 17, 2013 at 1:49 PM, Or Gerlitz <or.gerlitz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> To begin with, we don't break the whole stack -- using the current
>> patch set, for ports whose link is IB, all is business as usual, and
>> this holds at per-port resolution; that is, if for a given device one
>> port is IB and one port Eth, existing librdmacm keeps working on the
>> IB port.
>
> Sure, and people using USB webcams and wifi are also unaffected by
> changes to the RoCE stack.   For anyone using RoCE the impact is pretty big.

I see your point and this is why I haven't stopped after the "to begin
with" paragraph... any comment on what I wrote following that?

Or.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
       [not found] ` <CAJZOPZJ_F06xORoQyt-6_SK5P5Y7LXekQuNKHHYSt+oJ8sV1GA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-17 23:10   ` Roland Dreier
       [not found]     ` <CAG4TOxOtsy+vtmtYciREk0bOC=o9-ME1T=cqvt46CNssCU57zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-09-29 10:48   ` Or Gerlitz
  1 sibling, 1 reply; 38+ messages in thread
From: Roland Dreier @ 2013-09-17 23:10 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jason Gunthorpe, Or Gerlitz, Devesh Sharma,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis, matanb

On Tue, Sep 17, 2013 at 1:49 PM, Or Gerlitz <or.gerlitz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> To begin with, we don't break the whole stack -- using the current
> patch set, for ports whose link is IB, all is business as usual, and
> this holds at per-port resolution; that is, if for a given device one
> port is IB and one port Eth, existing librdmacm keeps working on the
> IB port.

Sure, and people using USB webcams and wifi are also unaffected by
changes to the RoCE stack.   For anyone using RoCE the impact is
pretty big.

 - R.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
@ 2013-09-17 20:49 Or Gerlitz
       [not found] ` <CAJZOPZJ_F06xORoQyt-6_SK5P5Y7LXekQuNKHHYSt+oJ8sV1GA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 38+ messages in thread
From: Or Gerlitz @ 2013-09-17 20:49 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Jason Gunthorpe, Or Gerlitz, Devesh Sharma,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis, matanb

On Tue, Sep 17, 2013 at 8:50 PM, Roland Dreier wrote:
> On Thu, Sep 12, 2013 at 10:22 AM, Jason Gunthorpe wrote:
>> On Thu, Sep 12, 2013 at 03:24:46PM +0300, Or Gerlitz wrote:

>>> Let me clarify this. The idea is that current RoCE applications will
>>> run as is after they update "their" librdmacm, since it's this
>>> library that works with the new uverbs entries.

>> Or, we are not supposed to break userspace. You can't insist that a
>> user space library be updated in-sync with the kernel.

> Agree.  This "IP based addressing" for RoCE looks like a big problem
> at the moment.  Let me reiterate my understanding, and you guys can
> correct me if I get something wrong:
>
>  - current addressing scheme is broken for virtualization use cases,
> because VMs may not know which VLANs are in use.  (also there are
> issues around bonding modes that use different Ethernet addresses)

The current addressing is actually broken for vlan use cases, both
native and virtualized: for the virt case, because of the argument you
mentioned; for the native case, when one node is connected to an
Ethernet edge switch acting in access mode (that is, the switch does
vlan insertion/stripping) and the other node handles vlans by itself.
Each one will form a different GID for the other party.

>  - proposed change requires:
>    * all systems must update the kernel at the same time, because old
> and new kernels cannot talk to each other
>    * all systems must update librdmacm when they update the kernel,
> because old librdmacm does not work with the new kernel

> I understand that we want to fix the issue around VLAN tagged traffic
> from VMs, but I don't see how we can break the whole stack to
> accomplish that.  Isn't there some incremental way forward?

To begin with, we don't break the whole stack -- using the current
patch set, for ports whose link is IB, all is business as usual, and
this holds at per-port resolution; that is, if for a given device one
port is IB and one port Eth, existing librdmacm keeps working on the
IB port.

Another fact to throw into the mix is that SRIOV VMs don't have RoCE
now (not supported upstream). Actually we're holding off on submitting
the SRIOV RoCE patches b/c of the breakage with the current scheme -->
no need for backward compatibility here either. The vast majority, if
not all, of the cloud use cases we are aware of that would use RoCE
need VST and need it to work right.

With vlans being broken already, I would say we need first and
foremost to fix that, and only/maybe later worry about backward
compatibility for the few native mode use cases that somehow manage to
work around the buggy GID format when they use vlans.

As for those who don't use vlans, which is also rare, since RoCE works
best over a lossless channel, typically achieved using PFC over a
vlan... we can use the fact that the IP based addressing patches
configure both the interface IPv4 and IPv6 addresses into the GID
table.
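
For illustration, an IPv4 address lands in a 16-byte GID in the
IPv6-mapped form ::ffff:a.b.c.d; a minimal sketch (the helper name and
exact byte handling here are mine, not a quote of the patches):

#include <string.h>

/* build a 16-byte RoCE GID holding an IPv4-mapped IPv6 address */
static void ipv4_to_gid(const unsigned char ipv4[4], unsigned char gid[16])
{
	memset(gid, 0, 16);
	gid[10] = 0xff;			/* the ::ffff:0:0/96 mapped prefix */
	gid[11] = 0xff;
	memcpy(gid + 12, ipv4, 4);	/* last 4 bytes carry a.b.c.d */
}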

Now, the IPv6 link-local address is actually also plugged into the GID
table by nodes running the old code, since this is how the non-vlan
MAC based GID is constructed. Using this fact, we can allow:

1. the patched kernel to work with non-updated user space, as long as
it uses the GID which relates to an IPv6 link-local address

2. a node running the "old" code to talk with a "new" node over what
the old node sees as a non-vlan MAC based GID and the new node sees as
an IPv6 link-local GID.
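
This compatibility window works because both schemes derive the same
bytes from the MAC; a sketch of the shared construction (the function
name is mine, but the byte layout follows the standard EUI-64
MAC-to-IPv6-link-local mapping):

#include <string.h>

/* fe80::/64 prefix plus the EUI-64 interface ID derived from the MAC */
static void mac_to_ll_gid(const unsigned char mac[6], unsigned char gid[16])
{
	memset(gid, 0, 16);
	gid[0] = 0xfe;			/* link-local prefix fe80:: */
	gid[1] = 0x80;
	gid[8] = mac[0] ^ 0x02;		/* flip the universal/local bit */
	gid[9] = mac[1];
	gid[10] = mac[2];
	gid[11] = 0xff;			/* EUI-64 ff:fe filler */
	gid[12] = 0xfe;
	gid[13] = mac[3];
	gid[14] = mac[4];
	gid[15] = mac[5];
}

So an old node's MAC based GID and a new node's IPv6 link-local GID
come out byte-identical, which is what lets the two sides interoperate.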

Sounds better?

Or.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2013-10-27 15:29 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-10 14:41 [PATCH V4 0/9] IP based RoCE GID Addressing Or Gerlitz
     [not found] ` <1378824099-22150-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-10 14:41   ` [PATCH V4 1/9] IB/core: Ethernet L2 attributes in verbs/cm structures Or Gerlitz
2013-09-10 14:41   ` [PATCH V4 2/9] IB/CMA: RoCE IP based GID addressing Or Gerlitz
2013-09-10 14:41   ` [PATCH V4 3/9] IB/mlx4: Use RoCE IP based GIDs in the port GID table Or Gerlitz
2013-09-10 14:41   ` [PATCH V4 4/9] IB/mlx4: Handle Ethernet L2 parameters for IP based GID addressing Or Gerlitz
2013-09-10 14:41   ` [PATCH V4 5/9] IB/ocrdma: Populate GID table with IP based gids Or Gerlitz
2013-09-10 14:41   ` [PATCH V4 6/9] IB/ocrdma: Handle Ethernet L2 parameters for IP based GID addressing Or Gerlitz
2013-09-10 14:41   ` [PATCH V4 7/9] IB/core: Add RoCE IP based addressing extensions for uverbs Or Gerlitz
     [not found]     ` <1378824099-22150-8-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-11 10:06       ` meuh-zgzEX58YAwA
     [not found]         ` <6d494aa8d403e0c50b16f09fbd2c3ab6-zgzEX58YAwA@public.gmane.org>
2013-09-11 11:38           ` Or Gerlitz
     [not found]             ` <52305632.1030604-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-11 12:42               ` Yann Droneaud
2013-09-10 14:41   ` [PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm Or Gerlitz
     [not found]     ` <1378824099-22150-9-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-11  9:52       ` Yann Droneaud
     [not found]         ` <26c47667e463e65dd79caaa4bddc437b-zgzEX58YAwA@public.gmane.org>
2013-09-11 11:32           ` Or Gerlitz
     [not found]             ` <523054BA.2040608-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-11 12:36               ` Yann Droneaud
     [not found]                 ` <97104d76028c356b458509ce95b08c92-zgzEX58YAwA@public.gmane.org>
2013-09-17 10:02                   ` Matan Barak
     [not found]                     ` <5238289D.40608-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-17 10:25                       ` Yann Droneaud
     [not found]                         ` <bcec9d3a9a72ed1d612a4dd49b670800-zgzEX58YAwA@public.gmane.org>
2013-09-17 15:13                           ` Matan Barak
     [not found]                             ` <523871A2.8010109-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-17 15:43                               ` Yann Droneaud
     [not found]                                 ` <8bb85d86eca247afa5786b7c7e4c737a-zgzEX58YAwA@public.gmane.org>
2013-09-18  8:40                                   ` Matan Barak
     [not found]                                     ` <52396719.4050809-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-18 10:07                                       ` Yann Droneaud
     [not found]                                         ` <698ad99050d7ece7bac8a591e4318f45-zgzEX58YAwA@public.gmane.org>
2013-09-22  7:32                                           ` Matan Barak
     [not found]                                             ` <523E9D06.8050804-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-10-27 15:29                                               ` Tzahi Oved
2013-09-10 14:41   ` [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX Or Gerlitz
     [not found]     ` <1378824099-22150-10-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-12  5:26       ` Devesh Sharma
     [not found]         ` <CAGgPuS1tAiyA3TZ5_fpua3ue6JrZ9ruS+O+QU-7t28i0dZ7cUw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-12 10:45           ` Or Gerlitz
     [not found]             ` <52319B38.5070807-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-12 11:31               ` Devesh Sharma
2013-09-12 12:24                 ` Or Gerlitz
     [not found]                   ` <5231B28E.4090605-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-09-12 17:22                     ` Jason Gunthorpe
     [not found]                       ` <20130912172252.GA4611-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2013-09-17 17:50                         ` Roland Dreier
2013-09-12 11:46               ` Devesh Sharma
2013-09-17 20:49 Or Gerlitz
     [not found] ` <CAJZOPZJ_F06xORoQyt-6_SK5P5Y7LXekQuNKHHYSt+oJ8sV1GA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-17 23:10   ` Roland Dreier
     [not found]     ` <CAG4TOxOtsy+vtmtYciREk0bOC=o9-ME1T=cqvt46CNssCU57zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-18  4:31       ` Or Gerlitz
2013-09-29 10:48   ` Or Gerlitz
     [not found]     ` <52480568.8000801-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-10-02 15:09       ` Devesh Sharma
     [not found]         ` <CAGgPuS2791OXo9JrZ030qSn_4Yi777Vw5f8LP1-u2npNKppoKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-10-02 20:01           ` Or Gerlitz
2013-10-10 21:26       ` Or Gerlitz
