[PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
@ 2015-05-21 16:24 Or Gerlitz
       [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Or Gerlitz @ 2015-05-21 16:24 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai, Or Gerlitz

Standard configuration of SRIOV VFs through the host is done over the
following chain of calls: libvirt --> netlink --> PF netdevice -- where
the PF netdevice exports the ndo_set_vf_ calls.

When this comes to IB/IPoIB we should normalize this into the verbs
framework so we further go: PF IPoIB --> verbs API --> PF HW driver

Virtualization systems assign VMs with 48 bits mac, to allow working 
with non-modified SW layers (open-stack, libvirt, etc), we can safely
extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac
entry calls the set_vf_guid verb.

One thing to clean for being beyond RFC is to make the get_vf_config 
verb return guid and have IPoIB to make it back a mac.

Here's how it looks when using the ip tool (libvirt runs the same
netlink to set it out) and later reflected when the VF read their port.

# ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff

# ibstat -d mlx4_2
CA 'mlx4_2'
        CA type: MT4100
        Number of ports: 1
        Firmware version: 2.34.1260
        Hardware version: 0
        Node GUID: 0x00140500f30e84c4
        System image GUID: 0xf452140300117423
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 7
                LMC: 0
                SM lid: 1
                Capability mask: 0x02514868
                Port GUID: 0xffeeddfeffccbbaa
                Link layer: InfiniBand

Or Gerlitz (3):
  IB/IPoIB: Support SRIOV standard configuration
  IB/mlx4: Refactor Alias GUID storing
  IB/mlx4: Add support for SRIOV VF management

 drivers/infiniband/hw/mlx4/main.c         |   26 ++++++++++++++
 drivers/infiniband/hw/mlx4/mlx4_ib.h      |    4 ++
 drivers/infiniband/hw/mlx4/sysfs.c        |   54 ++++++++++++++++++----------
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   39 +++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/cmd.c  |   26 +++++++++----
 include/linux/mlx4/device.h               |    2 +
 include/rdma/ib_verbs.h                   |    4 ++
 7 files changed, 128 insertions(+), 27 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration
       [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-21 16:24   ` Or Gerlitz
       [not found]     ` <1432225447-6536-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-21 16:24   ` [PATCH RFC 2/3] IB/mlx4: Refactor Alias GUID storing Or Gerlitz
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Or Gerlitz @ 2015-05-21 16:24 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai, Or Gerlitz

Standard configuration of SRIOV VFs through the host is done over the
following chain of calls: libvirt --> netlink --> PF netdevice

When this comes to IB/IPoIB we should normalize this into the verbs
framework so we further go: PF IPoIB --> verbs API --> PF HW driver

Virtualization systems assign VMs 48 bits mac, to allow working with
non-modified SW layers (open-stack, libvirt, etc), we can safely
extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac
entry calls the set_vf_guid verb.

Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   39 +++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h                   |    4 +++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 9e1b203..8f82870 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1357,6 +1357,43 @@ void ipoib_dev_cleanup(struct net_device *dev)
 	priv->tx_ring = NULL;
 }
 
+static int ipoib_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivf)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	if (priv->ca->get_vf_config)
+		return priv->ca->get_vf_config(priv->ca, priv->port, vf, ivf);
+	else
+		return -EINVAL;
+}
+
+static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
+{
+	char *raw_guid;
+	u64 guid = 0;
+
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	raw_guid = (char *)&guid;
+	raw_guid[0] = mac[0];
+	raw_guid[1] = mac[1];
+	raw_guid[2] = mac[2];
+	raw_guid[3] = 0xff;
+	raw_guid[4] = 0xfe;
+	raw_guid[5] = mac[3];
+	raw_guid[6] = mac[4];
+	raw_guid[7] = mac[5];
+
+	guid &= ~(cpu_to_be64(1ULL << 56));
+	guid |= cpu_to_be64(1ULL << 57);
+
+	if (priv->ca->set_vf_guid)
+		return priv->ca->set_vf_guid(priv->ca, priv->port, queue, guid);
+	else
+		return -EINVAL;
+}
+
+
 static const struct header_ops ipoib_header_ops = {
 	.create	= ipoib_hard_header,
 };
@@ -1371,6 +1408,8 @@ static const struct net_device_ops ipoib_netdev_ops = {
 	.ndo_tx_timeout		 = ipoib_timeout,
 	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
 	.ndo_get_iflink		 = ipoib_get_iflink,
+	.ndo_get_vf_config	 = ipoib_get_vf_config,
+	.ndo_set_vf_mac	 	 = ipoib_set_vf_mac,
 };
 
 void ipoib_setup(struct net_device *dev)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..6589520 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -53,6 +53,7 @@
 #include <linux/atomic.h>
 #include <linux/mmu_notifier.h>
 #include <asm/uaccess.h>
+#include <linux/if_link.h>
 
 extern struct workqueue_struct *ib_wq;
 
@@ -1653,6 +1654,9 @@ struct ib_device {
 	int			   (*check_mr_status)(struct ib_mr *mr, u32 check_mask,
 						      struct ib_mr_status *mr_status);
 
+	int 			  (*set_vf_guid)  (struct ib_device *device, int port, int vf, u64 guid);
+	int 			  (*get_vf_config)(struct ib_device *device, int port, int vf, struct ifla_vf_info *ivf);
+
 	struct ib_dma_mapping_ops   *dma_ops;
 
 	struct module               *owner;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH RFC 2/3] IB/mlx4: Refactor Alias GUID storing
       [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-21 16:24   ` [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration Or Gerlitz
@ 2015-05-21 16:24   ` Or Gerlitz
  2015-05-21 16:24   ` [PATCH RFC 3/3] IB/mlx4: Add support for SRIOV VF management Or Gerlitz
  2015-05-21 16:40   ` [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs Doug Ledford
  3 siblings, 0 replies; 31+ messages in thread
From: Or Gerlitz @ 2015-05-21 16:24 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai, Or Gerlitz

Move the code that actually set the alias GUID provided by the admin
into a function which isn't tied to the mlx4 SRIOV sysfs constructs.

So we can use for the verbs which deal with guid setting too.

This commit does not change any functionality.

Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/mlx4_ib.h |    4 ++
 drivers/infiniband/hw/mlx4/sysfs.c   |   54 ++++++++++++++++++++++------------
 2 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index fce3934..a13a814 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -807,6 +807,10 @@ void mlx4_ib_device_unregister_sysfs(struct mlx4_ib_dev *device);
 
 __be64 mlx4_ib_gen_node_guid(void);
 
+
+void mlx4_store_admin_alias_guid(struct mlx4_ib_dev *mdev, int port, int slave,
+				 __be64 guid);
+
 int mlx4_ib_steer_qp_alloc(struct mlx4_ib_dev *dev, int count, int *qpn);
 void mlx4_ib_steer_qp_free(struct mlx4_ib_dev *dev, u32 qpn, int count);
 int mlx4_ib_steer_qp_reg(struct mlx4_ib_dev *mdev, struct mlx4_ib_qp *mqp,
diff --git a/drivers/infiniband/hw/mlx4/sysfs.c b/drivers/infiniband/hw/mlx4/sysfs.c
index 6797108..705e3b8 100644
--- a/drivers/infiniband/hw/mlx4/sysfs.c
+++ b/drivers/infiniband/hw/mlx4/sysfs.c
@@ -59,6 +59,38 @@ static ssize_t show_admin_alias_guid(struct device *dev,
 	return sprintf(buf, "%llx\n", be64_to_cpu(sysadmin_ag_val));
 }
 
+
+void mlx4_store_admin_alias_guid(struct mlx4_ib_dev *mdev, int port, int slave,
+				 __be64 guid)
+{
+	unsigned long flags;
+	int record_num;/*0-15*/
+	int guid_index_in_rec; /*0 - 7*/
+
+	record_num        = slave / 8;
+	guid_index_in_rec = slave % 8;
+
+	spin_lock_irqsave(&mdev->sriov.alias_guid.ag_work_lock, flags);
+
+	*(__be64 *)&mdev->sriov.alias_guid.ports_guid[port - 1].
+		all_rec_per_port[record_num].
+		all_recs[GUID_REC_SIZE * guid_index_in_rec] = guid;
+
+	/* Change the state to be pending for update */
+	mdev->sriov.alias_guid.ports_guid[port - 1].all_rec_per_port[record_num].status
+		= MLX4_GUID_INFO_STATUS_IDLE ;
+
+	mlx4_set_admin_guid(mdev->dev, guid, slave, port);
+
+	/* set the record index */
+	mdev->sriov.alias_guid.ports_guid[port - 1].all_rec_per_port[record_num].guid_indexes
+		|= mlx4_ib_get_aguid_comp_mask_from_ix(guid_index_in_rec);
+
+	spin_unlock_irqrestore(&mdev->sriov.alias_guid.ag_work_lock, flags);
+
+	mlx4_ib_init_alias_guid_work(mdev, port - 1);
+}
+
 /* store_admin_alias_guid stores the (new) administratively assigned value of that GUID.
  * Values in buf parameter string:
  *	0			- requests opensm to assign a value.
@@ -76,7 +108,6 @@ static ssize_t store_admin_alias_guid(struct device *dev,
 	struct mlx4_ib_iov_port *port = mlx4_ib_iov_dentry->ctx;
 	struct mlx4_ib_dev *mdev = port->dev;
 	u64 sysadmin_ag_val;
-	unsigned long flags;
 
 	record_num = mlx4_ib_iov_dentry->entry_num / 8;
 	guid_index_in_rec = mlx4_ib_iov_dentry->entry_num % 8;
@@ -84,26 +115,11 @@ static ssize_t store_admin_alias_guid(struct device *dev,
 		pr_err("GUID 0 block 0 is RO\n");
 		return count;
 	}
-	spin_lock_irqsave(&mdev->sriov.alias_guid.ag_work_lock, flags);
 	sscanf(buf, "%llx", &sysadmin_ag_val);
-	*(__be64 *)&mdev->sriov.alias_guid.ports_guid[port->num - 1].
-		all_rec_per_port[record_num].
-		all_recs[GUID_REC_SIZE * guid_index_in_rec] =
-			cpu_to_be64(sysadmin_ag_val);
-
-	/* Change the state to be pending for update */
-	mdev->sriov.alias_guid.ports_guid[port->num - 1].all_rec_per_port[record_num].status
-		= MLX4_GUID_INFO_STATUS_IDLE ;
-	mlx4_set_admin_guid(mdev->dev, cpu_to_be64(sysadmin_ag_val),
-			    mlx4_ib_iov_dentry->entry_num,
-			    port->num);
 
-	/* set the record index */
-	mdev->sriov.alias_guid.ports_guid[port->num - 1].all_rec_per_port[record_num].guid_indexes
-		|= mlx4_ib_get_aguid_comp_mask_from_ix(guid_index_in_rec);
-
-	spin_unlock_irqrestore(&mdev->sriov.alias_guid.ag_work_lock, flags);
-	mlx4_ib_init_alias_guid_work(mdev, port->num - 1);
+	mlx4_store_admin_alias_guid(mdev, port->num,
+				    mlx4_ib_iov_dentry->entry_num,
+				    cpu_to_be64(sysadmin_ag_val));
 
 	return count;
 }
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH RFC 3/3] IB/mlx4: Add support for SRIOV VF management
       [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-05-21 16:24   ` [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration Or Gerlitz
  2015-05-21 16:24   ` [PATCH RFC 2/3] IB/mlx4: Refactor Alias GUID storing Or Gerlitz
@ 2015-05-21 16:24   ` Or Gerlitz
  2015-05-21 16:40   ` [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs Doug Ledford
  3 siblings, 0 replies; 31+ messages in thread
From: Or Gerlitz @ 2015-05-21 16:24 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai, Or Gerlitz

Add support for the set_vf_guid and get_vf_config verbs.

Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c        |   26 ++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/cmd.c |   26 ++++++++++++++++++--------
 include/linux/mlx4/device.h              |    2 ++
 3 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 57070c5..17b6fa7 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1325,6 +1325,27 @@ err_malloc:
 	return err;
 }
 
+static int mlx4_ib_set_vf_guid(struct ib_device *ibdev, int port, int vf, u64 guid)
+{
+	int slave;
+	struct mlx4_ib_dev *mdev = to_mdev(ibdev);
+
+	slave = mlx4_get_slave_indx(mdev->dev, vf);
+	if (slave < 0)
+		return -EINVAL;
+
+	mlx4_store_admin_alias_guid(mdev, port, slave, cpu_to_be64(guid));
+
+	return 0;
+}
+
+static int mlx4_ib_get_vf_config(struct ib_device *ibdev, int port, int vf, struct ifla_vf_info *ivf)
+{
+	struct mlx4_ib_dev *mdev = to_mdev(ibdev);
+
+	return mlx4_get_vf_config(mdev->dev, port, vf, ivf);
+}
+
 static struct mlx4_ib_gid_entry *find_gid_entry(struct mlx4_ib_qp *qp, u8 *raw)
 {
 	struct mlx4_ib_gid_entry *ge;
@@ -2250,6 +2271,11 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 		ibdev->ib_dev.dealloc_fmr	= mlx4_ib_fmr_dealloc;
 	}
 
+	if (mlx4_is_master(ibdev->dev)) {
+		ibdev->ib_dev.set_vf_guid   = mlx4_ib_set_vf_guid;
+		ibdev->ib_dev.get_vf_config = mlx4_ib_get_vf_config;
+	}
+
 	if (dev->caps.flags & MLX4_DEV_CAP_FLAG_MEM_WINDOW ||
 	    dev->caps.bmme_flags & MLX4_BMME_FLAG_TYPE_2_WIN) {
 		ibdev->ib_dev.alloc_mw = mlx4_ib_alloc_mw;
diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 7761045..a544650 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -2647,7 +2647,7 @@ u32 mlx4_comm_get_version(void)
 	 return ((u32) CMD_CHAN_IF_REV << 8) | (u32) CMD_CHAN_VER;
 }
 
-static int mlx4_get_slave_indx(struct mlx4_dev *dev, int vf)
+int mlx4_get_slave_indx(struct mlx4_dev *dev, int vf)
 {
 	if ((vf < 0) || (vf >= dev->persist->num_vfs)) {
 		mlx4_err(dev, "Bad vf number:%d (number of activated vf: %d)\n",
@@ -2657,6 +2657,7 @@ static int mlx4_get_slave_indx(struct mlx4_dev *dev, int vf)
 
 	return vf+1;
 }
+EXPORT_SYMBOL_GPL(mlx4_get_slave_indx);
 
 int mlx4_get_vf_indx(struct mlx4_dev *dev, int slave)
 {
@@ -3089,13 +3090,22 @@ int mlx4_get_vf_config(struct mlx4_dev *dev, int port, int vf, struct ifla_vf_in
 	s_info = &priv->mfunc.master.vf_admin[slave].vport[port];
 	ivf->vf = vf;
 
-	/* need to convert it to a func */
-	ivf->mac[0] = ((s_info->mac >> (5*8)) & 0xff);
-	ivf->mac[1] = ((s_info->mac >> (4*8)) & 0xff);
-	ivf->mac[2] = ((s_info->mac >> (3*8)) & 0xff);
-	ivf->mac[3] = ((s_info->mac >> (2*8)) & 0xff);
-	ivf->mac[4] = ((s_info->mac >> (1*8)) & 0xff);
-	ivf->mac[5] = ((s_info->mac)  & 0xff);
+	if (dev->caps.port_mask[port] == MLX4_PORT_TYPE_ETH) {
+		ivf->mac[0] = ((s_info->mac >> (5*8)) & 0xff);
+		ivf->mac[1] = ((s_info->mac >> (4*8)) & 0xff);
+		ivf->mac[2] = ((s_info->mac >> (3*8)) & 0xff);
+		ivf->mac[3] = ((s_info->mac >> (2*8)) & 0xff);
+		ivf->mac[4] = ((s_info->mac >> (1*8)) & 0xff);
+		ivf->mac[5] = ((s_info->mac)  & 0xff);
+	} else {
+		u64 guid = be64_to_cpu(s_info->guid);
+		ivf->mac[0] = ((guid >> (7*8)) & 0xff);
+		ivf->mac[1] = ((guid >> (6*8)) & 0xff);
+		ivf->mac[2] = ((guid >> (5*8)) & 0xff);
+		ivf->mac[3] = ((guid >> (2*8)) & 0xff);
+		ivf->mac[4] = ((guid >> (1*8)) & 0xff);
+		ivf->mac[5] = ((guid)  & 0xff);
+	}
 
 	ivf->vlan		= s_info->default_vlan;
 	ivf->qos		= s_info->default_qos;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 83e80ab..e5a70bd 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -1382,6 +1382,8 @@ int mlx4_get_slave_from_roce_gid(struct mlx4_dev *dev, int port, u8 *gid,
 int mlx4_get_roce_gid_from_slave(struct mlx4_dev *dev, int port, int slave_id,
 				 u8 *gid);
 
+int mlx4_get_slave_indx(struct mlx4_dev *dev, int vf);
+
 int mlx4_FLOW_STEERING_IB_UC_QP_RANGE(struct mlx4_dev *dev, u32 min_range_qpn,
 				      u32 max_range_qpn);
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-05-21 16:24   ` [PATCH RFC 3/3] IB/mlx4: Add support for SRIOV VF management Or Gerlitz
@ 2015-05-21 16:40   ` Doug Ledford
       [not found]     ` <1432226406.28905.22.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 1 reply; 31+ messages in thread
From: Doug Ledford @ 2015-05-21 16:40 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

[-- Attachment #1: Type: text/plain, Size: 2953 bytes --]

On Thu, 2015-05-21 at 19:24 +0300, Or Gerlitz wrote:
> Standard configuration of SRIOV VFs through the host is done over the
> following chain of calls: libvirt --> netlink --> PF netdevice -- where
> the PF netdevice exports the ndo_set_vf_ calls.
> 
> When this comes to IB/IPoIB we should normalize this into the verbs
> framework so we further go: PF IPoIB --> verbs API --> PF HW driver
> 
> Virtualization systems assign VMs with 48 bits mac, to allow working 
> with non-modified SW layers

Why are we suggesting to make this work with unmodified software?  Why
aren't we doing this right and adding a new ndo entry point for the
GUID?  The MAC/GUID mapping isn't the only thing that has to be faked
here, you would also have to fake the vlan/pkey mapping.  This just
seems the wrong thing to do.

>  (open-stack, libvirt, etc), we can safely
> extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac
> entry calls the set_vf_guid verb.
> 
> One thing to clean for being beyond RFC is to make the get_vf_config 
> verb return guid and have IPoIB to make it back a mac.
> 
> Here's how it looks when using the ip tool (libvirt runs the same
> netlink to set it out) and later reflected when the VF read their port.
> 
> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff
> 
> # ibstat -d mlx4_2
> CA 'mlx4_2'
>         CA type: MT4100
>         Number of ports: 1
>         Firmware version: 2.34.1260
>         Hardware version: 0
>         Node GUID: 0x00140500f30e84c4
>         System image GUID: 0xf452140300117423
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 56
>                 Base lid: 7
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x02514868
>                 Port GUID: 0xffeeddfeffccbbaa
>                 Link layer: InfiniBand
> 
> Or Gerlitz (3):
>   IB/IPoIB: Support SRIOV standard configuration
>   IB/mlx4: Refactor Alias GUID storing
>   IB/mlx4: Add support for SRIOV VF management
> 
>  drivers/infiniband/hw/mlx4/main.c         |   26 ++++++++++++++
>  drivers/infiniband/hw/mlx4/mlx4_ib.h      |    4 ++
>  drivers/infiniband/hw/mlx4/sysfs.c        |   54 ++++++++++++++++++----------
>  drivers/infiniband/ulp/ipoib/ipoib_main.c |   39 +++++++++++++++++++++
>  drivers/net/ethernet/mellanox/mlx4/cmd.c  |   26 +++++++++----
>  include/linux/mlx4/device.h               |    2 +
>  include/rdma/ib_verbs.h                   |    4 ++
>  7 files changed, 128 insertions(+), 27 deletions(-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration
       [not found]     ` <1432225447-6536-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-05-21 18:46       ` Jason Gunthorpe
       [not found]         ` <20150521184613.GD6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-21 18:46 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Thu, May 21, 2015 at 07:24:05PM +0300, Or Gerlitz wrote:
> Standard configuration of SRIOV VFs through the host is done over the
> following chain of calls: libvirt --> netlink --> PF netdevice
> 
> When this comes to IB/IPoIB we should normalize this into the verbs
> framework so we further go: PF IPoIB --> verbs API --> PF HW driver
> 
> Virtualization systems assign VMs 48 bits mac, to allow working with
> non-modified SW layers (open-stack, libvirt, etc), we can safely
> extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac
> entry calls the set_vf_guid verb.
> 
> Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>  drivers/infiniband/ulp/ipoib/ipoib_main.c |   39 +++++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h                   |    4 +++
>  2 files changed, 43 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 9e1b203..8f82870 100644
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -1357,6 +1357,43 @@ void ipoib_dev_cleanup(struct net_device *dev)
>  	priv->tx_ring = NULL;
>  }
>  
> +static int ipoib_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivf)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +
> +	if (priv->ca->get_vf_config)
> +		return priv->ca->get_vf_config(priv->ca, priv->port, vf, ivf);
> +	else
> +		return -EINVAL;
> +}
> +
> +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
> +{
> +	char *raw_guid;
> +	u64 guid = 0;

This doesn't seem right at all.

It makes no sense that a IPoIB interface with a 20 byte LLADDR would
accept an 8 byte LLADDR only for 'ip link set vf mac'

The definition of the netlink struct seems to confirm this:

struct ifla_vf_mac {
        __u32 vf;
        __u8 mac[32]; /* MAX_ADDR_LEN */
};

If it was really just ever a mac it would really only be 6 bytes.
[Honestly, this whole feature seems very inconistent with the rest of
 the design of ip net link, so who knows]

If the ifla_vf_mac had been variable-sized (like every other address
related attribute) then sure, auto detect the length and do the right
thing.

But with this API, I think you have no choice, 'ip set vf mac LLADDR'
can only be the 20 byte address.

If you really, really want this: Get someone from iproute or netdev to
sign off that they really mean that 'ip set vf mac LLADDR' is always a
6 byte mac.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]     ` <1432226406.28905.22.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-21 19:55       ` Or Gerlitz
       [not found]         ` <CAJ3xEMjzpqnQuA=0HQaN8noVq04d9BkVvEWGY7Lq5ZntVTKm4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Or Gerlitz @ 2015-05-21 19:55 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Thu, May 21, 2015 at 7:40 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> The MAC/GUID mapping isn't the only thing that has to be faked here

Exactly nothing is faked here, Virtualization systems such as
open-stack provision unique 48 bit mac values to VMs, and it's
perfectly legitimate and viable to derive 64 bit guid value from that
mac.

> Why are we suggesting to make this work with unmodified software?  Why
> aren't we doing this right and adding a new ndo entry point for the GUID?

Because rome wasn't built in a day and nor will be the support for IB
in today's/tomorrow's virtualization systems, e.g if you follow on
this layering

[1] Open-Stack / ODL controller
[2] Open-Stack neutron / ODL agent
[3] libvirt
[4] user/kernel netlink API
[5] kernel ndo API
[6] ipoib
[7] kernel verbs API
[8] PF IB driver

with the approach presented here,  we only simply (yeah, simplicity
could turn to be critical criteria in engineering) to few kernel only
patches that deal with layers 6-8 and we are ready for all sorts of
bring-ups, testing and even production!

For reasons which I don't really see the practical / real life use
case where there's a must to get them to work (but I will happy to
hear on) one can go & change the world, namely patch layers 5 ---> 1
too and deal with all sort of dependencies for setting up a system.
But guess what, this can be perfectly done in parallel with this small
change.

> you would also have to fake the vlan/pkey mapping.  This just
> seems the wrong thing to do.

Repeating the above argument --- virt systems provision 12bit vlan-id
to be set for VM traffic, which can be nicely map to 16 bit IB pkey
doing the same job.

I understand that you have sort of  desire to see IB ala the full spec
going into libvirt and from there up to the whole virtualization
management space, but this doesn't need as an argument to not enable
doing thing in the right direction. The upstream kernel supports SRIOV
for IB over mlx4 for 3 years now, but this can't work with libvirt as
is. Using these patches can make the thing.

Couple of months ago, we both attended a call with the libvirt
developers / maintainers from red-hat and they really liked this
staged approach.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration
       [not found]         ` <20150521184613.GD6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-21 20:05           ` Or Gerlitz
       [not found]             ` <CAJ3xEMgJvXjg3aFbTNEudj9QWMfP4==eBq0ccuhjuVJsv9mRmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-21 21:05           ` Doug Ledford
  1 sibling, 1 reply; 31+ messages in thread
From: Or Gerlitz @ 2015-05-21 20:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Thu, May 21, 2015, Jason Gunthorpe  wrote:

>> +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
>> +{
>> +     char *raw_guid;
>> +     u64 guid = 0;
>
> This doesn't seem right at all.
>
> It makes no sense that a IPoIB interface with a 20 byte LLADDR would
> accept an 8 byte LLADDR only for 'ip link set vf mac'
>
> The definition of the netlink struct seems to confirm this:
>
> struct ifla_vf_mac {
>         __u32 vf;
>         __u8 mac[32]; /* MAX_ADDR_LEN */
> };
>
> If it was really just ever a mac it would really only be 6 bytes.
> [Honestly, this whole feature seems very inconistent with the rest of
>  the design of ip net link, so who knows]
>
> If the ifla_vf_mac had been variable-sized (like every other address
> related attribute) then sure, auto detect the length and do the right
> thing.
>
> But with this API, I think you have no choice, 'ip set vf mac LLADDR'
> can only be the 20 byte address.

You can't enforce 20 byte address on ipoib instance b/c the driver
can't dictate the QPN of their UD QP. Also, you don't need to force
that, b/c what the virtualization system want to provision relates to
a VM ID which is their mac (--> guid).

If the ifla_vf_mac had been variable-sized we could have the ipoib
set_vf_mac implementation to check if user-space provided 48 bits
(mac), 64bits (guid) or even 128bits (whole gid), but it doesn't and I
would like to either use the lowest common denominator (48bits) or
just take always 64bits which could have two zero bytes (e.g when
libvirt calls into that api through netlink).
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration
       [not found]             ` <CAJ3xEMgJvXjg3aFbTNEudj9QWMfP4==eBq0ccuhjuVJsv9mRmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-21 21:01               ` Jason Gunthorpe
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-21 21:01 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Thu, May 21, 2015 at 11:05:32PM +0300, Or Gerlitz wrote:
> On Thu, May 21, 2015, Jason Gunthorpe  wrote:
> 
> >> +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
> >> +{
> >> +     char *raw_guid;
> >> +     u64 guid = 0;
> >
> > This doesn't seem right at all.
> >
> > It makes no sense that a IPoIB interface with a 20 byte LLADDR would
> > accept an 8 byte LLADDR only for 'ip link set vf mac'
> >
> > The definition of the netlink struct seems to confirm this:
> >
> > struct ifla_vf_mac {
> >         __u32 vf;
> >         __u8 mac[32]; /* MAX_ADDR_LEN */
> > };
> >
> > If it was really just ever a mac it would really only be 6 bytes.
> > [Honestly, this whole feature seems very inconistent with the rest of
> >  the design of ip net link, so who knows]
> >
> > If the ifla_vf_mac had been variable-sized (like every other address
> > related attribute) then sure, auto detect the length and do the right
> > thing.
> >
> > But with this API, I think you have no choice, 'ip set vf mac LLADDR'
> > can only be the 20 byte address.
> 
> You can't enforce 20 byte address on ipoib instance b/c the driver
> can't dictate the QPN of their UD QP.

I can think of several ways off the top of my head to deal with
that. Don't see it as a problem.

> Also, you don't need to force that, b/c what the virtualization
> system want to provision relates to a VM ID which is their mac (-->
> guid).

No, I do need to force that because that is how netlink works: each
device has a defined hw address length, and all netlink messages
dealing with LLADDR must use an identical format for the hw
address. For IPoIB that is 20 byte.

It is a simple matter of UAPI sanity, it is not the kernel's problem
that some userspace tools assume a 6 byte LLADDR for all devices -
they are broken.

> (mac), 64bits (guid) or even 128bits (whole gid), but it doesn't and I
> would like to either use the lowest common denominator (48bits) or
> just take always 64bits which could have two zero bytes (e.g when
> libvirt calls into that api through netlink).

Then get the iproute and netdev people to sign off on using
ifla_vf_mac with something other than the defined 20 byte IPoIB HW
address. Maybe you can work out some scheme with them. Making the
IFLA_VF_MAC properly variable sized might be a good angle.

NACK otherwise.

The rest of the series's idea seems sound.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration
       [not found]         ` <20150521184613.GD6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-05-21 20:05           ` Or Gerlitz
@ 2015-05-21 21:05           ` Doug Ledford
       [not found]             ` <1432242331.28905.67.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 31+ messages in thread
From: Doug Ledford @ 2015-05-21 21:05 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

[-- Attachment #1: Type: text/plain, Size: 2283 bytes --]

On Thu, 2015-05-21 at 12:46 -0600, Jason Gunthorpe wrote:
> On Thu, May 21, 2015 at 07:24:05PM +0300, Or Gerlitz wrote:
> > Standard configuration of SRIOV VFs through the host is done over the
> > following chain of calls: libvirt --> netlink --> PF netdevice
> > 
> > When this comes to IB/IPoIB we should normalize this into the verbs
> > framework so we further go: PF IPoIB --> verbs API --> PF HW driver
> > 
> > Virtualization systems assign VMs 48 bits mac, to allow working with
> > non-modified SW layers (open-stack, libvirt, etc), we can safely
> > extend this mac to unique 64 bits GUID. Hence the IPoIB ndo_set_vf_mac
> > entry calls the set_vf_guid verb.
> > 
> > Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c |   39 +++++++++++++++++++++++++++++
> >  include/rdma/ib_verbs.h                   |    4 +++
> >  2 files changed, 43 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > index 9e1b203..8f82870 100644
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > @@ -1357,6 +1357,43 @@ void ipoib_dev_cleanup(struct net_device *dev)
> >  	priv->tx_ring = NULL;
> >  }
> >  
> > +static int ipoib_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivf)
> > +{
> > +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +
> > +	if (priv->ca->get_vf_config)
> > +		return priv->ca->get_vf_config(priv->ca, priv->port, vf, ivf);
> > +	else
> > +		return -EINVAL;
> > +}
> > +
> > +static int ipoib_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
> > +{
> > +	char *raw_guid;
> > +	u64 guid = 0;
> 
> This doesn't seem right at all.
> 
> It makes no sense that a IPoIB interface with a 20 byte LLADDR would
> accept an 8 byte LLADDR only for 'ip link set vf mac'

It has to be the 8byte guid, the next 8bytes are the subnet prefix which
is set by the SM and not under driver control, and the final 4bytes are
specific to the IPoIB connection.  You could pass in all 20bytes, but
you would only be able to set 8 of them.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]         ` <CAJ3xEMjzpqnQuA=0HQaN8noVq04d9BkVvEWGY7Lq5ZntVTKm4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-21 21:11           ` Doug Ledford
       [not found]             ` <1432242708.28905.73.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Doug Ledford @ 2015-05-21 21:11 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

[-- Attachment #1: Type: text/plain, Size: 4480 bytes --]

On Thu, 2015-05-21 at 22:55 +0300, Or Gerlitz wrote:
> On Thu, May 21, 2015 at 7:40 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > The MAC/GUID mapping isn't the only thing that has to be faked here
> 
> Exactly nothing is faked here, Virtualization systems such as
> open-stack provision unique 48 bit mac values to VMs, and it's
> perfectly legitimate and viable to derive 64 bit guid value from that
> mac.

OK, faked wasn't the best use of words.  How's converted behind the
software's back?  And if the management software set the MAC, then tried
to check it via ARP after the guest is up and running, it would never
find the guest.  I don't know if Open-Stack or any other controller
would both A) attempt to set the MAC of the device in libvirt and start
the guest and B) enter the MAC into a dhcp.conf file for static IP
assignment, but they could, and this sort of manipulation would directly
break that.

> > Why are we suggesting to make this work with unmodified software?  Why
> > aren't we doing this right and adding a new ndo entry point for the GUID?
> 
> Because rome wasn't built in a day and nor will be the support for IB
> in today's/tomorrow's virtualization systems, e.g if you follow on
> this layering
> 
> [1] Open-Stack / ODL controller
> [2] Open-Stack neutron / ODL agent
> [3] libvirt
> [4] user/kernel netlink API
> [5] kernel ndo API
> [6] ipoib
> [7] kernel verbs API
> [8] PF IB driver
> 
> with the approach presented here,  we only simply (yeah, simplicity
> could turn to be critical criteria in engineering) to few kernel only
> patches that deal with layers 6-8 and we are ready for all sorts of
> bring-ups, testing and even production!

You're ready to pretend that your IB device is a regular ethernet
device.  Not even a RoCE or iWARP device.  You totally obscured that the
device is RDMA capable and made the entire stack above 6 unawares of
what you are doing.  If your guest actually intends to use any RDMA
capabilities, then this is, at best, a quick and dirty workaround to get
you up and running while you work through the process of doing things
right, which includes making libvirt aware of the difference between
RDMA capable devices and not and how to select those devices and how to
mark certain guests as RDMA device needy.

> For reasons which I don't really see the practical / real life use
> case where there's a must to get them to work (but I will happy to
> hear on) one can go & change the world, namely patch layers 5 ---> 1
> too and deal with all sort of dependencies for setting up a system.
> But guess what, this can be perfectly done in parallel with this small
> change.
> 
> > you would also have to fake the vlan/pkey mapping.  This just
> > seems the wrong thing to do.
> 
> Repeating the above argument --- virt systems provision 12bit vlan-id
> to be set for VM traffic, which can be nicely map to 16 bit IB pkey
> doing the same job.
> 
> I understand that you have sort of  desire to see IB ala the full spec
> going into libvirt and from there up to the whole virtualization
> management space, but this doesn't need as an argument to not enable
> doing thing in the right direction. The upstream kernel supports SRIOV
> for IB over mlx4 for 3 years now, but this can't work with libvirt as
> is. Using these patches can make the thing.

It's a workaround.  It comes with limitations, and if we get around to
adding an ndo later for really setting the guid, then it would be
possible to call the set_guid ndo with a complete guid that didn't use
fffe in the middle 2 bytes, and then when we call get vf_info, we get a
MAC back that removes those 2 bytes and generates an inconsistency
between what we *think* our constructed guid should be and what the set
guid actually is.

> Couple of months ago, we both attended a call with the libvirt
> developers / maintainers from red-hat and they really liked this
> staged approach.

My recollection of that call was they said "Oh, you guys don't have an
API for us to set the GUIDs yet.  Ok, we'll close all the bugs and wait
until you do."  And they promptly closed the bugs and moved on.  But
that didn't specify the API to use.  That's what we are doing here.  But
I'm not finding this an entirely convincing solution.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration
       [not found]             ` <1432242331.28905.67.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-21 21:21               ` Jason Gunthorpe
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-21 21:21 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Thu, May 21, 2015 at 05:05:31PM -0400, Doug Ledford wrote:
> It has to be the 8byte guid, the next 8bytes are the subnet prefix which
> is set by the SM and not under driver control, and the final 4bytes are
> specific to the IPoIB connection.  You could pass in all 20bytes, but
> you would only be able to set 8 of them.

Of course you can't set all 20 bytes, but you still have to provide
them, that is the defined LLADDR format for this device.

1) Kernel demands immutable bits are zero and fills them in
   automatically
2) Userspace copies the current LLADDR and only modifies the 8 bytes
   of the GUID, kernel demands immutable bits are unchanged
3) Who is to say the QPN can't be made settable someday? People have
   complained about the unstable QPN in the past.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]             ` <1432242708.28905.73.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-21 21:44               ` Jason Gunthorpe
  2015-05-25 20:04               ` Or Gerlitz
  1 sibling, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-21 21:44 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Thu, May 21, 2015 at 05:11:48PM -0400, Doug Ledford wrote:
> fffe in the middle 2 bytes, and then when we call get vf_info, we get a
> MAC back that removes those 2 bytes and generates an inconsistency
> between what we *think* our constructed guid should be and what the set
> guid actually is.

Bingo - When ndo_get_vf_config is called on the PF it must return the
same 20 byte LLADDR that will match the IFLA_ADDRESS returned on the
VF netdevice.

If do_get_vf_config returns the 20 byte LLADDR, then ndo_set_vf_mac
must also accept the same format.

This probably suggests option #2 from my last email is the way to go,
as the input and ouput from the set, and any followup IFLA_ADDRESS or
ndo_get_vf_config will all be the same LLADDR. Hidden changes to the
LLADDR are probably just going to confuse things.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]             ` <1432242708.28905.73.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-05-21 21:44               ` Jason Gunthorpe
@ 2015-05-25 20:04               ` Or Gerlitz
       [not found]                 ` <CAJ3xEMh3BaxJzCu9mV9m6ZMwgrDaO2UvTyS1i=DEPq9nuLX3oA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 31+ messages in thread
From: Or Gerlitz @ 2015-05-25 20:04 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Fri, May 22, 2015 at 12:11 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, 2015-05-21 at 22:55 +0300, Or Gerlitz wrote:
>> On Thu, May 21, 2015 at 7:40 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>
>> > The MAC/GUID mapping isn't the only thing that has to be faked here
>>
>> Exactly nothing is faked here, Virtualization systems such as
>> open-stack provision unique 48 bit mac values to VMs, and it's
>> perfectly legitimate and viable to derive 64 bit guid value from that
>> mac.
>
> OK, faked wasn't the best use of words.  How's converted behind the
> software's back?  And if the management software set the MAC, then tried
> to check it via ARP after the guest is up and running, it would never
> find the guest.  I don't know if Open-Stack or any other controller
> would both A) attempt to set the MAC of the device in libvirt and start
> the guest and B) enter the MAC into a dhcp.conf file for static IP
> assignment, but they could, and this sort of manipulation would directly
> break that.

OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port
GUID, and as Jason noted the
user/kernel API allows to deliver up to 32 bytes between user and
kernel under the set_vf_mac flow
(do_setvfinfo() in net/core/rtnetlink.c). Trying it out through
**non-modified** ip tool and net/core/rtnetlink.c
things just work -  I can set eight bytes value to be the virtual port GUID :

# ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22

# ibstat -d mlx4_2
CA 'mlx4_2'
        CA type: MT4100
        Number of ports: 1
        Firmware version: 2.34.1260
        Hardware version: 0
        Node GUID: 0x001405003bca04bb
        System image GUID: 0xf452140300117423
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 7
                LMC: 0
                SM lid: 1
                Capability mask: 0x02514868
                Port GUID: 0x2211ffeeddccbbaa
                Link layer: InfiniBand


Re DHCP: RFC 2131 adds a “client identifier” option that replaces the
client hardware address as the unique identifier of the client in its
subnet. DHCP over IB RFC 4390 [1] requires that IPoIB DHCP clients use
the client identifier (as they cannot fit their 20 byte MAC in the
client hardware address field). DHCP packages (e.g ISC) that support
IPoIB use client identifier which is based on the unique eight byte
port GUID, so with the modified patches that use 8 bytes, we're OK
DHCP wise.

[1] https://tools.ietf.org/html/rfc4390#section-2.1

[...]
> It's a workaround.  It comes with limitations, and if we get around to
> adding an ndo later for really setting the guid, then it would be
> possible to call the set_guid ndo with a complete guid that didn't use
> fffe in the middle 2 bytes, and then when we call get vf_info, we get a
> MAC back that removes those 2 bytes and generates an inconsistency
> between what we *think* our constructed guid should be and what the set
> guid actually is.

OK, as written above, I have managed to get away from this possible mess
which you described here by providing eight bytes from user to kernel through
the existing netlink API (which is used by the ip tool and libvirt).

>> Couple of months ago, we both attended a call with the libvirt
>> developers / maintainers from red-hat and they really liked this
>> staged approach.

> My recollection of that call was they said "Oh, you guys don't have an
> API for us to set the GUIDs yet.  Ok, we'll close all the bugs and wait
> until you do."  And they promptly closed the bugs and moved on.  But
> that didn't specify the API to use.  That's what we are doing here.  But
> I'm not finding this an entirely convincing solution.

So that was what said, we were wrong and with this small ipoib/verbs patch,
we fully have the API to provision the vGUID of the VF.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                 ` <CAJ3xEMh3BaxJzCu9mV9m6ZMwgrDaO2UvTyS1i=DEPq9nuLX3oA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-25 21:14                   ` Jason Gunthorpe
       [not found]                     ` <20150525211433.GA9186-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-25 21:14 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote:

> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port
> GUID, and as Jason noted the user/kernel API allows to deliver up to
> 32 bytes between user and kernel under the set_vf_mac flow
> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through
> **non-modified** ip tool and net/core/rtnetlink.c things just work -
> I can set eight bytes value to be the virtual port GUID :

Was I not perfectly clear? You have to use the 20 byte LLADDR format
here:

> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22


>                 Port GUID: 0x2211ffeeddccbbaa

The byte order got screwed up someplace.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                     ` <20150525211433.GA9186-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-25 21:50                       ` Or Gerlitz
       [not found]                         ` <CAJ3xEMhG2W6WzxC4Kc2CFmdMwTRUF5ppBgcDZ6SMA=kgJowUpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Or Gerlitz @ 2015-05-25 21:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote:
>
>> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port
>> GUID, and as Jason noted the user/kernel API allows to deliver up to
>> 32 bytes between user and kernel under the set_vf_mac flow
>> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through
>> **non-modified** ip tool and net/core/rtnetlink.c things just work -
>> I can set eight bytes value to be the virtual port GUID :
>
> Was I not perfectly clear? You have to use the 20 byte LLADDR format here:
>
>> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22

Jason,

I am aiming to provision the VF IB end-node address == port GUID (vGUID)
in the same manner that VF Eth end-node address is their MAC, not
more, not less.

20 bytes are the lladdr of IPoIB devices which isn't the VF IB
end-node address but
rather made of flags (1B) + QPN (3B) + subnet prefix (8B) + VF GUID --
way more then
the virtualization system care or can provision.

>>                 Port GUID: 0x2211ffeeddccbbaa

> The byte order got screwed up someplace.

thx, will fix
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                         ` <CAJ3xEMhG2W6WzxC4Kc2CFmdMwTRUF5ppBgcDZ6SMA=kgJowUpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-25 22:32                           ` Jason Gunthorpe
       [not found]                             ` <20150525223235.GA9858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-05-26 16:53                           ` Doug Ledford
  1 sibling, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-25 22:32 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Tue, May 26, 2015 at 12:50:52AM +0300, Or Gerlitz wrote:
> On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote:
> >
> >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port
> >> GUID, and as Jason noted the user/kernel API allows to deliver up to
> >> 32 bytes between user and kernel under the set_vf_mac flow
> >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through
> >> **non-modified** ip tool and net/core/rtnetlink.c things just work -
> >> I can set eight bytes value to be the virtual port GUID :
> >
> > Was I not perfectly clear? You have to use the 20 byte LLADDR format here:
> >
> >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22
> 
> Jason,
> 
> I am aiming to provision the VF IB end-node address == port GUID (vGUID)
> in the same manner that VF Eth end-node address is their MAC, not
> more, not less.

I perfectly understand what you are trying to do.

I care about the design and consistency of netlink - and that means
there is one LLADDR definition for a net device, and every single netlink
message that touches a LLADDR uses that definition - for IPoIB that is 20
bytes.

To violate that design invariant needs an incredibly strong argument,
and you haven't made it.

This allows generic code to work with the LLADDR - for instance 'ip
link set vf mac' should have checked the size of the IFLA_ADDRESS and
demanded that the address argument is the same number of bytes. It is
very broken the command happily accepts an 8 byte and 6 byte argument
for the same device.

Yes, it is ugly that the PF side's ndo_get_vf_config cannot return the
same 20 byte address of the VF's ipoib interface, but I think that is
less ugly than forcing a different address format just for the vf calls.

If you have doubts then *ask netdev*. Ask them if ndo_set_vf_mac must
follow the same address size and format as IFLA_ADDRESS, or if we can
use something else.

Such a netlink architecture choice is beyond the authority of
linux-rdma.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                             ` <20150525223235.GA9858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-26  3:33                               ` Or Gerlitz
       [not found]                                 ` <CAJ3xEMjSU0xVWyqd8v_-OO5JvsHycGTU6gg=BHpZD8PSqRfzQg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-26 16:53                               ` Doug Ledford
  1 sibling, 1 reply; 31+ messages in thread
From: Or Gerlitz @ 2015-05-26  3:33 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, May 26, 2015 at 1:32 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Tue, May 26, 2015 at 12:50:52AM +0300, Or Gerlitz wrote:
>> On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe
>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>> > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote:
>> >
>> >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port
>> >> GUID, and as Jason noted the user/kernel API allows to deliver up to
>> >> 32 bytes between user and kernel under the set_vf_mac flow
>> >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through
>> >> **non-modified** ip tool and net/core/rtnetlink.c things just work -
>> >> I can set eight bytes value to be the virtual port GUID :
>> >
>> > Was I not perfectly clear? You have to use the 20 byte LLADDR format here:
>> >
>> >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22
>>
>> Jason,
>>
>> I am aiming to provision the VF IB end-node address == port GUID (vGUID)
>> in the same manner that VF Eth end-node address is their MAC, not
>> more, not less.
>
> I perfectly understand what you are trying to do.

Good, we should be doing things for a reason, and not just for the sake of doing
them 111% right by someone possibly subjective judgement

> I care about the design and consistency of netlink - and that means
> there is one LLADDR definition for a net device, and every single netlink
> message that touches a LLADDR uses that definition - for IPoIB that is 20
> bytes.

> To violate that design invariant needs an incredibly strong argument,
> and you haven't made it.


> This allows generic code to work with the LLADDR - for instance 'ip
> link set vf mac' should have checked the size of the IFLA_ADDRESS and
> demanded that the address argument is the same number of bytes. It is
> very broken the command happily accepts an 8 byte and 6 byte argument
> for the same device.

OK, so per your view, the existing kernel code for this flow is broken
and you resist my attempt to use it as is, and

> Yes, it is ugly that the PF side's ndo_get_vf_config cannot return the
> same 20 byte address of the VF's ipoib interface, but I think that is
> less ugly than forcing a different address format just for the vf calls.

you claim that what I propose is uglier from the fact that the PF can't
by no means return the 20 VF's IPoIB address and it's OK if I only let
the PF configure 20 bytes with part (say four) of them being arbitrary
and only have consistent 20B get/set by the PF.

Would you be happier if the ipoib ndo_set_vf_mac ndo be

1. getting 20B from user-space and treating 16 of them as the VF
subnet-prefix (8B) + vGUID (8B)

2. checking that the subnet-prefix  is correct

3. provision the vGUID through PF driver / the verb I proposed for the VF

???

on the way back, for the get_vf_config

1. read the VF vGUID from the PF IB driver through the verbs

2. add the port subnet prefix

3. return 20B to user-space

???

> If you have doubts then *ask netdev*. Ask them if ndo_set_vf_mac must
> follow the same address size and format as IFLA_ADDRESS, or if we can
> use something else.

> Such a netlink architecture choice is beyond the authority of linux-rdma.

I am not @ the point to change start changing this specific netlink flow.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                 ` <CAJ3xEMjSU0xVWyqd8v_-OO5JvsHycGTU6gg=BHpZD8PSqRfzQg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-26  5:58                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-26  5:58 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, May 26, 2015 at 06:33:15AM +0300, Or Gerlitz wrote:
> > This allows generic code to work with the LLADDR - for instance 'ip
> > link set vf mac' should have checked the size of the IFLA_ADDRESS and
> > demanded that the address argument is the same number of bytes. It is
> > very broken the command happily accepts an 8 byte and 6 byte argument
> > for the same device.
> 
> OK, so per your view, the existing kernel code for this flow is broken
> and you resist my attempt to use it as is, and

What does that mean?

The current kernel code that uses this interface is all ethernet
drivers. They use 6 bytes, in exactly the same format as the
LLADDR. Do you see something different?

And yes, the kernel code for this interface is very much inconsistent
with the rest of netlink. Carrying a LLADDR in a fixed and unsized 32
byte array and calling it a 'mac' is awful.

Looking through the comments from 2010 on the patch that introduced
this function, it is pretty clear that the original author did not
understand netlink.

The userspace side in iproute2 is even worse:

static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
{
...
        fprintf(fp, "\n    vf %d MAC %s", vf_mac->vf,
	                ll_addr_n2a((unsigned char *)&vf_mac->mac,
			                ETH_ALEN, 0, b1, sizeof(b1)));

Hard coded ETH_ALEN! For a 32 byte array! BLECK!

And the set side:

static int iplink_parse_vf(int vf, int *argcp, char ***argvp,
                           struct iplink_req *req, int dev_index)
{
...
      len = ll_addr_a2n((char *)ivm.mac, 32, *argv);
      if (len < 0)
          return -1;
      addattr_l(&req->n, sizeof(*req), IFLA_VF_MAC, &ivm, sizeof(ivm));

It just passes random stack garbage into the kernel if the
user doesn't provide 32 bytes on the command line. BLECK!

The kernel didn't even validate the *length* of address sub attribute
until recently:

https://patchwork.ozlabs.org/patch/436901/

So, to your question yes, I think it is broken / horribly written.

But we can live with it, we don't need to radically change it.

So. looking at this interface there are two options for the 32
byte fixed array:
 1) 'mac' is some arbitary interface specific thing with no relation
    to anything else in netlink
 2) 'mac' is a LLADDR, and must match the construction of IFLA_ADDRESS

Speaking generally - netlink is a general purpose interface - #2 is by
far the saner of the two choices.

Why? Because 'ip' needs to know the length of the address for the two
places noted above. Using the same length as IFLA_ADDRESS is simple
and 100% general.

Using a lookup on the device type to guess the address length is 8
because it is infiniband, but 6 for ethernet, and who knows what if
the tool is old and doesn't recognize the device type.. Crappy. Not
The Netlink Way.

Thus, the very best choice (at this point) is to use the length of the
IFLA_ADDRESS (aka dev->addr_len), which is completely general and
works for every single device choice.

Is #1 possible? Yes. But linux-rdma is the wrong place to make that
choice. Ask netdev.

[ .. if we could go back in time fixing the original patch to use a
  *sized* array for IFLA_VF_MAC would have been ideal, and made #1
  much more desirable, but that is not straightforward now, and we
  have to live with the bad choices someone else made. Try not to
  add to this crap .. ]

> > Yes, it is ugly that the PF side's ndo_get_vf_config cannot return the
> > same 20 byte address of the VF's ipoib interface, but I think that is
> > less ugly than forcing a different address format just for the vf calls.
> 
> you claim that what I propose is uglier from the fact that the PF can't
> by no means return the 20 VF's IPoIB address and it's OK if I only let
> the PF configure 20 bytes with part (say four) of them being arbitrary
> and only have consistent 20B get/set by the PF.
> 
> Would you be happier if the ipoib ndo_set_vf_mac ndo be
> 
> 1. getting 20B from user-space and treating 16 of them as the VF
> subnet-prefix (8B) + vGUID (8B)
> 
> 2. checking that the subnet-prefix  is correct
> 
> 3. provision the vGUID through PF driver / the verb I proposed for the VF
> 
> ???
> 
> on the way back, for the get_vf_config
> 
> 1. read the VF vGUID from the PF IB driver through the verbs
> 
> 2. add the port subnet prefix
> 
> 3. return 20B to user-space
> 
> ???

Yes, absolutely. That maintains the general invariant for LLADDR and
the broad integrity of the existing netlink design.

Using anything else is also short sited. Current hardware is
restricted to a randomized QPN, and current IBA is restricted to a
single GID prefix: But are you *absolutely* sure that will be the case
forever? This is a UAPI, after all.

I'm not: For instance controlling/choosing the QPN solves real issues
with live migration of VF IPoIB - the need to invalidate remote ND
entires when the QPN changes on migration.

> I am not @ the point to change start changing this specific netlink flow.

We are not changing anything, we are trying to guess what netdev wants
to use the fixed 32 byte array called 'mac' for, when we are not going
to store a MAC in it. This is new territory, never been done before.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                             ` <20150525223235.GA9858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-05-26  3:33                               ` Or Gerlitz
@ 2015-05-26 16:53                               ` Doug Ledford
       [not found]                                 ` <1432659226.28905.151.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 31+ messages in thread
From: Doug Ledford @ 2015-05-26 16:53 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

[-- Attachment #1: Type: text/plain, Size: 5965 bytes --]

On Mon, 2015-05-25 at 16:32 -0600, Jason Gunthorpe wrote:
> On Tue, May 26, 2015 at 12:50:52AM +0300, Or Gerlitz wrote:
> > On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe
> > <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote:
> > >
> > >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port
> > >> GUID, and as Jason noted the user/kernel API allows to deliver up to
> > >> 32 bytes between user and kernel under the set_vf_mac flow
> > >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through
> > >> **non-modified** ip tool and net/core/rtnetlink.c things just work -
> > >> I can set eight bytes value to be the virtual port GUID :
> > >
> > > Was I not perfectly clear? You have to use the 20 byte LLADDR format here:
> > >
> > >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22
> > 
> > Jason,
> > 
> > I am aiming to provision the VF IB end-node address == port GUID (vGUID)
> > in the same manner that VF Eth end-node address is their MAC, not
> > more, not less.
> 
> I perfectly understand what you are trying to do.
> 
> I care about the design and consistency of netlink - and that means
> there is one LLADDR definition for a net device, and every single netlink
> message that touches a LLADDR uses that definition - for IPoIB that is 20
> bytes.

So, here's my $.02.

First, 8 bytes is the right size for the data to be transferred.  The
remaining 12 bytes of the IPoIB LLADDR are generated constructs at least
partially controlled by elements outside of this machine, and moreover,
they are *not* the *actual* LLADDR of the underlying IB device and the
20 byte LLADDR never hits the wire.  So, it's an artificial construct.

The fact that the 20 byte LLADDR of IPoIB is problematic is born out by
multiple facts:  we only use 8 bytes in DHCP requests and responses,
only use 8 bytes in matching IPoIB netdevices against HWADDR entries in
the sysconfig/ifcfg-* files, only use 8 bytes for matching the udev
rules to rename IPoIB devices, etc.

This clearly demonstrates that attaching the netlink ndo points to the
IPoIB netdevice is, let's just say, a shortcut.  The *real* device that
we should be manipulating is not the IPoIB device, but the verbs device.
If we were manipulating the verbs device, no one would be arguing for
anything other than the exact proper item: an 8 byte GUID.  Of course,
the problem then becomes an issue of "can we attach the underlying verbs
device to the netlink ndo mechanism?"  And the answer there is "Not
easily, and Dave Miller would likely throw a fit" (and rightfully so, as
all netdevice entries are assumed to be IP capable and we aren't, and so
we can't really be used by the core of the net stack).

Since I don't see us getting permission from Dave to attach verbs
devices to the netstack just so we can A) be ignored by the core of the
netstack while also B) having only 3 or 4 of the ndo_ entry points
defined so we can co-opt their infrastructure for our own uses.

That leaves using the IPoIB device.  In which case, here's what I want:

If a user passes in a MAC addresses, please use the following checks:

if (mac_len != 8 && mac_len != 20)
	err out
set guid as lower 8 bytes of mac, leave remainder untouched regardless
of what's passed in

You will note that if someone passes in a mac size of 8, this is not an
error.  It is consistent with the existing NetworkManager device
settings, ifup-ib device settings, and udev device settings that an 8
byte MAC for IPoIB devices is the right thing to uniquely identify the
device.

Here's why I disagree with Jason.

If you have an app that queries a device, sees that it has a 20byte MAC,
and then tries to set that MAC, it may have no specific knowledge of how
IPoIB relates to the underlying verbs device and could just be trying to
set a MAC blindly.  And that's ok, but we aren't going to honor the
entire MAC they pass in, and that's likely to confuse the app (or the
person running the app).

On the other hand, if we advertise a 20 byte MAC, and the app/person
passes in an 8 byte MAC instead, then that implicitly denotes that the
app/person *knows* this is an IPoIB device, and *knows* that we are
actually setting the underlying GUID of the verbs device, and is aware
of and prepared for the fact that the entire 20 byte MAC is not in fact
settable.  IPoIB and verbs devices are an exception to the rest of the
netdevice stack, and this usage tacitly acknowledges that fact.  So I am
of the opinion that if an app/user knows exactly how accessing the IPoIB
device is merely as a passthrough to the verbs device and that 8 bytes
is the right size, then we should accept it instead of requiring them to
supply fake data they know won't be used.

> To violate that design invariant needs an incredibly strong argument,
> and you haven't made it.

Hopefully my argument satisfies your requirements.

> This allows generic code to work with the LLADDR - for instance 'ip
> link set vf mac' should have checked the size of the IFLA_ADDRESS and
> demanded that the address argument is the same number of bytes. It is
> very broken the command happily accepts an 8 byte and 6 byte argument
> for the same device.
> 
> Yes, it is ugly that the PF side's ndo_get_vf_config cannot return the
> same 20 byte address of the VF's ipoib interface, but I think that is
> less ugly than forcing a different address format just for the vf calls.
> 
> If you have doubts then *ask netdev*. Ask them if ndo_set_vf_mac must
> follow the same address size and format as IFLA_ADDRESS, or if we can
> use something else.
> 
> Such a netlink architecture choice is beyond the authority of
> linux-rdma.
> 
> Jason

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                         ` <CAJ3xEMhG2W6WzxC4Kc2CFmdMwTRUF5ppBgcDZ6SMA=kgJowUpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-25 22:32                           ` Jason Gunthorpe
@ 2015-05-26 16:53                           ` Doug Ledford
  1 sibling, 0 replies; 31+ messages in thread
From: Doug Ledford @ 2015-05-26 16:53 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

[-- Attachment #1: Type: text/plain, Size: 1717 bytes --]

On Tue, 2015-05-26 at 00:50 +0300, Or Gerlitz wrote:
> On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote:
> >
> >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port
> >> GUID, and as Jason noted the user/kernel API allows to deliver up to
> >> 32 bytes between user and kernel under the set_vf_mac flow
> >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through
> >> **non-modified** ip tool and net/core/rtnetlink.c things just work -
> >> I can set eight bytes value to be the virtual port GUID :
> >
> > Was I not perfectly clear? You have to use the 20 byte LLADDR format here:
> >
> >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22
> 
> Jason,
> 
> I am aiming to provision the VF IB end-node address == port GUID (vGUID)
> in the same manner that VF Eth end-node address is their MAC, not
> more, not less.
> 
> 20 bytes are the lladdr of IPoIB devices which isn't the VF IB
> end-node address but
> rather made of flags (1B) + QPN (3B) + subnet prefix (8B) + VF GUID --
> way more then
> the virtualization system care or can provision.
> 
> >>                 Port GUID: 0x2211ffeeddccbbaa
> 
> > The byte order got screwed up someplace.
> 
> thx, will fix

Or, while you are working on this, please attempt setting a larger than
12byte vlan->pkey.  The actual ndo_ for setting a vlan uses a u16, so
it's possible that just like the MAC setting, our pkey setting might
work unaltered too.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                 ` <1432659226.28905.151.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-26 18:13                                   ` Jason Gunthorpe
       [not found]                                     ` <20150526181336.GD11800-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-26 18:13 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Tue, May 26, 2015 at 12:53:46PM -0400, Doug Ledford wrote:
> > I perfectly understand what you are trying to do.
> > 
> > I care about the design and consistency of netlink - and that means
> > there is one LLADDR definition for a net device, and every single netlink
> > message that touches a LLADDR uses that definition - for IPoIB that is 20
> > bytes.
> 
> So, here's my $.02.
> 
> First, 8 bytes is the right size for the data to be transferred.  The
> remaining 12 bytes of the IPoIB LLADDR are generated constructs at
> least

Today only 8 bytes are settable, but the GID prefix and QPN are both
logically something that *could* be set by the PF in some future.

> partially controlled by elements outside of this machine, and moreover,
> they are *not* the *actual* LLADDR of the underlying IB device and the
> 20 byte LLADDR never hits the wire.  So, it's an artificial construct.

Yes, it is ugly that we are using a IPoIB netdev to configure the RDMA
device.

If you want to suggest that we use the RDMA Netlink interface to do
this, that would make sense to me too. But I thought the whole point
was to provide something somewhat software compatible with the
ethernet world..

> The fact that the 20 byte LLADDR of IPoIB is problematic is born out by
> multiple facts:  we only use 8 bytes in DHCP requests and responses,
> only use 8 bytes in matching IPoIB netdevices against HWADDR entries in
> the sysconfig/ifcfg-* files, only use 8 bytes for matching the udev
> rules to rename IPoIB devices, etc.

The LLADDR for IPoIB *is* 20 bytes.

Truncating it down is *broken userspace*:
 - DHCP: Not sending the full 20 bytes in the client request means the
   server cannot unicast the reply. This causes all sorts of problems
   and is discouraged in the RFCs these days.
 - ifcfg/udev/networkmanager: So what happens when I do
    ip link add link ib0 name ib0.1 type ipoib
   And get two IPoIB interfaces with the same GUID? I doubt any sane
   user would want to apply the same config to those two interfaces.

There is a reason the kernel uses 20 byte - it is the unique LLADDR
for the interface, it is the value it stuffs in a ND packet. If
userspace takes the 20 byte LLADDR and mangles it so it is no longer
unique, then that is it's prerogative.

But that doesn't mean the kernel should ever accept the 8 byte value
for anything.

> This clearly demonstrates that attaching the netlink ndo points to the
> IPoIB netdevice is, let's just say, a shortcut.  The *real* device that
> we should be manipulating is not the IPoIB device, but the verbs
> device.

Yep.

> That leaves using the IPoIB device.  In which case, here's what I want:
> 
> If a user passes in a MAC addresses, please use the following checks:
> 
> if (mac_len != 8 && mac_len != 20)
> 	err out

As I've already said, this is not possible. The person who designed
this netlink extension screwed it up and 'mac_len' is fixed to 32
bytes.

Even if they didn't, the netlink common code should validate the input
length for the drivers and confirm it is dev->addr_len, as is done for
all other cases.

Unbreaking it is a UAPI change, not impossible, but do we really care
enough about 8 or 20 to push for that?

> If you have an app that queries a device, sees that it has a 20byte MAC,
> and then tries to set that MAC, it may have no specific knowledge of how
> IPoIB relates to the underlying verbs device and could just be trying to
> set a MAC blindly.  And that's ok, but we aren't going to honor the
> entire MAC they pass in, and that's likely to confuse the app (or the
> person running the app).

Yes, I agree with this, but no matter what we do, apps using this
feature need some kind of IB specific knowledge to format the set.

If that specific knoweldge is 'ignore the LL_ADDRESS size and use 8
byte' or 'specially format the 20 byte' seems pretty much equal to me.

> On the other hand, if we advertise a 20 byte MAC, and the app/person
> passes in an 8 byte MAC instead, then that implicitly denotes that the
> app/person *knows* this is an IPoIB device, and *knows* that we are
> actually setting the underlying GUID of the verbs device, and is aware
> of and prepared for the fact that the entire 20 byte MAC is not in fact
> settable.

Well, as above, this is unworkable, but even if we could specify the
length, I think this is goofy.

You've been concentrating on the set side, but there is a get side
too, and the response to set is a get packet.

What does get return? If we accept 8 or 20, then get must return 20.

Should get response to a set of 8 return 8 or 20? 

The kernel UAPI is much cleaner if there is one case: 20 byte LLADDR.

The kernel internal API is much cleaner if drivers don't have the
option to use two sizes for their LLADDR.

Just checking that the 20 byte address is sane (right GID prefix, 0
QPN, right flags) should be enough for the app to denote it knows what
it is doing.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                     ` <20150526181336.GD11800-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-26 20:32                                       ` Doug Ledford
       [not found]                                         ` <1432672378.28905.178.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Doug Ledford @ 2015-05-26 20:32 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

[-- Attachment #1: Type: text/plain, Size: 4930 bytes --]

On Tue, 2015-05-26 at 12:13 -0600, Jason Gunthorpe wrote:
> On Tue, May 26, 2015 at 12:53:46PM -0400, Doug Ledford wrote:
> > 
> Yes, it is ugly that we are using a IPoIB netdev to configure the RDMA
> device.
> 
> If you want to suggest that we use the RDMA Netlink interface to do
> this,

I had thought of exactly that...

>  that would make sense to me too. But I thought the whole point
> was to provide something somewhat software compatible with the
> ethernet world..

Not so much ethernet world as netdevice world.  The iproute2 program is
used to configure any and all netdevices, ethernet or otherwise.  Right
now, we can abuse it to do the same here, but it uses the netdevice ndo_
ops, not rtnetlink to accomplish what it does, so we are limited in how
we do thing if we want to maintain tool usage.

> > The fact that the 20 byte LLADDR of IPoIB is problematic is born out by
> > multiple facts:  we only use 8 bytes in DHCP requests and responses,
> > only use 8 bytes in matching IPoIB netdevices against HWADDR entries in
> > the sysconfig/ifcfg-* files, only use 8 bytes for matching the udev
> > rules to rename IPoIB devices, etc.
> 
> The LLADDR for IPoIB *is* 20 bytes.
> 
> Truncating it down is *broken userspace*:
>  - DHCP: Not sending the full 20 bytes in the client request means the
>    server cannot unicast the reply. This causes all sorts of problems
>    and is discouraged in the RFCs these days.

Reference?  The RFCs I've read (4390 -> 4361 -> 3315) list a number of
options (three at the moment), but the LLADDR options all call for using
a LLADDR from a device that is a permanent part of the machine (not
common with add in cards), so the option most commonly used in IB is
option 2, DUID Assigned by Vendor, aka GUID.  According to that,
truncating to 8 bytes is precisely what you are supposed to do.  And, at
least in all current Red Hat products, that's exactly how dhcp client
creates the client-id.

>  - ifcfg/udev/networkmanager: So what happens when I do
>     ip link add link ib0 name ib0.1 type ipoib
>    And get two IPoIB interfaces with the same GUID? I doubt any sane
>    user would want to apply the same config to those two interfaces.

No, they probably don't want to apply the same rules to both interfaces.
I'm not entirely sure I agree with the argument though.  I fully
expected this to fail without a pkey argument on the ip command line.
The net stack doesn't allow users to do the same thing with Ethernet
devices, so I'm not sure we shouldn't be disallowing this as opposed to
creating duplicate devices that are identical in all ways except name.

> As I've already said, this is not possible. The person who designed
> this netlink extension screwed it up and 'mac_len' is fixed to 32
> bytes.

You're right.  We don't have a way of passing in the length we wish to
set versus the size of the message.

> Even if they didn't, the netlink common code should validate the input
> length for the drivers and confirm it is dev->addr_len, as is done for
> all other cases.
> 
> Unbreaking it is a UAPI change, not impossible, but do we really care
> enough about 8 or 20 to push for that?

In truth, at least right now, it's all moot.  Since we can't set the
subnet prefix, the qpn, or the flags, anything above 8 bytes is
immutable regardless of how many bytes we pass in.  So even if we say we
aren't going to change the UAPI and for everything to 20, the real world
result is that 8 works exactly the same and has no functional
difference.

> What does get return? If we accept 8 or 20, then get must return 20.

The get has to return 20 regardless.  It's the only accepted means of
getting all 20 bytes of the LLADDR.  There was a hack long ago (that
might even still work, but I haven't checked) where you could call the
ioctl to get the lladdr, which was supposed to be limited to 6 bytes,
but if the caller of the ioctl passed in an oversized buffer, the kernel
would write the full 20 bytes as long as the buffer could hold it.  But
that hack may have long since been closed up.

> Should get response to a set of 8 return 8 or 20? 

20.

> The kernel UAPI is much cleaner if there is one case: 20 byte LLADDR.

As I mentioned above, due to the constraints of what we can set and the
fact that we don't actually *tell* the kernel what size we've filled out
in the 32byte struct, it's really all moot at this point.

> The kernel internal API is much cleaner if drivers don't have the
> option to use two sizes for their LLADDR.
> 
> Just checking that the 20 byte address is sane (right GID prefix, 0
> QPN, right flags) should be enough for the app to denote it knows what
> it is doing.

Yes, checking the subnet prefix and flags would work.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                         ` <1432672378.28905.178.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-26 21:11                                           ` Jason Gunthorpe
       [not found]                                             ` <20150526211114.GB4502-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-26 21:11 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote:

> Not so much ethernet world as netdevice world.  The iproute2 program is
> used to configure any and all netdevices, ethernet or otherwise.  Right
> now, we can abuse it to do the same here, but it uses the netdevice ndo_
> ops, not rtnetlink to accomplish what it does, so we are limited in how
> we do thing if we want to maintain tool usage.

Hmm? iproute2 does it over rtnetlink?

> > The LLADDR for IPoIB *is* 20 bytes.
> > 
> > Truncating it down is *broken userspace*:
> >  - DHCP: Not sending the full 20 bytes in the client request means the
> >    server cannot unicast the reply. This causes all sorts of problems
> >    and is discouraged in the RFCs these days.
> 
> Reference?  The RFCs I've read (4390 -> 4361 -> 3315) list a number of
> options (three at the moment), but the LLADDR options all call for using

I'm talking about this part from RFC 4390:

   As described above, the link-layer address is unavailable to the DHCP
   server because the link-layer address is larger than the "chaddr"
   field length.  As a result, the server cannot unicast its reply to
   the client.  Therefore, a DHCP client MUST request that the server
   send a broadcast reply by setting the BROADCAST flag when IPoIB
   Address Resolution Protocol (ARP) is not possible, i.e., in
   situations where the client does not know its IP address.

AFAIK, nobody ever solved this, and it actually does cause real world
problems for cloud stuff as there is limited randomness in the
TID. This is the network side of DHCP.

> a LLADDR from a device that is a permanent part of the machine (not
> common with add in cards), so the option most commonly used in IB is
> option 2, DUID Assigned by Vendor, aka GUID.  According to that,
> truncating to 8 bytes is precisely what you are supposed to do.  And, at
> least in all current Red Hat products, that's exactly how dhcp client
> creates the client-id.

Using the GUID as the client-id is sort of OK from a policy
perspective (ie what IP should I use), but it doesn't help the network
side, and it breaks down completely when you create child interfaces.

Basically, the dhcp server not having the LLADDR at all is a pretty
big hack.. No other network I know of runs DHCP like that.

> >  - ifcfg/udev/networkmanager: So what happens when I do
> >     ip link add link ib0 name ib0.1 type ipoib
> >    And get two IPoIB interfaces with the same GUID? I doubt any sane
> >    user would want to apply the same config to those two interfaces.
> 
> No, they probably don't want to apply the same rules to both interfaces.
> I'm not entirely sure I agree with the argument though.  I fully
> expected this to fail without a pkey argument on the ip command
> line.

Does that matter to the above tools? Are they using PKey,GUID as their
key?

> The net stack doesn't allow users to do the same thing with Ethernet
> devices, so I'm not sure we shouldn't be disallowing this as opposed to
> creating duplicate devices that are identical in all ways except name.

The netstack doesn't allow it for ethernet because it would create a
2nd identical LLADDR, and LLADDRs must be unique.

Because the QPN is part of the LLADDR IB can create two interfaces on
the same physical port that are completely separated by hardware. Read
Haggi's email, he explains how they plan to use this to create
interfaces that can be delegated to namespaces. It is not a bad idea
really.. 

So prepare for a world where each namespace has a child IPoIB
interface with a unique QPN, but the same Pkey and GUID as the
host. The breakage from assuming GUID == unique will become a problem.

> > Unbreaking it is a UAPI change, not impossible, but do we really care
> > enough about 8 or 20 to push for that?
> 
> In truth, at least right now, it's all moot.  Since we can't set the
> subnet prefix, the qpn, or the flags, anything above 8 bytes is
> immutable regardless of how many bytes we pass in.  So even if we say we
> aren't going to change the UAPI and for everything to 20, the real world
> result is that 8 works exactly the same and has no functional
> difference.

Not quite, in the 20 byte format the 8 bytes of the GUID are in the
last 8/20 bytes, so the app would have to place 12 zeros and then the
GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID)

This is why the question of 'what is ILFA_VF_MAC' is so important,
every option presented (MAC,GUID,LLADDR) are incompatible with each
other.

> > What does get return? If we accept 8 or 20, then get must return 20.
> 
> The get has to return 20 regardless.  It's the only accepted means of
> getting all 20 bytes of the LLADDR.

You are conflating IFLA_ADDRESS and IFLA_VF_MAC.

IFLA_VF_MAC could be 8 byte and IFLA_ADDRESS could be 20, I think that
makes no sense, but it wouldn't break existing stuff.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                             ` <20150526211114.GB4502-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-27 13:01                                               ` Or Gerlitz
  2015-05-27 14:14                                               ` Doug Ledford
  1 sibling, 0 replies; 31+ messages in thread
From: Or Gerlitz @ 2015-05-27 13:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On 5/27/2015 12:11 AM, Jason Gunthorpe wrote:
> On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote:
>
>>>   - ifcfg/udev/networkmanager: So what happens when I do
>>>      ip link add link ib0 name ib0.1 type ipoib
>>>     And get two IPoIB interfaces with the same GUID? I doubt any sane
>>>     user would want to apply the same config to those two interfaces.
>> No, they probably don't want to apply the same rules to both interfaces.
>> I'm not entirely sure I agree with the argument though.  I fully
>> expected this to fail without a pkey argument on the ip command
>> line.
> Does that matter to the above tools? Are they using PKey,GUID as their
> key?
>
>> The net stack doesn't allow users to do the same thing with Ethernet
>> devices, so I'm not sure we shouldn't be disallowing this as opposed to
>> creating duplicate devices that are identical in all ways except name.
> The netstack doesn't allow it for ethernet because it would create a
> 2nd identical LLADDR, and LLADDRs must be unique.
>
> Because the QPN is part of the LLADDR IB can create two interfaces on
> the same physical port that are completely separated by hardware. Read
> Haggi's email, he explains how they plan to use this to create
> interfaces that can be delegated to namespaces. It is not a bad idea
> really..
>
> So prepare for a world where each namespace has a child IPoIB
> interface with a unique QPN, but the same Pkey and GUID as the
> host. The breakage from assuming GUID == unique will become a problem.
>
>>> Unbreaking it is a UAPI change, not impossible, but do we really care
>>> enough about 8 or 20 to push for that?
>> In truth, at least right now, it's all moot.  Since we can't set the
>> subnet prefix, the qpn, or the flags, anything above 8 bytes is
>> immutable regardless of how many bytes we pass in.  So even if we say we
>> aren't going to change the UAPI and for everything to 20, the real world
>> result is that 8 works exactly the same and has no functional
>> difference.
> Not quite, in the 20 byte format the 8 bytes of the GUID are in the
> last 8/20 bytes, so the app would have to place 12 zeros and then the
> GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID)
>
> This is why the question of 'what is ILFA_VF_MAC' is so important,
> every option presented (MAC,GUID,LLADDR) are incompatible with each
> other.

I agree with Doug that to be practical here, libvirt and Co. would 
really want to use rtnetlink based provisioning of IB VFs, at least in a 
similar manner done for Eth VFs.

So with this assumption at hand, my vote goes to having user-space to 
provide the eight bytes of vGUID through the ndo_set_vf_mac call into 
IPoIB.

I don't see the real value of user space providing the four zero bytes 
(19-16) and the 8 bytes of the subnet prefix provided by the SM.

My personal thinking is that the important thing to address is 
consistency between what the virtualization system provisions on the 
host (ndo_set_vf_mac) to the DHCP server scheme they build.

Do we have a go here?

Also few comments on DHCP:

If we're talking on different vlans/Eth or pkey/IB - it's totally OK for 
two entities (== IPoIB instances under IB) on the physical subnet to use 
the same identifier (IB/GUID, Eth/MAC) if they are on two different L2 
broadcast domains. The DHCP server is expected to have a different 
mapping scheme per such virtual L2 subnet.

For SRIOV, we don't expect two VFs on the network to use the same vGUID, 
so DHCP wise we should be OK. Today the Client-ID works fine for SRIOV 
schemes which are based on 8byte vGUIDs.

Re two IPoIB child devices using the same GUID and the same pkey, we can 
enhance the system and take advantage of IB Alias GUIDs which today are 
only used for SRIOV for Para-Virtual and other environments too, thanks 
for the heads up on the necessity of doing so.

>
>>> What does get return? If we accept 8 or 20, then get must return 20.
>> The get has to return 20 regardless.  It's the only accepted means of
>> getting all 20 bytes of the LLADDR.
> You are conflating IFLA_ADDRESS and IFLA_VF_MAC.
>
> IFLA_VF_MAC could be 8 byte and IFLA_ADDRESS could be 20, I think that
> makes no sense, but it wouldn't break existing stuff.
>

Just to make sure we're on the same page, this thread deals with using 
rtnetlink's IFLA_VF_MAC(== struct ifla_vf_mac) for provisioning vGUID 
for IB VFs, through the PF IPoIB interface, not attempting to use 
IFLA_ADDRESS.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                             ` <20150526211114.GB4502-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-05-27 13:01                                               ` Or Gerlitz
@ 2015-05-27 14:14                                               ` Doug Ledford
       [not found]                                                 ` <1432736046.28905.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 31+ messages in thread
From: Doug Ledford @ 2015-05-27 14:14 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

[-- Attachment #1: Type: text/plain, Size: 10211 bytes --]

On Tue, 2015-05-26 at 15:11 -0600, Jason Gunthorpe wrote:
> On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote:
> 
> > Not so much ethernet world as netdevice world.  The iproute2 program is
> > used to configure any and all netdevices, ethernet or otherwise.  Right
> > now, we can abuse it to do the same here, but it uses the netdevice ndo_
> > ops, not rtnetlink to accomplish what it does, so we are limited in how
> > we do thing if we want to maintain tool usage.
> 
> Hmm? iproute2 does it over rtnetlink?

Yes, it does.  Sorry for being imprecise.  It uses the specific netlink
packet that maps back to the predefined ndo_ entry point and so you
can't use that netlink on anything other than a netdevice with a defined
ndo_ entry point.  We can't use it, for example, to extend the new RDMA
netlink operations that Kaike posted without also making modifications
to iproute2 to know about the new netlink.

> > > The LLADDR for IPoIB *is* 20 bytes.
> > > 
> > > Truncating it down is *broken userspace*:
> > >  - DHCP: Not sending the full 20 bytes in the client request means the
> > >    server cannot unicast the reply. This causes all sorts of problems
> > >    and is discouraged in the RFCs these days.
> > 
> > Reference?  The RFCs I've read (4390 -> 4361 -> 3315) list a number of
> > options (three at the moment), but the LLADDR options all call for using
> 
> I'm talking about this part from RFC 4390:
> 
>    As described above, the link-layer address is unavailable to the DHCP
>    server because the link-layer address is larger than the "chaddr"
>    field length.  As a result, the server cannot unicast its reply to
>    the client.  Therefore, a DHCP client MUST request that the server
>    send a broadcast reply by setting the BROADCAST flag when IPoIB
>    Address Resolution Protocol (ARP) is not possible, i.e., in
>    situations where the client does not know its IP address.
> 
> AFAIK, nobody ever solved this, and it actually does cause real world
> problems for cloud stuff as there is limited randomness in the
> TID. This is the network side of DHCP.
> 
> > a LLADDR from a device that is a permanent part of the machine (not
> > common with add in cards), so the option most commonly used in IB is
> > option 2, DUID Assigned by Vendor, aka GUID.  According to that,
> > truncating to 8 bytes is precisely what you are supposed to do.  And, at
> > least in all current Red Hat products, that's exactly how dhcp client
> > creates the client-id.
> 
> Using the GUID as the client-id is sort of OK from a policy
> perspective (ie what IP should I use), but it doesn't help the network
> side, and it breaks down completely when you create child interfaces.

Actually, no, it doesn't.  My standard test network inside Red Hat uses
child interfaces and dhcp exclusively and it all works as expected.
This is because each child interface has its own unique pkey and all of
the devices include the pkey in the broadcast address, so one child does
not see another child's broadcast reply, not the parent's.

It does, however, mean that you only want a single link on a given pkey
for a given device, which is why adding a second ipoib link without a
unique pkey seems so broken to me.

> Basically, the dhcp server not having the LLADDR at all is a pretty
> big hack.. No other network I know of runs DHCP like that.

That's how they decided to solve the issue in the RFCs, so that's what
we have.

> > >  - ifcfg/udev/networkmanager: So what happens when I do
> > >     ip link add link ib0 name ib0.1 type ipoib
> > >    And get two IPoIB interfaces with the same GUID? I doubt any sane
> > >    user would want to apply the same config to those two interfaces.
> > 
> > No, they probably don't want to apply the same rules to both interfaces.
> > I'm not entirely sure I agree with the argument though.  I fully
> > expected this to fail without a pkey argument on the ip command
> > line.
> 
> Does that matter to the above tools? Are they using PKey,GUID as their
> key?

They are pkey aware, yes.  There is the parent device, and all
non-default pkey devices are listed as a pkey device and given a pkey
number and they share the same GUID.  They are considered children of
the parent, just like with vlan setups.

> > The net stack doesn't allow users to do the same thing with Ethernet
> > devices, so I'm not sure we shouldn't be disallowing this as opposed to
> > creating duplicate devices that are identical in all ways except name.
> 
> The netstack doesn't allow it for ethernet because it would create a
> 2nd identical LLADDR, and LLADDRs must be unique.

And as far as the configuration scripts (at least on Red Hat) as well as
NetworkManager is concerned, the unique requirements are GUID/P_Key.
But, the P_Key doesn't show up in the LLADDR, only in the broadcast
address, so only the GUID is checked in the LLADDR.  All IPoIB devices
are either parent devices (meaning no PKEY field specified in the config
file) in which case they are the first started and match the GUID, or
they are PKEY devices and are started only after their parent device is
brought up (and here I think they actually match on parent name, not on
GUID, but I would have to double check that).

> Because the QPN is part of the LLADDR IB can create two interfaces on
> the same physical port that are completely separated by hardware. Read
> Haggi's email, he explains how they plan to use this to create
> interfaces that can be delegated to namespaces. It is not a bad idea
> really.. 

Yes, it is actually.  The whole reason we went to GUID matching long ago
was because of this exact issue.  Actually, allow me to be perfectly
clear about this point: the qpn changing on IPoIB interfaces *drove* us
to drop the qpn from device matching.  In addition, links that were slow
to come up (such as 40GBit links that needed more time to synchronize at
40GBit/s than it took to try and start the IPoIB interface) drove us to
drop the subnet prefix from the match because you could start to
configure your IPoIB interfaces before OpenSM had a chance to tell you
what your subnet prefix actually was.  So we were forced to stick with
only the 8byte GUID.  You can not put a transient item into your device
identifier, *ever*.  The problem here is that even a slight change in
ordering of module loading can change the qpn that each IPoIB device
gets.  Or, even easier to demonstrate, was the fact that if you rmmod
ib_ipoib; modprobe ib_ipoib, then all of your devices will fail to start
because all of your qpns no longer match up!

The *only* way this will ever be a workable item is if we A) reserve a
number of queue pairs from the driver specifically for IPoIB use and B)
specify which queue pairs go to which IPoIB devices at IPoIB module load
time.  Short of that, using qpn in the device identifier is a complete
non-starter.  And that still doesn't address the subnet prefix either.
If you want that included, then we need the ability to tell ib_ipoib
what each link's subnet prefix will be once it talks to the SM so we
don't have the wrong device identifier because we booted up faster than
the link synchronized.

> 
> So prepare for a world where each namespace has a child IPoIB
> interface with a unique QPN, but the same Pkey and GUID as the
> host. The breakage from assuming GUID == unique will become a problem.

See above.  Without further changes, this is a total non-starter.  And
this will require coordination with initscripts and NetworkManager (at
least, I don't know if other distros use their own custom tools here).

> > > Unbreaking it is a UAPI change, not impossible, but do we really care
> > > enough about 8 or 20 to push for that?
> > 
> > In truth, at least right now, it's all moot.  Since we can't set the
> > subnet prefix, the qpn, or the flags, anything above 8 bytes is
> > immutable regardless of how many bytes we pass in.  So even if we say we
> > aren't going to change the UAPI and for everything to 20, the real world
> > result is that 8 works exactly the same and has no functional
> > difference.
> 
> Not quite, in the 20 byte format the 8 bytes of the GUID are in the
> last 8/20 bytes, so the app would have to place 12 zeros and then the
> GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID)
> 
> This is why the question of 'what is ILFA_VF_MAC' is so important,
> every option presented (MAC,GUID,LLADDR) are incompatible with each
> other.

For Ethernet devices, it's the MAC.  The equivalent of MAC on IB is the
GUID.  I would leave it at that.  IPoIB devices are constructs on top of
the GUID/link, and you can have 10 IPoIB interfaces between the parent
and children, but we don't need to specify all of those LLADDRs, we just
need to give a unique GUID and allow the guest OS to create their own
IPoIB devices on top of that.

> > > What does get return? If we accept 8 or 20, then get must return 20.
> > 
> > The get has to return 20 regardless.  It's the only accepted means of
> > getting all 20 bytes of the LLADDR.
> 
> You are conflating IFLA_ADDRESS and IFLA_VF_MAC.
> 
> IFLA_VF_MAC could be 8 byte and IFLA_ADDRESS could be 20, I think that
> makes no sense, but it wouldn't break existing stuff.

Sorry, you're right.  I was thinking about getting the address of the
parent IPoIB device we are talking to, not getting the VF_MAC.  For
VF_MAC, I would say it should be the 8 byte GUID.  In a world where we
are configuring a base device via a parent base device, then as you
suggest it would make no sense that the two would be different sizes.
But in a world where we are using a construct on top of the base device
in order to access config space for the base device, and we are
configuring SRIOV instances of the base device for use in a guest (and
not our construct we are accessing things through), then our access
device and configure device need not have the same address size.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                                 ` <1432736046.28905.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-27 17:11                                                   ` Jason Gunthorpe
       [not found]                                                     ` <20150527171143.GB9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-27 17:11 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Wed, May 27, 2015 at 10:14:06AM -0400, Doug Ledford wrote:
> > Because the QPN is part of the LLADDR IB can create two interfaces on
> > the same physical port that are completely separated by hardware. Read
> > Haggi's email, he explains how they plan to use this to create
> > interfaces that can be delegated to namespaces. It is not a bad idea
> > really.. 
> 
> Yes, it is actually.  The whole reason we went to GUID matching long ago
> was because of this exact issue.

I reflected on this some more last night, and yes, I am leaning toward
'bad idea' direction too.

Too much stuff breaks if you create multiple children with the same
pkey/guid:
 - RDMA CM cannot disambiguate CM packets between them
 - DHCP cannot tell them apart
 - Net scripts/network manager won't work
 - IPv6 becomes totally broken

That means the namespace stuff will have to create children using GUID
aliases..

> The *only* way this will ever be a workable item is if we A) reserve a
> number of queue pairs from the driver specifically for IPoIB use and B)
> specify which queue pairs go to which IPoIB devices at IPoIB module
> load

This basic idea is exactly why I think we should stick with the 20
byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the
VF what QPN to use for IPoIB (if we ever see HW support to implement that)

If we use 8 bytes then that route is closed off forever.

> > Not quite, in the 20 byte format the 8 bytes of the GUID are in the
> > last 8/20 bytes, so the app would have to place 12 zeros and then the
> > GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID)
> > 
> > This is why the question of 'what is ILFA_VF_MAC' is so important,
> > every option presented (MAC,GUID,LLADDR) are incompatible with each
> > other.
> 
> For Ethernet devices, it's the MAC.  The equivalent of MAC on IB is the
> GUID.  I would leave it at that.

Yes, both arguments can be made:
  - Our netlink end point is targetting an IPoIB interface, and
    the equivelent to an Ethernet MAC in IPoIB language is the LLADDR.
  - Our netlink interface is targetting the hardware under the IPoIB
    interface and that MAC equivilent is the GUID

> IPoIB devices are constructs on top of
> the GUID/link, and you can have 10 IPoIB interfaces between the parent
> and children, but we don't need to specify all of those LLADDRs, we just
> need to give a unique GUID and allow the guest OS to create their own
> IPoIB devices on top of that.

As I've said, I would like to see netdev review that idea before we
merge any patches..

There are pragmatic downsides to the 8 byte choice: Userspace
completely looses the ability to size the address without a table
based on link type. That is terrible in the context of netlink's
design. For instance iproute2 would need IB specific code to format
the 'ip link show' (review print_vfinfo in iproute2) and to length
check 'ip link set vf mac'

If we do use 8, then it would be ideal (and my strong preference) to
also fix the IFLA_VF_MAC message to have a working length. I think
that could be done compatibly with a bit of work. At least that way
iproute2 can be kept clean when it learns to do IB, and we could have
the option again of using 20 someday if we need.

So to be clear, to go with the 8 byte option I suggest:
 - Engage netdev/iproute and confirm they are philosophically OK
   with IFLA_VF_MAC != IFLA_ADDRESS
 - Make a kernel patch to properly size the IFLA_VF_MAC message
 - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo
   instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat)
 - Drop in the IB patch
 
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                                     ` <20150527171143.GB9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-05-27 17:53                                                       ` Doug Ledford
       [not found]                                                         ` <1432749191.28905.243.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 31+ messages in thread
From: Doug Ledford @ 2015-05-27 17:53 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

[-- Attachment #1: Type: text/plain, Size: 5184 bytes --]

On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote:
> On Wed, May 27, 2015 at 10:14:06AM -0400, Doug Ledford wrote:
> > > Because the QPN is part of the LLADDR IB can create two interfaces on
> > > the same physical port that are completely separated by hardware. Read
> > > Haggi's email, he explains how they plan to use this to create
> > > interfaces that can be delegated to namespaces. It is not a bad idea
> > > really.. 
> > 
> > Yes, it is actually.  The whole reason we went to GUID matching long ago
> > was because of this exact issue.
> 
> I reflected on this some more last night, and yes, I am leaning toward
> 'bad idea' direction too.
> 
> Too much stuff breaks if you create multiple children with the same
> pkey/guid:
>  - RDMA CM cannot disambiguate CM packets between them
>  - DHCP cannot tell them apart
>  - Net scripts/network manager won't work
>  - IPv6 becomes totally broken
> 
> That means the namespace stuff will have to create children using GUID
> aliases..

Glad we agree on that ;-)

> > The *only* way this will ever be a workable item is if we A) reserve a
> > number of queue pairs from the driver specifically for IPoIB use and B)
> > specify which queue pairs go to which IPoIB devices at IPoIB module
> > load
> 
> This basic idea is exactly why I think we should stick with the 20
> byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the
> VF what QPN to use for IPoIB (if we ever see HW support to implement that)
> 
> If we use 8 bytes then that route is closed off forever.

And that's exactly as it should be.  If we allow setting all 20 bytes
via the VF_MAC calls, then we violate the "guests should behave like
they are on bare metal as much as possible" rule.  As a host, we get a
GUID and if we want to control the QPNs for IPoIB (and indeed if we want
to control how many IPoIB interfaces and on what P_Keys) then we must
create config files in /etc/sysconfig/network-scripts (on Red Hat,
similar requirements on other distros) that would instruct the OS to
create exactly what we want.  But, the key point is that we are only
given a GUID, and we must create everything else from our config files.
Guests should be the same way.  They only get the GUID to start, then
the guest disk image with its self contained configuration will take
over and control the rest.

> > > Not quite, in the 20 byte format the 8 bytes of the GUID are in the
> > > last 8/20 bytes, so the app would have to place 12 zeros and then the
> > > GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID)
> > > 
> > > This is why the question of 'what is ILFA_VF_MAC' is so important,
> > > every option presented (MAC,GUID,LLADDR) are incompatible with each
> > > other.
> > 
> > For Ethernet devices, it's the MAC.  The equivalent of MAC on IB is the
> > GUID.  I would leave it at that.
> 
> Yes, both arguments can be made:
>   - Our netlink end point is targetting an IPoIB interface, and
>     the equivelent to an Ethernet MAC in IPoIB language is the LLADDR.
>   - Our netlink interface is targetting the hardware under the IPoIB
>     interface and that MAC equivilent is the GUID
> 
> > IPoIB devices are constructs on top of
> > the GUID/link, and you can have 10 IPoIB interfaces between the parent
> > and children, but we don't need to specify all of those LLADDRs, we just
> > need to give a unique GUID and allow the guest OS to create their own
> > IPoIB devices on top of that.
> 
> As I've said, I would like to see netdev review that idea before we
> merge any patches..
> 
> There are pragmatic downsides to the 8 byte choice: Userspace
> completely looses the ability to size the address without a table
> based on link type. That is terrible in the context of netlink's
> design.

Well, let's just be clear: netlink/iproute2 screwed the pooch on their
implementation of this stuff.  Any solution that doesn't include fixing
this up in some way is not really a good solution.

>  For instance iproute2 would need IB specific code to format
> the 'ip link show' (review print_vfinfo in iproute2) and to length
> check 'ip link set vf mac'
> 
> If we do use 8, then it would be ideal (and my strong preference) to
> also fix the IFLA_VF_MAC message to have a working length. I think
> that could be done compatibly with a bit of work. At least that way
> iproute2 can be kept clean when it learns to do IB, and we could have
> the option again of using 20 someday if we need.
> 
> So to be clear, to go with the 8 byte option I suggest:
>  - Engage netdev/iproute and confirm they are philosophically OK
>    with IFLA_VF_MAC != IFLA_ADDRESS
>  - Make a kernel patch to properly size the IFLA_VF_MAC message
>  - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo
>    instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat)
>  - Drop in the IB patch

Sounds like a reasonable plan.

Or, this is your patch set, are you going to pick up these action items?

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                                         ` <1432749191.28905.243.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-27 18:29                                                           ` Jason Gunthorpe
  2015-05-27 21:58                                                           ` Or Gerlitz
  1 sibling, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-27 18:29 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Wed, May 27, 2015 at 01:53:11PM -0400, Doug Ledford wrote:

> > This basic idea is exactly why I think we should stick with the 20
> > byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the
> > VF what QPN to use for IPoIB (if we ever see HW support to implement that)
> > 
> > If we use 8 bytes then that route is closed off forever.
> 
> And that's exactly as it should be.  If we allow setting all 20 bytes
> via the VF_MAC calls, then we violate the "guests should behave like
> they are on bare metal as much as possible" rule.  As a host, we get a
> GUID and if we want to control the QPNs for IPoIB (and indeed if we
> want

No, that doesn't really work. The PF's QPN pool is shared between all
VF guests. The VF HCA does a call to the PF driver to get a free QPN
to use. The VM cannot unilaterally select a QPN without global
coordination.

In other words, if a QPN is to be forced for IPoIB then it must come
from the PF, which must get it from the orchestration software, which
must globally coordinate to ensure the QPN is available on every PF
the VM may want to run on. Similar to how the MAC must be globally
unique.

Will this ever be done? Probably not. But if it was, this interface
would be the sanest way to go.

Anyhow - as far as I'm concerned, fixing the IFLA_VF_MAC to have a
proper length nullifies my concern. If we really needed we can switch
to a 20 byte format with good compatability options in the future once
we have a proper length.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                                         ` <1432749191.28905.243.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-05-27 18:29                                                           ` Jason Gunthorpe
@ 2015-05-27 21:58                                                           ` Or Gerlitz
       [not found]                                                             ` <CAJ3xEMjXXKy=DSeksTFEX-GAN=nYm_6nn5msvsYOwnp0roEHJQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 31+ messages in thread
From: Or Gerlitz @ 2015-05-27 21:58 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Wed, May 27, 2015 at 8:53 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote:
> Well, let's just be clear: netlink/iproute2 screwed the pooch on their
> implementation of this stuff.  Any solution that doesn't include fixing
> this up in some way is not really a good solution.

>>  For instance iproute2 would need IB specific code to format
>> the 'ip link show' (review print_vfinfo in iproute2) and to length
>> check 'ip link set vf mac'

>> If we do use 8, then it would be ideal (and my strong preference) to
>> also fix the IFLA_VF_MAC message to have a working length. I think
>> that could be done compatibly with a bit of work. At least that way
>> iproute2 can be kept clean when it learns to do IB, and we could have
>> the option again of using 20 someday if we need.
>>

> Sounds like a reasonable plan.
> Or, this is your patch set, are you going to pick up these action items?

Let see

>> So to be clear, to go with the 8 byte option I suggest:

>>  - Engage netdev/iproute and confirm they are philosophically OK
>>    with IFLA_VF_MAC != IFLA_ADDRESS

the last thing netdev are is having philosophical views, they are very
practical (and non-perfect and happy and lively community), the one
thing before the last  netdev are is caring for the rdma subsystem. If
something has to change @ their direct part, we should come with
patches.

>>  - Make a kernel patch to properly size the IFLA_VF_MAC message

You mean the below structure which is expected by the kernel after
they see the IFLA_VF_MAC NL attribute

struct ifla_vf_mac {
         __u32 vf;
         __u8 mac[32]; /* MAX_ADDR_LEN */
 };

How you thought to patch things such that the size of the address
provided by user-space will propagate into the kernel w.o breaking the
NL ABI here?

Why not just take the eight lower bytes and set them NL --> ipoib -->  PF driver

>>  - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo
>>    instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat)

I was thinking to patch iproute to sense the link type: if eth print
six bytes, if ipoib print 8 bytes, simple.

>>  - Drop in the IB patch

sounds good.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
       [not found]                                                             ` <CAJ3xEMjXXKy=DSeksTFEX-GAN=nYm_6nn5msvsYOwnp0roEHJQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-27 22:42                                                               ` Jason Gunthorpe
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2015-05-27 22:42 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai

On Thu, May 28, 2015 at 12:58:59AM +0300, Or Gerlitz wrote:
> On Wed, May 27, 2015 at 8:53 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote:
> > Well, let's just be clear: netlink/iproute2 screwed the pooch on their
> > implementation of this stuff.  Any solution that doesn't include fixing
> > this up in some way is not really a good solution.
> 
> >>  For instance iproute2 would need IB specific code to format
> >> the 'ip link show' (review print_vfinfo in iproute2) and to length
> >> check 'ip link set vf mac'
> 
> >> If we do use 8, then it would be ideal (and my strong preference) to
> >> also fix the IFLA_VF_MAC message to have a working length. I think
> >> that could be done compatibly with a bit of work. At least that way
> >> iproute2 can be kept clean when it learns to do IB, and we could have
> >> the option again of using 20 someday if we need.
> >>
> 
> > Sounds like a reasonable plan.
> > Or, this is your patch set, are you going to pick up these action items?
> 
> Let see
> 
> >> So to be clear, to go with the 8 byte option I suggest:
> 
> >>  - Engage netdev/iproute and confirm they are philosophically OK
> >>    with IFLA_VF_MAC != IFLA_ADDRESS
> 
> the last thing netdev are is having philosophical views, they are very
> practical (and non-perfect and happy and lively community), the one
> thing before the last  netdev are is caring for the rdma subsystem. If
> something has to change @ their direct part, we should come with
> patches.

Yes, you need to talk in patches, I will suggest you start with this
(totally untested, quickly thrown together):

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 8de36824018d..715b59e9925a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1172,7 +1172,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 				nla_nest_cancel(skb, vfinfo);
 				goto nla_put_failure;
 			}
-			if (nla_put(skb, IFLA_VF_MAC, sizeof(vf_mac), &vf_mac) ||
+			if (nla_put(skb, IFLA_VF_MAC, sizeof(u32) + dev->addr_len, &vf_mac) ||
 			    nla_put(skb, IFLA_VF_VLAN, sizeof(vf_vlan), &vf_vlan) ||
 			    nla_put(skb, IFLA_VF_RATE, sizeof(vf_rate),
 				    &vf_rate) ||
@@ -1292,7 +1292,8 @@ static const struct nla_policy ifla_vfinfo_policy[IFLA_VF_INFO_MAX+1] = {
 };
 
 static const struct nla_policy ifla_vf_policy[IFLA_VF_MAX+1] = {
-	[IFLA_VF_MAC]		= { .len = sizeof(struct ifla_vf_mac) },
+	[IFLA_VF_MAC]		= { .type = NLA_BINARY,
+				    .len = sizeof(struct ifla_vf_mac) },
 	[IFLA_VF_VLAN]		= { .len = sizeof(struct ifla_vf_vlan) },
 	[IFLA_VF_TX_RATE]	= { .len = sizeof(struct ifla_vf_tx_rate) },
 	[IFLA_VF_SPOOFCHK]	= { .len = sizeof(struct ifla_vf_spoofchk) },
@@ -1447,6 +1448,21 @@ static int do_setvfinfo(struct net_device *dev, struct nlattr *attr)
 		switch (nla_type(vf)) {
 		case IFLA_VF_MAC: {
 			struct ifla_vf_mac *ivm;
+			int sz;
+
+			/* Legacy compatability, old iproute2 will pass in 32
+			 * bytes for everything, assume it is a 6 byte MAC
+			 * because that is all old iproute really supported.
+			 */
+			sz = nla_len(attr);
+			if (sz == 36)
+				sz = 10;
+
+			if (sz != (sizeof(u32) + dev->addr_len)) {
+				err = -EINVAL;
+				break;
+			}
+
 			ivm = nla_data(vf);
 			err = -EOPNOTSUPP;
 			if (ops->ndo_set_vf_mac)

And make the argument why the above is OK.
 - Reader side only ever read 6 bytes anyhow, so it doesn't matter
   that ifla_vf_mac is undersized in the RTA_DATA
 - Confirm any other users of this (openstack?) you know are similarly
   OK.
 - Old write side will always send the full 36 byte struct, so
   assume 6 bytes.
 - Justify this as needing to add IB support next which does not use a
   6 byte addr_len
   New code that wants to work with IB will have to use correct lengths.

Then based on that iproute2 can be fixed in the two areas I pointed
out, doing that at the same time will probably earn some rep.

Finally you can make a patch that switches addr_len to something else
- that would trigger the philosophical question of 'what actually is a
IFLA_VF_MAC', if everyone is OK with changing it then golden.

> >>  - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo
> >>    instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat)
> 
> I was thinking to patch iproute to sense the link type: if eth print
> six bytes, if ipoib print 8 bytes, simple.

This is against the design ideals of netlink: a tool like iproute
should not need link specific knowledge to parse kernel messages.

Now is the best time to make this fix because we can use the sz == 36
so sz == 10 assumption.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2015-05-27 22:42 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-21 16:24 [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs Or Gerlitz
     [not found] ` <1432225447-6536-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-21 16:24   ` [PATCH RFC 1/3] IB/IPoIB: Support SRIOV standard configuration Or Gerlitz
     [not found]     ` <1432225447-6536-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-05-21 18:46       ` Jason Gunthorpe
     [not found]         ` <20150521184613.GD6771-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-21 20:05           ` Or Gerlitz
     [not found]             ` <CAJ3xEMgJvXjg3aFbTNEudj9QWMfP4==eBq0ccuhjuVJsv9mRmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-21 21:01               ` Jason Gunthorpe
2015-05-21 21:05           ` Doug Ledford
     [not found]             ` <1432242331.28905.67.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-21 21:21               ` Jason Gunthorpe
2015-05-21 16:24   ` [PATCH RFC 2/3] IB/mlx4: Refactor Alias GUID storing Or Gerlitz
2015-05-21 16:24   ` [PATCH RFC 3/3] IB/mlx4: Add support for SRIOV VF management Or Gerlitz
2015-05-21 16:40   ` [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs Doug Ledford
     [not found]     ` <1432226406.28905.22.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-21 19:55       ` Or Gerlitz
     [not found]         ` <CAJ3xEMjzpqnQuA=0HQaN8noVq04d9BkVvEWGY7Lq5ZntVTKm4w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-21 21:11           ` Doug Ledford
     [not found]             ` <1432242708.28905.73.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-21 21:44               ` Jason Gunthorpe
2015-05-25 20:04               ` Or Gerlitz
     [not found]                 ` <CAJ3xEMh3BaxJzCu9mV9m6ZMwgrDaO2UvTyS1i=DEPq9nuLX3oA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-25 21:14                   ` Jason Gunthorpe
     [not found]                     ` <20150525211433.GA9186-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-25 21:50                       ` Or Gerlitz
     [not found]                         ` <CAJ3xEMhG2W6WzxC4Kc2CFmdMwTRUF5ppBgcDZ6SMA=kgJowUpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-25 22:32                           ` Jason Gunthorpe
     [not found]                             ` <20150525223235.GA9858-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-26  3:33                               ` Or Gerlitz
     [not found]                                 ` <CAJ3xEMjSU0xVWyqd8v_-OO5JvsHycGTU6gg=BHpZD8PSqRfzQg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-26  5:58                                   ` Jason Gunthorpe
2015-05-26 16:53                               ` Doug Ledford
     [not found]                                 ` <1432659226.28905.151.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-26 18:13                                   ` Jason Gunthorpe
     [not found]                                     ` <20150526181336.GD11800-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-26 20:32                                       ` Doug Ledford
     [not found]                                         ` <1432672378.28905.178.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-26 21:11                                           ` Jason Gunthorpe
     [not found]                                             ` <20150526211114.GB4502-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-27 13:01                                               ` Or Gerlitz
2015-05-27 14:14                                               ` Doug Ledford
     [not found]                                                 ` <1432736046.28905.215.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-27 17:11                                                   ` Jason Gunthorpe
     [not found]                                                     ` <20150527171143.GB9909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-05-27 17:53                                                       ` Doug Ledford
     [not found]                                                         ` <1432749191.28905.243.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-27 18:29                                                           ` Jason Gunthorpe
2015-05-27 21:58                                                           ` Or Gerlitz
     [not found]                                                             ` <CAJ3xEMjXXKy=DSeksTFEX-GAN=nYm_6nn5msvsYOwnp0roEHJQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-27 22:42                                                               ` Jason Gunthorpe
2015-05-26 16:53                           ` Doug Ledford

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.