All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2 00/12] Add Ethernet IPoIB driver
@ 2012-08-01 17:09 Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 01/12] IB/ipoib: Add rtnl_link_ops support Or Gerlitz
                   ` (11 more replies)
  0 siblings, 12 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Or Gerlitz, Erez Shitrit

Changes from V1: 
- applied feedback from Ben H.
 - add stats64
 - use memcpy on ethtool
 - remove double indexing on ethtool and unused variables
 - remove extra tab ...
 - remove HW_VLAN_FILTER and the  add/kill_vid ndo entries

- applied feedback from Dave Miller to avoid using sysfs
 - added rtnl_link_ops support in ipoib and use them to add/delete childs
 - added usage to ndo_add/remove_slave in eipoib
 - added support for eipoib VIFs in rtnetlink and new ndo op to configure 
   that to the eipoib device

- some more little cleanups of unused variables 

changes from V0:
 - applied feedback from Eric/Dave - RX flow uses only the last 20 bytes of skb->cb[]
 - applied feedback from Ben H. on ethtool changes
 - fix sparse error on function which should be made static
 - made the netdev features related code of the driver more elegant/robust
 - used _bh locking in some paths which used plain rw locking in V0
 - some code rearrangements in flows that send ARPs

The eIPoIB driver provides a standard Ethernet netdevice over 
the InfiniBand IPoIB interface.

Some services can run only on top of Ethernet L2 interfaces, and cannot be
bound to an IPoIB interface. With this new driver, these services can run
seamlessly.

Main use case of the driver is the Ethernet Virtual Switching used in
virtualized environments, where an eipoib netdevice can be used as a 
Physical Interface (PIF) in the hypervisor domain, and allow other 
guests Virtual Interfaces (VIF) connected to the same Virtual Switch 
to run over the InfiniBand fabric.

This driver supports L2 Switching (Direct Bridging) as well as other L3
Switching modes (e.g. NAT).

Whenever an IPoIB interface is created, one eIPoIB PIF netdevice 
will be created. The default naming scheme is as in other Ethernet 
interfaces: ethX, for example, on a system with two IPoIB interfaces,
ib0 and ib1, two interfaces will be created ethX and ethX+1 When "X" 
is the next free Ethernet number in the system.

Using "ethtool -i " over the new interface can tell on which IPoIB
PIF interface that interface is above.  For example: driver: eth_ipoib:ib0 
indicates that eth3 is the Ethernet interface over the ib0 IPoIB interface.

The driver can be used as independent interface or to serve in
virtualization environment as the physical layer for the virtual
interfaces on the virtual guest.

The driver interface (eipoib interface or which is also referred to as parent) 
uses slave interfaces, IPoIB clones, which are the VIFs described above.

VIFs interfaces are enslaved/released from the eipoib driver on demand, according 
to the management interface provided to user space.

The management interface for the driver uses rtnl interface. Via these interfaces 
the driver gets details on new VIF's to manage. The driver can 
enslave new VIF (IPoIB cloned interface) or detaches from it.
The driver can also configure the slave to support specific mac/vlan of virtual guest.

Here's an example script on how the managment interface looks
like with the new rtnetlink code we added, this is based
on a patch to iproute2 which will be posted once the interface
is agreed.

#!/bin/bash

# create IPoIB clone (ib0.1), enslave it to eIPoIB (eth4), add VIF to 
# serve the master, the provided MAC is the master's one

ip link add link ib0 name ib0.1 type ipoib index 1
ip link set dev ib0.1 master eth4
ip link set dev eth4 vif ib0.1 mac 00:02:C9:43:3B:F1

# add a bridge whose uplink is the eIPOIB device

brctl addbr br2
brctl addif br2 eth4
ifconfig br2 12.134.41.1/16 up

# create IPoIB clone (ib0.2), enslave it to eIPoIB (eth4), add VIF to 
# serve a guest, the provided MAC is the guest's one, vnet1 is a hypervisor
# NIC (e.g tap) that serves that guest

ip link add link ib0 name ib0.2 type ipoib index 2
ip link set dev ib0.2 master eth4
ip link set dev eth4 vif ib0.2 mac 52:54:00:67:E9:BD
brctl addif br2 vnet1

# repeat the above with vlan

ip link add link ib0 name ib0.8003 type ipoib pkey 0x8003
ip link set dev ib0.8003 master eth4
ip link set dev eth4 vif ib0.8003 mac 00:02:C9:43:3B:F1 vlan 3

vconfig add eth4 3 
ifconfig eth4.3 up

brctl addbr br3
brctl addif br3 eth4.3
ifconfig br3 13.134.41.1/16 up

ip link add link ib0 name ib0.8003.1 type ipoib pkey 0x8003 index 1
ip link set dev ib0.8003.1 master eth4
ip link set dev eth4 vif ib0.8003.1 mac 52:54:00:DE:3A:23 vlan 3

cat /sys/class/net/eth4/eth/vifs

============= END OF SCRIPT =======

Note: Each ethX interface has at least one ibX.Y slave to serve the PIF
itself, in the VIFs list of ethX you'll notice that ibX.1 is always created 
to serve applications running from the Hypervisor on top of ethX interface directly.

For IB applications that require native IPoIB interfaces (e.g. RDMA-CM), the
original ipoib interfaces ibX can still be used.  For example, RDMA-CM and
eth_ipoib drivers can co-exist and make use of IPoIB

The last patch of this series was made such that the series works as is over net-next.
In parallel to effort a patch to modify IPoIB such that it doesn't assume dst/neighbour 
on the skb was pushed upstream, its commit b63b70d87741 "IPoIB: Use a private hash table 
for path lookup in xmit path", once present in net-next this patch can simply be dropped.

Erez Shitrit (10):
  include/linux: Add private flags for IPoIB interfaces
  IB/ipoib: Add support for acting as VIF
  net: Add ndo_set_vif_param operation to serve eIPoIB VIFs
  net/core: Add rtnetlink support to vif parameters
  net/eipoib: Add private header file
  net/eipoib: Add ethtool file support
  net/eipoib: Add main driver functionality
  net/eipoib: Add sysfs support
  net/eipoib: Add Makefile, Kconfig and MAINTAINERS entries
  IB/ipoib: Add support for transmission of skbs w.o dst/neighbour

Or Gerlitz (2):
  IB/ipoib: Add rtnl_link_ops support
  IB/ipoib: Add support for clones / multiple childs on the same partition

 Documentation/infiniband/ipoib.txt           |   15 +
 MAINTAINERS                                  |    6 +
 drivers/infiniband/ulp/ipoib/Makefile        |    3 +-
 drivers/infiniband/ulp/ipoib/ipoib.h         |   16 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c      |    9 +
 drivers/infiniband/ulp/ipoib/ipoib_ib.c      |    8 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c    |   38 +-
 drivers/infiniband/ulp/ipoib/ipoib_netlink.c |  119 ++
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c    |   89 +-
 drivers/net/Kconfig                          |   15 +
 drivers/net/Makefile                         |    1 +
 drivers/net/eipoib/Makefile                  |    4 +
 drivers/net/eipoib/eth_ipoib.h               |  215 +++
 drivers/net/eipoib/eth_ipoib_ethtool.c       |  114 ++
 drivers/net/eipoib/eth_ipoib_main.c          | 1953 ++++++++++++++++++++++++++
 drivers/net/eipoib/eth_ipoib_sysfs.c         |  435 ++++++
 include/linux/if.h                           |    2 +
 include/linux/if_link.h                      |   16 +
 include/linux/netdevice.h                    |    5 +-
 include/rdma/e_ipoib.h                       |   54 +
 net/core/rtnetlink.c                         |   42 +-
 21 files changed, 3117 insertions(+), 42 deletions(-)
 create mode 100644 drivers/infiniband/ulp/ipoib/ipoib_netlink.c
 create mode 100644 drivers/net/eipoib/Makefile
 create mode 100644 drivers/net/eipoib/eth_ipoib.h
 create mode 100644 drivers/net/eipoib/eth_ipoib_ethtool.c
 create mode 100644 drivers/net/eipoib/eth_ipoib_main.c
 create mode 100644 drivers/net/eipoib/eth_ipoib_sysfs.c
 create mode 100644 include/rdma/e_ipoib.h

Cc: Erez Shitrit <erezsh@mellanox.co.il>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH V2 01/12] IB/ipoib: Add rtnl_link_ops support
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 02/12] IB/ipoib: Add support for clones / multiple childs on the same partition Or Gerlitz
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Or Gerlitz, Erez Shitrit

Add rtnl_link_ops to IPoIB, with the first usage being child device
create/delete through them. For that end, did little refactoring
of the ipoib_vlan_add/delete code which is now used by both the
sysfs and the rtnl_link_ops code.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 Documentation/infiniband/ipoib.txt           |    3 +
 drivers/infiniband/ulp/ipoib/Makefile        |    3 +-
 drivers/infiniband/ulp/ipoib/ipoib.h         |    8 ++
 drivers/infiniband/ulp/ipoib/ipoib_main.c    |   10 ++-
 drivers/infiniband/ulp/ipoib/ipoib_netlink.c |  110 ++++++++++++++++++++++++++
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c    |   80 ++++++++++++-------
 6 files changed, 181 insertions(+), 33 deletions(-)
 create mode 100644 drivers/infiniband/ulp/ipoib/ipoib_netlink.c

diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt
index 64eeb55..f2cfe26 100644
--- a/Documentation/infiniband/ipoib.txt
+++ b/Documentation/infiniband/ipoib.txt
@@ -24,6 +24,9 @@ Partitions and P_Keys
   The P_Key for any interface is given by the "pkey" file, and the
   main interface for a subinterface is in "parent."
 
+  Child interface create/delete can also be done using IPoIB's
+  rtnl_link_ops, where childs created using either way behave the same.
+
 Datagram vs Connected modes
 
   The IPoIB driver supports two modes of operation: datagram and
diff --git a/drivers/infiniband/ulp/ipoib/Makefile b/drivers/infiniband/ulp/ipoib/Makefile
index 3090100..e5430dd 100644
--- a/drivers/infiniband/ulp/ipoib/Makefile
+++ b/drivers/infiniband/ulp/ipoib/Makefile
@@ -5,7 +5,8 @@ ib_ipoib-y					:= ipoib_main.o \
 						   ipoib_multicast.o \
 						   ipoib_verbs.o \
 						   ipoib_vlan.o \
-						   ipoib_ethtool.o
+						   ipoib_ethtool.o \
+						   ipoib_netlink.o
 ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM)		+= ipoib_cm.o
 ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG)	+= ipoib_fs.o
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 86df632..adcfa66 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -493,6 +493,14 @@ void ipoib_event(struct ib_event_handler *handler,
 int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey);
 int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 
+int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv, u16 pkey);
+struct net_device * __ipoib_vlan_delete(struct net_device *pdev, u16 pkey);
+
+int  __init ipoib_netlink_init(void);
+void __exit ipoib_netlink_fini(void);
+
+void ipoib_setup(struct net_device *dev);
+
 void ipoib_pkey_poll(struct work_struct *work);
 int ipoib_pkey_dev_delay_open(struct net_device *dev);
 void ipoib_drain_cq(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index bbee4b2..637980b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -995,7 +995,7 @@ static const struct net_device_ops ipoib_netdev_ops = {
 	.ndo_neigh_setup	 = ipoib_neigh_setup_dev,
 };
 
-static void ipoib_setup(struct net_device *dev)
+void ipoib_setup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -1402,8 +1402,15 @@ static int __init ipoib_init_module(void)
 	if (ret)
 		goto err_sa;
 
+	ret = ipoib_netlink_init();
+	if (ret)
+		goto err_client;
+
 	return 0;
 
+err_client:
+	ib_unregister_client(&ipoib_client);
+
 err_sa:
 	ib_sa_unregister_client(&ipoib_sa_client);
 	destroy_workqueue(ipoib_workqueue);
@@ -1416,6 +1423,7 @@ err_fs:
 
 static void __exit ipoib_cleanup_module(void)
 {
+	ipoib_netlink_fini();
 	ib_unregister_client(&ipoib_client);
 	ib_sa_unregister_client(&ipoib_sa_client);
 	ipoib_unregister_debugfs();
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_netlink.c b/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
new file mode 100644
index 0000000..2b8905e
--- /dev/null
+++ b/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
@@ -0,0 +1,110 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. -  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/module.h>
+#include <net/rtnetlink.h>
+#include "ipoib.h"
+
+enum {
+	IFLA_IPOIB_UNSPEC,
+	IFLA_IPOIB_CHILD_PKEY,
+	__IFLA_IPOIB_MAX
+};
+
+#define IFLA_IPOIB_MAX (__IFLA_IPOIB_MAX - 1)
+
+static const struct nla_policy ipoib_policy[IFLA_IPOIB_MAX + 1] = {
+	[IFLA_IPOIB_CHILD_PKEY]		= { .type = NLA_U16 },
+};
+
+static int ipoib_new_child_link(struct net *src_net, struct net_device *dev,
+			       struct nlattr *tb[], struct nlattr *data[])
+{
+	struct net_device *pdev;
+	struct ipoib_dev_priv *ppriv;
+	u16 child_pkey;
+	int err;
+
+	if (!tb[IFLA_LINK])
+		return -EINVAL;
+
+	pdev = __dev_get_by_index(src_net, nla_get_u32(tb[IFLA_LINK]));
+	if (!pdev)
+		return -ENODEV;
+	ppriv = netdev_priv(pdev);
+
+	if (!data || !data[IFLA_IPOIB_CHILD_PKEY]) {
+		ipoib_warn(ppriv, "no pkey specified, failing request\n");
+		return -EINVAL;
+	} else
+		child_pkey  = nla_get_u16(data[IFLA_IPOIB_CHILD_PKEY]);
+
+	err = __ipoib_vlan_add(ppriv, netdev_priv(dev), child_pkey);
+
+	return err;
+}
+
+static void ipoib_unregister_child_dev(struct net_device *dev, struct list_head *head)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	(void)__ipoib_vlan_delete(priv->parent, priv->pkey);
+}
+
+static size_t ipoib_get_size(const struct net_device *dev)
+{
+	return nla_total_size(2);	/* IFLA_IPOIB_CHILD_PKEY */
+}
+
+static struct rtnl_link_ops ipoib_link_ops __read_mostly = {
+	.kind		= "ipoib",
+	.maxtype	= IFLA_IPOIB_MAX,
+	.policy		= ipoib_policy,
+	.priv_size	= sizeof(struct ipoib_dev_priv),
+	.setup		= ipoib_setup,
+	.newlink	= ipoib_new_child_link,
+	.dellink	= ipoib_unregister_child_dev,
+	.get_size	= ipoib_get_size,
+};
+
+int __init ipoib_netlink_init(void)
+{
+	return rtnl_link_register(&ipoib_link_ops);
+}
+
+void __exit ipoib_netlink_fini(void)
+{
+	rtnl_link_unregister(&ipoib_link_ops);
+}
+
+MODULE_ALIAS_RTNL_LINK("ipoib");
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index d7e9740..4a9b7e6 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -49,19 +49,13 @@ static ssize_t show_parent(struct device *d, struct device_attribute *attr,
 }
 static DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL);
 
-int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
+
+int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv,
+		     u16 pkey)
 {
-	struct ipoib_dev_priv *ppriv, *priv;
-	char intf_name[IFNAMSIZ];
 	int result;
+	struct ipoib_dev_priv *tpriv;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
-	ppriv = netdev_priv(pdev);
-
-	if (!rtnl_trylock())
-		return restart_syscall();
 	mutex_lock(&ppriv->vlan_mutex);
 
 	/*
@@ -70,26 +64,16 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 	 */
 	if (ppriv->pkey == pkey) {
 		result = -ENOTUNIQ;
-		priv = NULL;
 		goto err;
 	}
 
-	list_for_each_entry(priv, &ppriv->child_intfs, list) {
-		if (priv->pkey == pkey) {
+	list_for_each_entry(tpriv, &ppriv->child_intfs, list) {
+		if (tpriv->pkey == pkey) {
 			result = -ENOTUNIQ;
-			priv = NULL;
 			goto err;
 		}
 	}
 
-	snprintf(intf_name, sizeof intf_name, "%s.%04x",
-		 ppriv->dev->name, pkey);
-	priv = ipoib_intf_alloc(intf_name);
-	if (!priv) {
-		result = -ENOMEM;
-		goto err;
-	}
-
 	priv->max_ib_mtu = ppriv->max_ib_mtu;
 	/* MTU will be reset when mcast join happens */
 	priv->dev->mtu   = IPOIB_UD_MTU(priv->max_ib_mtu);
@@ -137,11 +121,10 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 	list_add_tail(&priv->list, &ppriv->child_intfs);
 
 	mutex_unlock(&ppriv->vlan_mutex);
-	rtnl_unlock();
-
 	return 0;
 
 sysfs_failed:
+	result = -ENOMEM;
 	ipoib_delete_debug_files(priv->dev);
 	unregister_netdevice(priv->dev);
 
@@ -150,25 +133,46 @@ register_failed:
 
 err:
 	mutex_unlock(&ppriv->vlan_mutex);
-	rtnl_unlock();
-	if (priv)
-		free_netdev(priv->dev);
-
 	return result;
 }
 
-int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey)
+int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 {
-	struct ipoib_dev_priv *ppriv, *priv, *tpriv;
-	struct net_device *dev = NULL;
+	struct ipoib_dev_priv *ppriv, *priv;
+	char intf_name[IFNAMSIZ];
+	int result;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
 	ppriv = netdev_priv(pdev);
 
+	snprintf(intf_name, sizeof intf_name, "%s.%04x",
+		 ppriv->dev->name, pkey);
+	priv = ipoib_intf_alloc(intf_name);
+	if (!priv)
+		return -ENOMEM;
+
 	if (!rtnl_trylock())
 		return restart_syscall();
+
+	result = __ipoib_vlan_add(ppriv, priv, pkey);
+
+	if (result)
+		free_netdev(priv->dev);
+
+	rtnl_unlock();
+
+	return result;
+}
+
+struct net_device * __ipoib_vlan_delete(struct net_device *pdev, u16 pkey)
+{
+	struct ipoib_dev_priv *ppriv, *priv, *tpriv;
+	struct net_device *dev = NULL;
+
+	ppriv = netdev_priv(pdev);
+
 	mutex_lock(&ppriv->vlan_mutex);
 	list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) {
 		if (priv->pkey == pkey) {
@@ -180,6 +184,20 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey)
 		}
 	}
 	mutex_unlock(&ppriv->vlan_mutex);
+
+	return dev;
+}
+
+int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey)
+{
+	struct net_device *dev;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+	dev = __ipoib_vlan_delete(pdev, pkey);
 	rtnl_unlock();
 
 	if (dev) {
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 02/12] IB/ipoib: Add support for clones / multiple childs on the same partition
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 01/12] IB/ipoib: Add rtnl_link_ops support Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 03/12] include/linux: Add private flags for IPoIB interfaces Or Gerlitz
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Or Gerlitz, Erez Shitrit

Allow creating "clone" child interfaces which further partition an
IPoIB interface to sub interfaces who either use the same pkey as
their parent or use the same pkey as already created child interface.

Each child now has a child index, which together with the pkey is
used as the identifier of the created network device.

Clone interfaces can only be created using rtnl_link_ops, where the user
is allowed to provide a pkey and index. If no pkey is provided, the
parent pkey is used and if no index is provided an index of zero
is used.

A major use case for clone childs is for virtualization purposes, where
a per VM NIC is desired at the hypervisor level, such as the solution
provided by the newly introduced Ethernet IPoIB driver.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
---
 Documentation/infiniband/ipoib.txt           |   12 ++++++++++++
 drivers/infiniband/ulp/ipoib/ipoib.h         |    7 +++++--
 drivers/infiniband/ulp/ipoib/ipoib_netlink.c |   21 +++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c    |   19 +++++++++++--------
 4 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt
index f2cfe26..6ef87bb 100644
--- a/Documentation/infiniband/ipoib.txt
+++ b/Documentation/infiniband/ipoib.txt
@@ -27,6 +27,18 @@ Partitions and P_Keys
   Child interface create/delete can also be done using IPoIB's
   rtnl_link_ops, where childs created using either way behave the same.
 
+Clones
+  Its possible to further partition an IPoIB interfaces, and create
+  "clone" child interfaces which either use the same pkey as their
+  parent, or as an already created child interface. Each child now has
+  a child index, which together with the pkey is used as the identifier
+  of the created network device.
+
+  Clone interfaces can only be created using rtnl_link_ops, where the user
+  is allowed to provide a pkey and index. If no pkey is provided, the
+  parent pkey is used and if no index is provided an index of zero
+  is used.
+
 Datagram vs Connected modes
 
   The IPoIB driver supports two modes of operation: datagram and
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index adcfa66..0df6668 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -332,6 +332,7 @@ struct ipoib_dev_priv {
 	struct net_device *parent;
 	struct list_head child_intfs;
 	struct list_head list;
+	u16		 child_index;
 
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 	struct ipoib_cm_dev_priv cm;
@@ -493,8 +494,10 @@ void ipoib_event(struct ib_event_handler *handler,
 int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey);
 int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 
-int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv, u16 pkey);
-struct net_device * __ipoib_vlan_delete(struct net_device *pdev, u16 pkey);
+int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv,
+		     u16 pkey, u16 child_index);
+struct net_device * __ipoib_vlan_delete(struct net_device *pdev, u16 pkey,
+					u16 child_index);
 
 int  __init ipoib_netlink_init(void);
 void __exit ipoib_netlink_fini(void);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_netlink.c b/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
index 2b8905e..d902e04 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
@@ -38,6 +38,7 @@
 enum {
 	IFLA_IPOIB_UNSPEC,
 	IFLA_IPOIB_CHILD_PKEY,
+	IFLA_IPOIB_CHILD_INDEX,
 	__IFLA_IPOIB_MAX
 };
 
@@ -45,6 +46,7 @@ enum {
 
 static const struct nla_policy ipoib_policy[IFLA_IPOIB_MAX + 1] = {
 	[IFLA_IPOIB_CHILD_PKEY]		= { .type = NLA_U16 },
+	[IFLA_IPOIB_CHILD_INDEX]	= { .type = NLA_U16 },
 };
 
 static int ipoib_new_child_link(struct net *src_net, struct net_device *dev,
@@ -52,7 +54,7 @@ static int ipoib_new_child_link(struct net *src_net, struct net_device *dev,
 {
 	struct net_device *pdev;
 	struct ipoib_dev_priv *ppriv;
-	u16 child_pkey;
+	u16 child_pkey, child_index;
 	int err;
 
 	if (!tb[IFLA_LINK])
@@ -64,12 +66,18 @@ static int ipoib_new_child_link(struct net *src_net, struct net_device *dev,
 	ppriv = netdev_priv(pdev);
 
 	if (!data || !data[IFLA_IPOIB_CHILD_PKEY]) {
-		ipoib_warn(ppriv, "no pkey specified, failing request\n");
-		return -EINVAL;
+		ipoib_warn(ppriv, "no pkey specified, using parent pkey\n");
+		child_pkey = ppriv->pkey;
 	} else
 		child_pkey  = nla_get_u16(data[IFLA_IPOIB_CHILD_PKEY]);
 
-	err = __ipoib_vlan_add(ppriv, netdev_priv(dev), child_pkey);
+	if (!data || !data[IFLA_IPOIB_CHILD_INDEX]) {
+		ipoib_warn(ppriv, "no index specified, using 0\n");
+		child_index = 0;
+	} else
+		child_index  = nla_get_u16(data[IFLA_IPOIB_CHILD_INDEX]);
+
+	err = __ipoib_vlan_add(ppriv, netdev_priv(dev), child_pkey, child_index);
 
 	return err;
 }
@@ -78,12 +86,13 @@ static void ipoib_unregister_child_dev(struct net_device *dev, struct list_head
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	(void)__ipoib_vlan_delete(priv->parent, priv->pkey);
+	(void)__ipoib_vlan_delete(priv->parent, priv->pkey, priv->child_index);
 }
 
 static size_t ipoib_get_size(const struct net_device *dev)
 {
-	return nla_total_size(2);	/* IFLA_IPOIB_CHILD_PKEY */
+	return nla_total_size(2) +	/* IFLA_IPOIB_CHILD_PKEY  */
+		nla_total_size(2);	/* IFLA_IPOIB_CHILD_INDEX */
 }
 
 static struct rtnl_link_ops ipoib_link_ops __read_mostly = {
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index 4a9b7e6..5f6e07d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -51,7 +51,7 @@ static DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL);
 
 
 int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv,
-		     u16 pkey)
+		     u16 pkey, u16 child_index)
 {
 	int result;
 	struct ipoib_dev_priv *tpriv;
@@ -60,15 +60,16 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv,
 
 	/*
 	 * First ensure this isn't a duplicate. We check the parent device and
-	 * then all of the child interfaces to make sure the Pkey doesn't match.
+	 * then all of the child interfaces to make sure the Pkey AND the child
+	 * index don't match.
 	 */
-	if (ppriv->pkey == pkey) {
+	if (ppriv->pkey == pkey && ppriv->child_index == child_index) {
 		result = -ENOTUNIQ;
 		goto err;
 	}
 
 	list_for_each_entry(tpriv, &ppriv->child_intfs, list) {
-		if (tpriv->pkey == pkey) {
+		if (tpriv->pkey == pkey && tpriv->child_index == child_index) {
 			result = -ENOTUNIQ;
 			goto err;
 		}
@@ -85,6 +86,7 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv,
 		goto err;
 
 	priv->pkey = pkey;
+	priv->child_index = child_index;
 
 	memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN);
 	priv->dev->broadcast[8] = pkey >> 8;
@@ -156,7 +158,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 	if (!rtnl_trylock())
 		return restart_syscall();
 
-	result = __ipoib_vlan_add(ppriv, priv, pkey);
+	result = __ipoib_vlan_add(ppriv, priv, pkey, 0);
 
 	if (result)
 		free_netdev(priv->dev);
@@ -166,7 +168,8 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 	return result;
 }
 
-struct net_device * __ipoib_vlan_delete(struct net_device *pdev, u16 pkey)
+struct net_device * __ipoib_vlan_delete(struct net_device *pdev, u16 pkey,
+					u16 child_index)
 {
 	struct ipoib_dev_priv *ppriv, *priv, *tpriv;
 	struct net_device *dev = NULL;
@@ -175,7 +178,7 @@ struct net_device * __ipoib_vlan_delete(struct net_device *pdev, u16 pkey)
 
 	mutex_lock(&ppriv->vlan_mutex);
 	list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) {
-		if (priv->pkey == pkey) {
+		if (priv->pkey == pkey && priv->child_index == child_index) {
 			unregister_netdevice(priv->dev);
 			ipoib_dev_cleanup(priv->dev);
 			list_del(&priv->list);
@@ -197,7 +200,7 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey)
 
 	if (!rtnl_trylock())
 		return restart_syscall();
-	dev = __ipoib_vlan_delete(pdev, pkey);
+	dev = __ipoib_vlan_delete(pdev, pkey, 0);
 	rtnl_unlock();
 
 	if (dev) {
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 03/12] include/linux: Add private flags for IPoIB interfaces
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 01/12] IB/ipoib: Add rtnl_link_ops support Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 02/12] IB/ipoib: Add support for clones / multiple childs on the same partition Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 04/12] IB/ipoib: Add support for acting as VIF Or Gerlitz
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

The new 2 bits indicates whenever a device is considered PIF interface,
which means the "main" interfaces (ib0, ib1 etc), or cloned interfaces
(ib0.1, ib1.2 etc.) that is now in use by the eIPoIB driver.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 include/linux/if.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/if.h b/include/linux/if.h
index 1ec407b..f50dbf2 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -84,6 +84,8 @@
 #define IFF_LIVE_ADDR_CHANGE 0x100000	/* device supports hardware address
 					 * change when it's running */
 
+#define IFF_EIPOIB_PIF  0x200000       /* IPoIB PIF intf (ib0, ib1 etc.) */
+#define IFF_EIPOIB_VIF  0x400000       /* IPoIB VIF intf (ib0.x, ib1.x etc.) */
 
 #define IF_GET_IFACE	0x0001		/* for querying only */
 #define IF_GET_PROTO	0x0002
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 04/12] IB/ipoib: Add support for acting as VIF
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
                   ` (2 preceding siblings ...)
  2012-08-01 17:09 ` [PATCH V2 03/12] include/linux: Add private flags for IPoIB interfaces Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 05/12] net: Add ndo_set_vif_param operation to serve eIPoIB VIFs Or Gerlitz
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

When IPoIB interface acts as a VIF for an eIPoIB interface, it uses
the skb cb storage area on the RX flow, to place information which
can be of use to the upper layer device.

One such usage example, is when an eIPoIB inteface needs to generate
a source mac for incoming Ethernet frames.

The IPoIB code checks the VIF private flag on the RX path, and according
to the value of the flag prepares the skb CB data, etc.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h      |    5 +++
 drivers/infiniband/ulp/ipoib/ipoib_cm.c   |    9 +++++
 drivers/infiniband/ulp/ipoib/ipoib_ib.c   |    8 ++++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   21 +++++++++++
 include/rdma/e_ipoib.h                    |   54 +++++++++++++++++++++++++++++
 5 files changed, 96 insertions(+), 1 deletions(-)
 create mode 100644 include/rdma/e_ipoib.h

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 0df6668..3235033 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -52,6 +52,7 @@
 #include <rdma/ib_pack.h>
 #include <rdma/ib_sa.h>
 #include <linux/sched.h>
+#include <rdma/e_ipoib.h>
 
 /* constants */
 
@@ -209,6 +210,7 @@ struct ipoib_cm_rx {
 	unsigned long		jiffies;
 	enum ipoib_cm_state	state;
 	int			recv_count;
+	u32			qpn;
 };
 
 struct ipoib_cm_tx {
@@ -703,6 +705,9 @@ extern int ipoib_recvq_size;
 
 extern struct ib_sa_client ipoib_sa_client;
 
+void set_skb_oob_cb_data(struct sk_buff *skb, struct ib_wc *wc,
+			 struct napi_struct *napi);
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 extern int ipoib_debug_level;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 6d66ab0..c561956 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -440,6 +440,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	struct net_device *dev = cm_id->context;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_cm_rx *p;
+	struct ipoib_cm_data *data = event->private_data;
 	unsigned psn;
 	int ret;
 
@@ -452,6 +453,10 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	cm_id->context = p;
 	p->state = IPOIB_CM_RX_LIVE;
 	p->jiffies = jiffies;
+
+	/* used to keep track of base qpn in CM mode */
+	p->qpn = be32_to_cpu(data->qpn);
+
 	INIT_LIST_HEAD(&p->list);
 
 	p->qp = ipoib_cm_create_rx_qp(dev, p);
@@ -669,6 +674,10 @@ copied:
 	skb->dev = dev;
 	/* XXX get correct PACKET_ type here */
 	skb->pkt_type = PACKET_HOST;
+	/* if handler is registered on top of ipoib, set skb oob data. */
+	if (skb->dev->priv_flags & IFF_EIPOIB_VIF)
+		set_skb_oob_cb_data(skb, wc, NULL);
+
 	netif_receive_skb(skb);
 
 repost:
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index f10221f..f248e6e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -304,7 +304,13 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 			likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-	napi_gro_receive(&priv->napi, skb);
+	/* if handler is registered on top of ipoib, set skb oob data */
+	if (dev->priv_flags & IFF_EIPOIB_VIF) {
+		set_skb_oob_cb_data(skb, wc, &priv->napi);
+		/* the registered handler will take care of the skb */
+		netif_receive_skb(skb);
+	} else
+		napi_gro_receive(&priv->napi, skb);
 
 repost:
 	if (unlikely(ipoib_ib_post_receive(dev, wr_id)))
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 637980b..47034b4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -91,6 +91,24 @@ static struct ib_client ipoib_client = {
 	.remove = ipoib_remove_one
 };
 
+void set_skb_oob_cb_data(struct sk_buff *skb, struct ib_wc *wc,
+			 struct napi_struct *napi)
+{
+	struct ipoib_cm_rx *p_cm_ctx = NULL;
+	struct eipoib_cb_data *data = NULL;
+
+	p_cm_ctx = wc->qp->qp_context;
+	data = IPOIB_HANDLER_CB(skb);
+
+	data->rx.slid = wc->slid;
+	data->rx.sqpn = wc->src_qp;
+	data->rx.napi = napi;
+
+	/* in CM mode, use the "base" qpn as sqpn */
+	if (p_cm_ctx)
+		data->rx.sqpn = p_cm_ctx->qpn;
+}
+
 int ipoib_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -1253,6 +1271,9 @@ static struct net_device *ipoib_add_port(const char *format,
 		goto event_failed;
 	}
 
+	/* indicates pif port */
+	priv->dev->priv_flags |= IFF_EIPOIB_PIF;
+
 	result = register_netdev(priv->dev);
 	if (result) {
 		printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n",
diff --git a/include/rdma/e_ipoib.h b/include/rdma/e_ipoib.h
new file mode 100644
index 0000000..7249334
--- /dev/null
+++ b/include/rdma/e_ipoib.h
@@ -0,0 +1,54 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_ETH_IB_IPOIB_H
+#define _LINUX_ETH_IB_IPOIB_H
+
+#include <net/sch_generic.h>
+
+struct eipoib_cb_data {
+	/*
+	 * extra care taken not to collide with the usage done
+	 * by the qdisc layer in struct skb cb data.
+	 */
+	struct qdisc_skb_cb	qdisc_cb;
+	struct { /* must be <= 20 bytes */
+		u32 sqpn;
+		struct napi_struct *napi;
+		u16 slid;
+		u8 data[6];
+	} __packed rx;
+};
+
+#define IPOIB_HANDLER_CB(skb) ((struct eipoib_cb_data *)(skb)->cb)
+
+#endif /* _LINUX_ETH_IB_IPOIB_H */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 05/12] net: Add ndo_set_vif_param operation to serve eIPoIB VIFs
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
                   ` (3 preceding siblings ...)
  2012-08-01 17:09 ` [PATCH V2 04/12] IB/ipoib: Add support for acting as VIF Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-02  0:17   ` Ben Hutchings
  2012-08-01 17:09 ` [PATCH V2 06/12] net/core: Add rtnetlink support to vif parameters Or Gerlitz
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

The Ethernet IPoIB driver enslaves IPoIB devices and uses them as
VIFs (Virtual Interface) which serve an Ethernet NIC e.g present in a 
guest OS. For each such slave that acts as a VIF, eIPoIB needs to know 
the mac and optionally the vlan uses by that NIC, the new ndo opertaion 
is used to associate the mac/vlan for that slave.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 include/linux/netdevice.h |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index eb06e58..2709551 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1013,7 +1013,10 @@ struct net_device_ops {
 						    netdev_features_t features);
 	int			(*ndo_neigh_construct)(struct neighbour *n);
 	void			(*ndo_neigh_destroy)(struct neighbour *n);
-
+	int			(*ndo_set_vif_param) (struct net_device *dev,
+						      struct net_device *vif,
+						      u16 vlan,
+						      u8 *mac);
 	int			(*ndo_fdb_add)(struct ndmsg *ndm,
 					       struct net_device *dev,
 					       unsigned char *addr,
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 06/12] net/core: Add rtnetlink support to vif parameters
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
                   ` (4 preceding siblings ...)
  2012-08-01 17:09 ` [PATCH V2 05/12] net: Add ndo_set_vif_param operation to serve eIPoIB VIFs Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-02  0:20   ` Ben Hutchings
  2012-08-01 17:09 ` [PATCH V2 07/12] net/eipoib: Add private header file Or Gerlitz
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

Add IFLA_VIF_INFO which allows user space to configure mac/vlan for a VIF

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 include/linux/if_link.h |   16 ++++++++++++++++
 net/core/rtnetlink.c    |   42 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index ac173bd..2a401f3 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -142,6 +142,7 @@ enum {
 #define IFLA_PROMISCUITY IFLA_PROMISCUITY
 	IFLA_NUM_TX_QUEUES,
 	IFLA_NUM_RX_QUEUES,
+	IFLA_VIFINFO_LIST,/* config virtual interface, gives it mac/vlan */
 	__IFLA_MAX
 };
 
@@ -398,4 +399,19 @@ struct ifla_port_vsi {
 	__u8 pad[3];
 };
 
+/* support vif configuration. */
+enum {
+	IFLA_VIF_INFO_UNSPEC,
+	IFLA_VIF_INFO,
+	__IFLA_VIF_INFO_MAX,
+};
+
+#define IFLA_VIF_INFO_MAX (__IFLA_VIF_INFO_MAX - 1)
+
+struct ifla_vif_param {
+	__u32 vif;
+	__u8  mac[32]; /* MAX_ADDR_LEN */
+	__u16 vlan;
+};
+
 #endif /* _LINUX_IF_LINK_H */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 334b930..751e55d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -780,7 +780,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_vfinfo_size(dev, ext_filter_mask) /* IFLA_VFINFO_LIST */
 	       + rtnl_port_size(dev) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
 	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
-	       + rtnl_link_get_af_size(dev); /* IFLA_AF_SPEC */
+	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
+	       + nla_total_size(sizeof(struct ifla_vif_param)); /* IFLA_VIF_INFO */
 }
 
 static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
@@ -1114,6 +1115,7 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PROMISCUITY]	= { .type = NLA_U32 },
 	[IFLA_NUM_TX_QUEUES]	= { .type = NLA_U32 },
 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
+	[IFLA_VIFINFO_LIST]     = {. type = NLA_NESTED },
 };
 EXPORT_SYMBOL(ifla_policy);
 
@@ -1137,6 +1139,11 @@ static const struct nla_policy ifla_vf_policy[IFLA_VF_MAX+1] = {
 				    .len = sizeof(struct ifla_vf_spoofchk) },
 };
 
+static const struct nla_policy ifla_vifinfo_policy[IFLA_VIF_INFO_MAX+1] = {
+	[IFLA_VIF_INFO]		= { .type = NLA_BINARY,
+				    .len = sizeof(struct ifla_vif_param)},
+};
+
 static const struct nla_policy ifla_port_policy[IFLA_PORT_MAX+1] = {
 	[IFLA_PORT_VF]		= { .type = NLA_U32 },
 	[IFLA_PORT_PROFILE]	= { .type = NLA_STRING,
@@ -1258,6 +1265,22 @@ static int do_setvfinfo(struct net_device *dev, struct nlattr *attr)
 	return err;
 }
 
+static int do_setvifinfo(struct net_device *dev, struct ifla_vif_param *ivp)
+{
+	int  err = -EINVAL;
+	struct net_device *vif_dev = __dev_get_by_index(dev_net(dev),
+							    ivp->vif);
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_set_vif_param)
+		err = ops->ndo_set_vif_param(dev,
+					     vif_dev,
+					     ivp->vlan,
+					     ivp->mac);
+
+	return err;
+}
+
 static int do_set_master(struct net_device *dev, int ifindex)
 {
 	struct net_device *master_dev;
@@ -1449,8 +1472,25 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
 			modified = 1;
 		}
 	}
+
 	err = 0;
+	if (tb[IFLA_VIFINFO_LIST]) {
+		struct nlattr *vif_param_info[IFLA_VIF_INFO_MAX + 1];
+		struct ifla_vif_param ivp;
+		err = nla_parse_nested(vif_param_info, IFLA_VIF_INFO_MAX,
+				       tb[IFLA_VIFINFO_LIST], ifla_vifinfo_policy);
+		if (err < 0)
+			return err;
+		if (vif_param_info[IFLA_VIF_INFO]) {
+			nla_memcpy(&ivp, vif_param_info[IFLA_VIF_INFO], sizeof(ivp));
+			err = do_setvifinfo(dev, &ivp);
+			if (err < 0)
+				goto errout;
+			modified = 1;
+		}
+	}
 
+	err = 0;
 	if (tb[IFLA_VF_PORTS]) {
 		struct nlattr *port[IFLA_PORT_MAX+1];
 		struct nlattr *attr;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 07/12] net/eipoib: Add private header file
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
                   ` (5 preceding siblings ...)
  2012-08-01 17:09 ` [PATCH V2 06/12] net/core: Add rtnetlink support to vif parameters Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 08/12] net/eipoib: Add ethtool file support Or Gerlitz
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

The header file includes all structures, macros and non-static
functions which are of use by the driver.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/eipoib/eth_ipoib.h |  215 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 215 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib.h

diff --git a/drivers/net/eipoib/eth_ipoib.h b/drivers/net/eipoib/eth_ipoib.h
new file mode 100644
index 0000000..9a23f7d
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib.h
@@ -0,0 +1,215 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_ETH_IPOIB_H
+#define _LINUX_ETH_IPOIB_H
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <net/arp.h>
+#include <linux/if_vlan.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <linux/if_infiniband.h>
+#include <rdma/ib_verbs.h>
+
+#include <rdma/e_ipoib.h>
+
+/* macros and definitions */
+#define DRV_VERSION		"1.0.0"
+#define DRV_RELDATE		"June 1, 2012"
+#define DRV_NAME		"eth_ipoib"
+#define SDRV_NAME		"ipoib"
+#define DRV_DESCRIPTION		"IP-over-InfiniBand Para Virtualized Driver"
+#define EIPOIB_ABI_VER	1
+
+#undef  pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#define GID_LEN			16
+#define GUID_LEN		8
+
+#define PARENT_VLAN_FEATURES (NETIF_F_HW_VLAN_RX | NETIF_F_HW_VLAN_TX)
+
+#define parent_for_each_slave(_parent, slave)		\
+		list_for_each_entry(slave, &(_parent)->slave_list, list)\
+
+#define PARENT_IS_OK(_parent)				\
+		(((_parent)->dev->flags & IFF_UP) &&	\
+		 netif_running((_parent)->dev)    &&	\
+		 ((_parent)->slave_cnt > 0))
+
+#define IS_E_IPOIB_PROTO(_proto)			\
+		 (((_proto) == htons(ETH_P_ARP)) ||	\
+		 ((_proto) == htons(ETH_P_RARP)) ||	\
+		 ((_proto) == htons(ETH_P_IP)))
+
+enum eipoib_emac_guest_info {
+	VALID,
+	MIGRATED_OUT,
+	INVALID,
+};
+
+/* structs */
+struct eth_arp_data {
+	u8 arp_sha[ETH_ALEN];
+	__be32 arp_sip;
+	u8 arp_dha[ETH_ALEN];
+	__be32 arp_dip;
+} __packed;
+
+struct ipoib_arp_data {
+	u8 arp_sha[INFINIBAND_ALEN];
+	__be32 arp_sip;
+	u8 arp_dha[INFINIBAND_ALEN];
+	__be32 arp_dip;
+} __packed;
+
+/* live migration support structures: */
+struct ip_member {
+	__be32 ip;
+	struct list_head list;
+};
+
+/*
+ * for each slave (emac) saves all the ip over that mac.
+ * the parent keeps that list for live migration.
+ */
+struct guest_emac_info {
+	u8 emac[ETH_ALEN];
+	u16 vlan;
+	struct list_head ip_list;
+	struct list_head list;
+	enum eipoib_emac_guest_info rec_state;
+	int num_of_retries;
+};
+
+struct neigh {
+	struct list_head list;
+	u8 emac[ETH_ALEN];
+	u8 imac[INFINIBAND_ALEN];
+	/* this part is used for neigh_add_list */
+	char cmd[PAGE_SIZE];
+};
+
+struct slave {
+	struct net_device *dev;
+	struct list_head list;
+	u16    pkey;
+	u16    vlan;
+	u8     emac[ETH_ALEN];
+	u8     imac[INFINIBAND_ALEN];
+	struct list_head neigh_list;
+};
+
+struct port_stats {
+	/* update PORT_STATS_LEN (number of stat fields)accordingly */
+	uint64_t tx_parent_dropped;
+	uint64_t tx_vif_miss;
+	uint64_t tx_neigh_miss;
+	uint64_t tx_vlan;
+	uint64_t tx_shared;
+	uint64_t tx_proto_errors;
+	uint64_t tx_skb_errors;
+	uint64_t tx_slave_err;
+
+	uint64_t rx_parent_dropped;
+	uint64_t rx_vif_miss;
+	uint64_t rx_neigh_miss;
+	uint64_t rx_vlan;
+	uint64_t rx_shared;
+	uint64_t rx_proto_errors;
+	uint64_t rx_skb_errors;
+	uint64_t rx_slave_err;
+};
+
+struct parent {
+	struct   net_device *dev;
+	int      index;
+	struct   neigh_parms nparms;
+	struct   list_head slave_list;
+	/* never change this value outside the attach/detach wrappers */
+	s32      slave_cnt;
+	rwlock_t lock;
+	struct   port_stats port_stats;
+	struct   list_head parent_list;
+	u16      flags;
+	struct   workqueue_struct *wq;
+	s8       kill_timers;
+	struct   delayed_work neigh_learn_work;
+	struct   delayed_work vif_learn_work;
+	struct   list_head neigh_add_list;
+	union    ib_gid gid;
+	char     ipoib_main_interface[IFNAMSIZ];
+	/* live migration and bonding support */
+	struct   list_head emac_ip_list;
+	struct   delayed_work emac_ip_work;
+	struct   delayed_work migrate_out_work;
+};
+
+#define eipoib_slave_get_rcu(dev) \
+	((struct slave *) rcu_dereference(dev->rx_handler_data))
+
+/* name space support for sys/fs */
+struct eipoib_net {
+	struct net	*net;	/* Associated network namespace */
+	struct class_attribute class_attr_eipoib_interfaces;
+};
+
+/* exported from main.c */
+extern int eipoib_net_id;
+extern struct list_head parent_dev_list;
+
+/* functions prototypes */
+int mod_create_sysfs(struct eipoib_net *eipoib_n);
+void mod_destroy_sysfs(struct eipoib_net *eipoib_n);
+void parent_destroy_sysfs_entry(struct parent *parent);
+int parent_create_sysfs_entry(struct parent *parent);
+int create_slave_symlinks(struct net_device *master,
+			  struct net_device *slave);
+void destroy_slave_symlinks(struct net_device *master,
+			    struct net_device *slave);
+int parent_enslave(struct net_device *parent_dev,
+		   struct net_device *slave_dev);
+int parent_release_slave(struct net_device *parent_dev,
+			 struct net_device *slave_dev);
+struct neigh *parent_get_neigh_cmd(char op, char *ifname,
+				   u8 *remac, u8 *rimac);
+struct slave *parent_get_vif_cmd(char op, char *ifname, u8 *lemac);
+ssize_t __parent_store_neighs(struct device *d,
+			      struct device_attribute *attr,
+			      const char *buffer, size_t count);
+void parent_set_ethtool_ops(struct net_device *dev);
+
+#endif /* _LINUX_ETH_IPOIB_H */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 08/12] net/eipoib: Add ethtool file support
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
                   ` (6 preceding siblings ...)
  2012-08-01 17:09 ` [PATCH V2 07/12] net/eipoib: Add private header file Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-02  0:22   ` Ben Hutchings
  2012-08-01 17:09 ` [PATCH V2 09/12] net/eipoib: Add main driver functionality Or Gerlitz
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

Via ethtool the driver describes its version, ABI version, on what PIF
interface it runs and various statistics.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/eipoib/eth_ipoib_ethtool.c |  114 ++++++++++++++++++++++++++++++++
 1 files changed, 114 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib_ethtool.c

diff --git a/drivers/net/eipoib/eth_ipoib_ethtool.c b/drivers/net/eipoib/eth_ipoib_ethtool.c
new file mode 100644
index 0000000..0731d94
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib_ethtool.c
@@ -0,0 +1,114 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "eth_ipoib.h"
+
+static void parent_ethtool_get_drvinfo(struct net_device *parent_dev,
+				       struct ethtool_drvinfo *drvinfo)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	strncpy(drvinfo->driver, DRV_NAME, 32);
+
+	strncpy(drvinfo->version, DRV_VERSION, 32);
+
+	strncpy(drvinfo->bus_info, parent->ipoib_main_interface,
+		ETHTOOL_BUSINFO_LEN);
+
+	/* indicates ABI version */
+	snprintf(drvinfo->fw_version, 32, "%d", EIPOIB_ABI_VER);
+}
+
+static const char parent_strings[][ETH_GSTRING_LEN] = {
+	/* private statistics */
+	"tx_parent_dropped",
+	"tx_vif_miss",
+	"tx_neigh_miss",
+	"tx_vlan",
+	"tx_shared",
+	"tx_proto_errors",
+	"tx_skb_errors",
+	"tx_slave_err",
+
+	"rx_parent_dropped",
+	"rx_vif_miss",
+	"rx_neigh_miss",
+	"rx_vlan",
+	"rx_shared",
+	"rx_proto_errors",
+	"rx_skb_errors",
+	"rx_slave_err",
+};
+
+#define PORT_STATS_LEN (sizeof(parent_strings) / ETH_GSTRING_LEN)
+
+static void parent_get_strings(struct net_device *parent_dev,
+			       uint32_t stringset, uint8_t *data)
+{
+	if (stringset != ETH_SS_STATS)
+		return;
+	memcpy(data, parent_strings, sizeof(parent_strings));
+}
+
+static void parent_get_ethtool_stats(struct net_device *parent_dev,
+				     struct ethtool_stats *stats,
+				     uint64_t *data)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	read_lock_bh(&parent->lock);
+	memcpy(data, &parent->port_stats, sizeof(parent->port_stats));
+	read_unlock_bh(&parent->lock);
+}
+
+static int parent_get_sset_count(struct net_device *parent_dev, int sset)
+{
+	switch (sset) {
+	case ETH_SS_STATS:
+		return PORT_STATS_LEN;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static const struct ethtool_ops parent_ethtool_ops = {
+	.get_drvinfo		= parent_ethtool_get_drvinfo,
+	.get_strings		= parent_get_strings,
+	.get_ethtool_stats	= parent_get_ethtool_stats,
+	.get_sset_count		= parent_get_sset_count,
+	.get_link		= ethtool_op_get_link,
+};
+
+void parent_set_ethtool_ops(struct net_device *dev)
+{
+	SET_ETHTOOL_OPS(dev, &parent_ethtool_ops);
+}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
                   ` (7 preceding siblings ...)
  2012-08-01 17:09 ` [PATCH V2 08/12] net/eipoib: Add ethtool file support Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-02 17:15   ` Eric W. Biederman
  2012-08-01 17:09 ` [PATCH V2 10/12] net/eipoib: Add sysfs support Or Gerlitz
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

The eipoib driver provides a standard Ethernet netdevice over
the InfiniBand IPoIB interface .

Some services can run only on top of Ethernet L2 interfaces, and cannot be
bound to an IPoIB interface. With this new driver, these services can run
seamlessly.

Main use case of the driver is the Ethernet Virtual Switching used in
virtualized environments, where an eipoib netdevice can be used as a
Physical Interface (PIF) in the hypervisor domain, and allow other
guests Virtual Interfaces (VIF) connected to the same Virtual Switch
to run over the InfiniBand fabric.

This driver supports L2 Switching (Direct Bridging) as well as other L3
Switching modes (e.g. NAT).

Whenever an IPoIB interface is created, one eIPoIB PIF netdevice
will be created. The default naming scheme is as in other Ethernet
interfaces: ethX, for example, on a system with two IPoIB interfaces,
ib0 and ib1, two interfaces will be created ethX and ethX+1 When "X"
is the next free Ethernet number in the system.

Using "ethtool -i " over the new interface can tell on which IPoIB
PIF interface that interface is above.  For example: driver: eth_ipoib:ib0
indicates that eth3 is the Ethernet interface over the ib0 IPoIB interface.

The driver can be used as independent interface or to serve in
virtualization environment as the physical layer for the virtual
interfaces on the virtual guest.

The driver interface (eipoib interface or which is also referred to as parent)
uses slave interfaces, IPoIB clones, which are the VIFs described above.

VIFs interfaces are enslaved/released from the eipoib driver on demand, according
to the management interface provided to user space.

Note: Each ethX interface has at least one ibX.Y slave to serve the PIF
itself, in the VIFs list of ethX you'll notice that ibX.1 is always created
to serve applications running from the Hypervisor on top of ethX interface directly.

For IB applications that require native IPoIB interfaces (e.g. RDMA-CM), the
original ipoib interfaces ibX can still be used.  For example, RDMA-CM and
eth_ipoib drivers can co-exist and make use of IPoIB

Support for Live migration:

The driver expose sysfs interface through which the manager can notify on
migrated out guests ("detached VIF"). The abandoned eIPoIB driver instance
sends ARP requests (defined number) to the network in order to triger the migrated
guest to publish its mac address of the new VIF that hosts it. When the guest
responds to that ARP, the eIPoIB driver on the host that owns that guest, sends
Gratuitous ARP in behalf of that guest such that all the peers on the network
which communicate with this VM are noticed on the new VIF address which is
used by the guest.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/eipoib/eth_ipoib_main.c | 1953 +++++++++++++++++++++++++++++++++++
 1 files changed, 1953 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib_main.c

diff --git a/drivers/net/eipoib/eth_ipoib_main.c b/drivers/net/eipoib/eth_ipoib_main.c
new file mode 100644
index 0000000..0a8e7f4
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib_main.c
@@ -0,0 +1,1953 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "eth_ipoib.h"
+#include <net/ip.h>
+#include <linux/if_link.h>
+
+#define EMAC_IP_GC_TIME (10 * HZ)
+
+#define MIG_OUT_ARP_REQ_ISSUE_TIME (0.5 * HZ)
+
+#define MIG_OUT_MAX_ARP_RETRIES 5
+
+#define LIVE_MIG_PACKET 1
+
+#define PARENT_MAC_MASK 0xe7
+
+/* forward declaration */
+static rx_handler_result_t eipoib_handle_frame(struct sk_buff **pskb);
+static int eipoib_device_event(struct notifier_block *unused,
+			       unsigned long event, void *ptr);
+static void free_ip_mem_in_rec(struct guest_emac_info *emac_info);
+
+static const char * const version =
+	DRV_DESCRIPTION ": v" DRV_VERSION " (" DRV_RELDATE ")\n";
+
+LIST_HEAD(parent_dev_list);
+
+/* name space sys/fs functions */
+int eipoib_net_id __read_mostly;
+
+static int __net_init eipoib_net_init(struct net *net)
+{
+	int rc;
+	struct eipoib_net *eipoib_n = net_generic(net, eipoib_net_id);
+
+	eipoib_n->net = net;
+	rc = mod_create_sysfs(eipoib_n);
+
+	return rc;
+}
+
+static void __net_exit eipoib_net_exit(struct net *net)
+{
+	struct eipoib_net *eipoib_n = net_generic(net, eipoib_net_id);
+
+	mod_destroy_sysfs(eipoib_n);
+}
+
+static struct pernet_operations eipoib_net_ops = {
+	.init = eipoib_net_init,
+	.exit = eipoib_net_exit,
+	.id   = &eipoib_net_id,
+	.size = sizeof(struct eipoib_net),
+};
+
+/* set mac fields emac=<qpn><lid> */
+static inline
+void build_neigh_mac(u8 *_mac, u32 _qpn, u16 _lid)
+{
+	/* _qpn: 3B _lid: 2B */
+	*((__be32 *)(_mac)) = cpu_to_be32(_qpn);
+	*(u8 *)(_mac) = 0x2; /* set LG bit */
+	*(__be16 *)(_mac + sizeof(_qpn)) = cpu_to_be16(_lid);
+}
+
+static inline
+struct slave *get_slave_by_dev(struct parent *parent,
+			       struct net_device *slave_dev)
+{
+	struct slave *slave, *slave_tmp;
+	int found = 0;
+
+	parent_for_each_slave(parent, slave_tmp) {
+		if (slave_tmp->dev == slave_dev) {
+			found = 1;
+			slave = slave_tmp;
+			break;
+		}
+	}
+
+	return found ? slave : NULL;
+}
+
+static inline
+struct slave *get_slave_by_mac_and_vlan(struct parent *parent, u8 *mac,
+					u16 vlan)
+{
+	struct slave *slave, *slave_tmp;
+	int found = 0;
+
+	parent_for_each_slave(parent, slave_tmp) {
+		if ((!memcmp(slave_tmp->emac, mac, ETH_ALEN)) &&
+		    (slave_tmp->vlan == vlan)) {
+			found = 1;
+			slave = slave_tmp;
+			break;
+		}
+	}
+
+	return found ? slave : NULL;
+}
+
+
+static inline
+struct guest_emac_info *get_mac_ip_info_by_mac_and_vlan(struct parent *parent,
+							u8 *mac, u16 vlan)
+{
+	struct guest_emac_info *emac_info, *emac_info_ret;
+	int found = 0;
+
+	list_for_each_entry(emac_info, &parent->emac_ip_list, list) {
+		if ((!memcmp(emac_info->emac, mac, ETH_ALEN)) &&
+		    vlan == emac_info->vlan) {
+			found = 1;
+			emac_info_ret = emac_info;
+			break;
+		}
+	}
+
+	return found ? emac_info_ret : NULL;
+}
+
+/*
+ * searches for the relevant guest_emac_info in the parent.
+ * if found it, check if it contains the required ip
+ * if no such guest_emac_info object or no ip return 0,
+ * otherwise return 1 and if exist set the guest_emac_info obj.
+ */
+static inline
+int is_mac_info_contain_ip(struct parent *parent, u8 *mac, __be32 ip,
+			  struct guest_emac_info *emac_info, u16 vlan)
+{
+	struct ip_member *ipm;
+	int found = 0;
+
+	emac_info = get_mac_ip_info_by_mac_and_vlan(parent, mac, vlan);
+	if (!emac_info)
+		return 0;
+
+	list_for_each_entry(ipm, &emac_info->ip_list, list) {
+		if (ipm->ip == ip) {
+			found = 1;
+			break;
+		}
+	}
+
+	return found;
+}
+
+static inline int netdev_set_parent_master(struct net_device *slave,
+					   struct net_device *master)
+{
+	int err;
+
+	ASSERT_RTNL();
+
+	err = netdev_set_master(slave, master);
+	if (err)
+		return err;
+	if (master) {
+			slave->priv_flags |= IFF_EIPOIB_VIF;
+			/* deny bonding from enslaving it. */;
+			slave->flags |= IFF_SLAVE;
+	} else {
+		slave->priv_flags &= ~(IFF_EIPOIB_VIF);
+		slave->flags &= ~(IFF_SLAVE);
+	}
+
+	return 0;
+}
+
+static inline int is_driver_owner(struct net_device *dev, char *name)
+{
+	struct ethtool_drvinfo drvinfo;
+
+	if (dev->ethtool_ops && dev->ethtool_ops->get_drvinfo) {
+		memset(&drvinfo, 0, sizeof(drvinfo));
+		dev->ethtool_ops->get_drvinfo(dev, &drvinfo);
+		if (!strstr(drvinfo.driver, name))
+			return 0;
+	} else
+		return 0;
+
+	return 1;
+}
+
+static inline int is_parent(struct net_device *dev)
+{
+	return (dev->priv_flags & IFF_EIPOIB_PIF) &&
+		is_driver_owner(dev, DRV_NAME);
+}
+
+static inline int is_parent_mac(struct net_device *dev, u8 *mac)
+{
+	return is_parent(dev) && !memcmp(mac, dev->dev_addr, dev->addr_len);
+}
+
+static inline int __is_slave(struct net_device *dev)
+{
+	return dev->master && is_parent(dev->master);
+}
+
+static inline int is_slave(struct net_device *dev)
+{
+	return (dev->priv_flags & IFF_EIPOIB_VIF) &&
+		is_driver_owner(dev, SDRV_NAME) && __is_slave(dev);
+}
+
+/*
+ * ------------------------------- Link status ------------------
+ * set parent carrier:
+ * link is up if at least one slave has link up
+ * otherwise, bring link down
+ * return 1 if parent carrier changed, zero otherwise
+ */
+static int parent_set_carrier(struct parent *parent)
+{
+	struct slave *slave;
+
+	if (parent->slave_cnt == 0)
+		goto down;
+
+	/* bring parent link up if one slave (at least) is up */
+	parent_for_each_slave(parent, slave) {
+		if (netif_carrier_ok(slave->dev)) {
+			if (!netif_carrier_ok(parent->dev)) {
+				netif_carrier_on(parent->dev);
+				return 1;
+			}
+			return 0;
+		}
+	}
+
+down:
+	if (netif_carrier_ok(parent->dev)) {
+		pr_info("bring down carrier\n");
+		netif_carrier_off(parent->dev);
+		return 1;
+	}
+	return 0;
+}
+
+static int parent_set_mtu(struct parent *parent)
+{
+	struct slave *slave, *f_slave;
+	unsigned int mtu;
+
+	if (parent->slave_cnt == 0)
+		return 0;
+
+	/* find min mtu */
+	f_slave = list_first_entry(&parent->slave_list, struct slave, list);
+	mtu = f_slave->dev->mtu;
+
+	parent_for_each_slave(parent, slave)
+		mtu = min(slave->dev->mtu, mtu);
+
+	if (parent->dev->mtu != mtu) {
+		dev_set_mtu(parent->dev, mtu);
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
+ *--------------------------- slave list handling ------
+ *
+ * This function attaches the slave to the end of list.
+ * pay attention, the caller should held paren->lock
+ */
+static void parent_attach_slave(struct parent *parent,
+				struct slave *new_slave)
+{
+	list_add_tail(&new_slave->list, &parent->slave_list);
+	parent->slave_cnt++;
+}
+
+static void parent_detach_slave(struct parent *parent, struct slave *slave)
+{
+	list_del(&slave->list);
+	parent->slave_cnt--;
+}
+
+static netdev_features_t parent_fix_features(struct net_device *dev,
+					     netdev_features_t features)
+{
+	struct slave *slave;
+	struct parent *parent = netdev_priv(dev);
+	netdev_features_t mask;
+
+	read_lock_bh(&parent->lock);
+
+	mask = features;
+	features &= ~NETIF_F_ONE_FOR_ALL;
+	features |= NETIF_F_ALL_FOR_ALL;
+
+	parent_for_each_slave(parent, slave)
+		features = netdev_increment_features(features,
+						     slave->dev->features,
+						     mask);
+
+	features &= ~NETIF_F_VLAN_CHALLENGED;
+	read_unlock_bh(&parent->lock);
+	return features;
+}
+
+static int parent_compute_features(struct parent *parent)
+{
+	struct net_device *parent_dev = parent->dev;
+	u64 hw_features, features;
+	struct slave *slave;
+
+	if (list_empty(&parent->slave_list))
+		goto done;
+
+	/* starts with the max set of features mask */
+	hw_features = features = ~0LL;
+
+	/* gets the common features from all slaves */
+	parent_for_each_slave(parent, slave) {
+		features &= slave->dev->features;
+		hw_features &= slave->dev->hw_features;
+	}
+
+	features = features | PARENT_VLAN_FEATURES;
+	hw_features = hw_features | PARENT_VLAN_FEATURES;
+
+	hw_features &= ~NETIF_F_VLAN_CHALLENGED;
+	features &= hw_features;
+
+	parent_dev->hw_features = hw_features;
+	parent_dev->features = features;
+	parent_dev->vlan_features = parent_dev->features & ~PARENT_VLAN_FEATURES;
+done:
+	pr_info("%s: %s: Features: 0x%llx\n",
+		__func__, parent_dev->name, parent_dev->features);
+
+	return 0;
+}
+
+static inline u16 slave_get_pkey(struct net_device *dev)
+{
+	u16 pkey = (dev->broadcast[8] << 8) + dev->broadcast[9];
+
+	return pkey;
+}
+
+static void parent_setup_by_slave(struct net_device *parent_dev,
+				  struct net_device *slave_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	const struct net_device_ops *slave_ops = slave_dev->netdev_ops;
+
+	parent_dev->mtu = slave_dev->mtu;
+	parent_dev->hard_header_len = slave_dev->hard_header_len;
+
+	slave_ops->ndo_neigh_setup(slave_dev, &parent->nparms);
+
+}
+
+/* enslave device <slave> to parent device <master> */
+int parent_enslave(struct net_device *parent_dev, struct net_device *slave_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *new_slave = NULL;
+	int old_features = parent_dev->features;
+	int res = 0;
+	/* slave must be claimed by ipoib */
+	if (!is_driver_owner(slave_dev, SDRV_NAME))
+		return -EOPNOTSUPP;
+
+	/* parent must be initialized by parent_open() before enslaving */
+	if (!(parent_dev->flags & IFF_UP)) {
+		pr_warn("%s parent is not up in "
+			"parent_enslave\n",
+			parent_dev->name);
+		return -EPERM;
+	}
+
+	/* already enslaved */
+	if ((slave_dev->flags & IFF_SLAVE) ||
+		(slave_dev->priv_flags & IFF_EIPOIB_VIF)) {
+		pr_err("%s was already enslaved!!!\n", slave_dev->name);
+		return -EBUSY;
+	}
+
+	/* mark it as ipoib clone vif */
+	slave_dev->priv_flags |= IFF_EIPOIB_VIF;
+
+	/* set parent netdev attributes */
+	if (parent->slave_cnt == 0)
+		parent_setup_by_slave(parent_dev, slave_dev);
+	else {
+		/* check netdev attr match */
+		if (slave_dev->hard_header_len != parent_dev->hard_header_len) {
+			pr_err("%s slave %s has different HDR len %d != %d\n",
+			       parent_dev->name, slave_dev->name,
+			       slave_dev->hard_header_len,
+			       parent_dev->hard_header_len);
+			res = -EINVAL;
+			goto err_undo_flags;
+		}
+
+		if (slave_dev->type != ARPHRD_INFINIBAND ||
+		    slave_dev->addr_len != INFINIBAND_ALEN) {
+			pr_err("%s slave type/addr_len is invalid (%d/%d)\n",
+			       parent_dev->name, slave_dev->type,
+			       slave_dev->addr_len);
+			res = -EINVAL;
+			goto err_undo_flags;
+		}
+	}
+	/*
+	 * verfiy that this (slave) device belongs to the relevant PIF
+	 * abort if the name of the slave is not as the regular way in ipoib
+	 */
+	if (!strstr(slave_dev->name, parent->ipoib_main_interface)) {
+		pr_err("%s slave name (%s) doesn't contain parent name (%s) ",
+		       parent_dev->name, slave_dev->name,
+		       parent->ipoib_main_interface);
+		res = -EINVAL;
+		goto err_undo_flags;
+	}
+
+	new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL);
+	if (!new_slave) {
+		res = -ENOMEM;
+		goto err_undo_flags;
+	}
+
+	INIT_LIST_HEAD(&new_slave->neigh_list);
+
+	/* save slave's vlan */
+	new_slave->pkey = slave_get_pkey(slave_dev);
+
+	res = netdev_set_parent_master(slave_dev, parent_dev);
+	if (res) {
+		pr_err("%s %d calling netdev_set_master\n",
+		       slave_dev->name, res);
+		goto err_free;
+	}
+
+	res = dev_open(slave_dev);
+	if (res) {
+		pr_info("open failed %s\n",
+			slave_dev->name);
+		goto err_unset_master;
+	}
+
+	new_slave->dev = slave_dev;
+
+	write_lock_bh(&parent->lock);
+
+	parent_attach_slave(parent, new_slave);
+
+	parent_compute_features(parent);
+
+	write_unlock_bh(&parent->lock);
+
+	read_lock_bh(&parent->lock);
+
+	parent_set_carrier(parent);
+
+	read_unlock_bh(&parent->lock);
+
+	res = create_slave_symlinks(parent_dev, slave_dev);
+	if (res)
+		goto err_close;
+
+	/* register handler */
+	res = netdev_rx_handler_register(slave_dev, eipoib_handle_frame,
+					 new_slave);
+	if (res) {
+		pr_warn("%s %d calling netdev_rx_handler_register\n",
+			parent_dev->name, res);
+		goto err_close;
+	}
+
+	pr_info("%s: enslaving %s\n", parent_dev->name, slave_dev->name);
+
+	/* enslave is successful */
+	return 0;
+
+/* Undo stages on error */
+err_close:
+	dev_close(slave_dev);
+
+err_unset_master:
+	netdev_set_parent_master(slave_dev, NULL);
+
+err_free:
+	kfree(new_slave);
+
+err_undo_flags:
+	parent_dev->features = old_features;
+
+	return res;
+}
+
+static void slave_free(struct parent *parent, struct slave *slave)
+{
+	struct neigh *neigh, *neigh_tmp;
+
+	list_for_each_entry_safe(neigh, neigh_tmp, &slave->neigh_list, list) {
+		list_del(&neigh->list);
+		kfree(neigh);
+	}
+
+	netdev_rx_handler_unregister(slave->dev);
+
+	kfree(slave);
+}
+
+int parent_release_slave(struct net_device *parent_dev,
+			 struct net_device *slave_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *slave;
+	struct guest_emac_info *emac_info;
+
+	/* slave is not a slave or master is not master of this slave */
+	if (!(slave_dev->flags & IFF_SLAVE) ||
+	    (slave_dev->master != parent_dev)) {
+		pr_err("%s cannot release %s.\n",
+		       parent_dev->name, slave_dev->name);
+		return -EINVAL;
+	}
+
+	write_lock_bh(&parent->lock);
+
+	slave = get_slave_by_dev(parent, slave_dev);
+	if (!slave) {
+		/* not a slave of this parent */
+		pr_warn("%s not enslaved %s\n",
+			parent_dev->name, slave_dev->name);
+		write_unlock_bh(&parent->lock);
+		return -EINVAL;
+	}
+
+	pr_info("%s: releasing interface %s\n", parent_dev->name,
+		slave_dev->name);
+
+	/* for live migration, mark its mac_ip record as invalid */
+	emac_info = get_mac_ip_info_by_mac_and_vlan(parent, slave->emac, slave->vlan);
+	if (!emac_info)
+		pr_warn("%s %s didn't find emac: %pM\n",
+			parent_dev->name, slave_dev->name, slave->emac);
+	else {
+		emac_info->rec_state = MIGRATED_OUT;
+		/* start GC work */
+		pr_info("%s: sending clean task for slave mac: %pM\n",
+			__func__, slave->emac);
+		queue_delayed_work(parent->wq, &parent->migrate_out_work, 0);
+		queue_delayed_work(parent->wq, &parent->emac_ip_work,
+				   EMAC_IP_GC_TIME);
+	}
+
+	/* release the slave from its parent */
+	parent_detach_slave(parent, slave);
+
+	parent_compute_features(parent);
+
+	if (parent->slave_cnt == 0)
+		parent_set_carrier(parent);
+
+	write_unlock_bh(&parent->lock);
+
+	/* must do this from outside any spinlocks */
+	destroy_slave_symlinks(parent_dev, slave_dev);
+
+	netdev_set_parent_master(slave_dev, NULL);
+
+	dev_close(slave_dev);
+
+	slave_free(parent, slave);
+
+	return 0;  /* deletion OK */
+}
+
+static int parent_release_all(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *slave, *slave_tmp;
+	struct net_device *slave_dev;
+	struct neigh *neigh_cmd, *neigh_cmd_tmp;
+	struct guest_emac_info *emac_info, *emac_info_tmp;
+	struct slave;
+
+	write_lock_bh(&parent->lock);
+
+	netif_carrier_off(parent_dev);
+
+	if (parent->slave_cnt == 0)
+		goto out;
+
+	list_for_each_entry_safe(slave, slave_tmp, &parent->slave_list, list) {
+		slave_dev = slave->dev;
+
+		/* remove slave from parent's slave-list */
+		parent_detach_slave(parent, slave);
+
+		parent_compute_features(parent);
+
+		write_unlock_bh(&parent->lock);
+
+		destroy_slave_symlinks(parent_dev, slave_dev);
+
+		netdev_set_parent_master(slave_dev, NULL);
+
+		dev_close(slave_dev);
+
+		slave_free(parent, slave);
+
+		write_lock_bh(&parent->lock);
+	}
+
+	list_for_each_entry_safe(neigh_cmd, neigh_cmd_tmp,
+				 &parent->neigh_add_list, list) {
+		list_del(&neigh_cmd->list);
+			kfree(neigh_cmd);
+		}
+
+	list_for_each_entry_safe(emac_info, emac_info_tmp,
+				 &parent->emac_ip_list, list) {
+		free_ip_mem_in_rec(emac_info);
+		list_del(&emac_info->list);
+		kfree(emac_info);
+	}
+
+	pr_info("%s: released all slaves\n", parent_dev->name);
+
+out:
+	write_unlock_bh(&parent->lock);
+
+	return 0;
+}
+
+/* -------------------------- Device entry points --------------------------- */
+static struct rtnl_link_stats64 *parent_get_stats(struct net_device *parent_dev,
+						  struct rtnl_link_stats64 *stats)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *slave;
+	struct rtnl_link_stats64 temp;
+
+	memset(stats, 0, sizeof(*stats));
+
+	read_lock_bh(&parent->lock);
+
+	parent_for_each_slave(parent, slave) {
+		const struct rtnl_link_stats64 *sstats =
+			dev_get_stats(slave->dev, &temp);
+
+		stats->rx_packets += sstats->rx_packets;
+		stats->rx_bytes += sstats->rx_bytes;
+		stats->rx_errors += sstats->rx_errors;
+		stats->rx_dropped += sstats->rx_dropped;
+
+		stats->tx_packets += sstats->tx_packets;
+		stats->tx_bytes += sstats->tx_bytes;
+		stats->tx_errors += sstats->tx_errors;
+		stats->tx_dropped += sstats->tx_dropped;
+
+		stats->multicast += sstats->multicast;
+		stats->collisions += sstats->collisions;
+
+		stats->rx_length_errors += sstats->rx_length_errors;
+		stats->rx_over_errors += sstats->rx_over_errors;
+		stats->rx_crc_errors += sstats->rx_crc_errors;
+		stats->rx_frame_errors += sstats->rx_frame_errors;
+		stats->rx_fifo_errors += sstats->rx_fifo_errors;
+		stats->rx_missed_errors += sstats->rx_missed_errors;
+
+		stats->tx_aborted_errors += sstats->tx_aborted_errors;
+		stats->tx_carrier_errors += sstats->tx_carrier_errors;
+		stats->tx_fifo_errors += sstats->tx_fifo_errors;
+		stats->tx_heartbeat_errors += sstats->tx_heartbeat_errors;
+		stats->tx_window_errors += sstats->tx_window_errors;
+	}
+
+	read_unlock_bh(&parent->lock);
+
+	return stats;
+}
+
+/* ---------------------------- Main funcs ---------------------------------- */
+static struct neigh *neigh_cmd_find_by_mac(struct slave *slave, u8 *mac)
+{
+	struct net_device *dev = slave->dev;
+	struct net_device *parent_dev = dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct neigh *neigh;
+	int found = 0;
+
+	list_for_each_entry(neigh, &parent->neigh_add_list, list) {
+		if (!memcmp(neigh->emac, mac, ETH_ALEN)) {
+			found = 1;
+			break;
+		}
+	}
+
+	return found ? neigh : NULL;
+}
+
+static struct neigh *neigh_find_by_mac(struct slave *slave, u8 *mac)
+{
+	struct neigh *neigh;
+	int found = 0;
+
+	list_for_each_entry(neigh, &slave->neigh_list, list) {
+		if (!memcmp(neigh->emac, mac, ETH_ALEN)) {
+			found = 1;
+			break;
+		}
+	}
+
+	return found ? neigh : NULL;
+}
+
+static int neigh_learn(struct slave *slave, struct sk_buff *skb, u8 *remac)
+{
+	struct net_device *dev = slave->dev;
+	struct net_device *parent_dev = dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct neigh *neigh_cmd;
+	u8 *rimac;
+	int rc;
+
+	/* linearize to easy on reading the arp payload */
+	rc = skb_linearize(skb);
+	if (rc) {
+		pr_err("%s: skb_linearize failed rc %d\n", dev->name, rc);
+		goto out;
+	} else
+		rimac = skb->data + sizeof(struct arphdr);
+
+	/* check if entry is being processed or already exists */
+	if (neigh_find_by_mac(slave, remac))
+		goto out;
+
+	if (neigh_cmd_find_by_mac(slave, remac))
+		goto out;
+
+	neigh_cmd = parent_get_neigh_cmd('+', slave->dev->name, remac, rimac);
+	if (!neigh_cmd) {
+		pr_err("%s cannot build neigh cmd\n", slave->dev->name);
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	list_add_tail(&neigh_cmd->list, &parent->neigh_add_list);
+
+	/* calls neigh_learn_task() */
+	queue_delayed_work(parent->wq, &parent->neigh_learn_work, 0);
+
+out:
+	return rc;
+}
+
+static void neigh_learn_task(struct work_struct *work)
+{
+	struct parent *parent = container_of(work, struct parent,
+					     neigh_learn_work.work);
+	struct neigh *neigh_cmd, *neigh_cmd_tmp;
+
+	write_lock_bh(&parent->lock);
+
+	if (parent->kill_timers)
+		goto out;
+
+	list_for_each_entry_safe(neigh_cmd, neigh_cmd_tmp,
+				 &parent->neigh_add_list, list) {
+		__parent_store_neighs(&parent->dev->dev, NULL,
+				      neigh_cmd->cmd, PAGE_SIZE);
+		list_del(&neigh_cmd->list);
+		kfree(neigh_cmd);
+	}
+
+out:
+	write_unlock_bh(&parent->lock);
+	return;
+}
+
+static void parent_work_cancel_all(struct parent *parent)
+{
+	write_lock_bh(&parent->lock);
+	parent->kill_timers = 1;
+	write_unlock_bh(&parent->lock);
+
+	if (delayed_work_pending(&parent->neigh_learn_work))
+		cancel_delayed_work(&parent->neigh_learn_work);
+
+	if (delayed_work_pending(&parent->emac_ip_work))
+		cancel_delayed_work(&parent->emac_ip_work);
+
+	if (delayed_work_pending(&parent->migrate_out_work))
+		cancel_delayed_work(&parent->migrate_out_work);
+}
+
+static struct parent *get_parent_by_pif_name(char *pif_name)
+{
+	struct parent *parent, *nxt;
+
+	list_for_each_entry_safe(parent, nxt, &parent_dev_list, parent_list) {
+		if (!strcmp(parent->ipoib_main_interface, pif_name))
+			return parent;
+	}
+	return NULL;
+}
+
+static void free_ip_mem_in_rec(struct guest_emac_info *emac_info)
+{
+	struct ip_member *ipm, *tmp_ipm;
+	list_for_each_entry_safe(ipm, tmp_ipm, &emac_info->ip_list, list) {
+		list_del(&ipm->list);
+		kfree(ipm);
+	}
+}
+
+static inline void free_invalid_emac_ip_det(struct parent *parent)
+{
+	struct guest_emac_info *emac_info, *emac_info_tmp;
+
+	list_for_each_entry_safe(emac_info, emac_info_tmp,
+				 &parent->emac_ip_list, list) {
+		if (emac_info->rec_state == INVALID) {
+			free_ip_mem_in_rec(emac_info);
+			list_del(&emac_info->list);
+			kfree(emac_info);
+		}
+	}
+}
+
+static void emac_info_clean_task(struct work_struct *work)
+{
+	struct parent *parent = container_of(work, struct parent,
+					     emac_ip_work.work);
+
+	write_lock_bh(&parent->lock);
+
+	if (parent->kill_timers)
+		goto out;
+
+	free_invalid_emac_ip_det(parent);
+
+out:
+	write_unlock_bh(&parent->lock);
+	return;
+}
+
+static int migrate_out_gen_arp_req(struct parent *parent, u8 *emac,
+				   u16 vlan)
+{
+	struct guest_emac_info *emac_info;
+	struct ip_member *ipm;
+	struct slave *slave;
+	struct sk_buff *nskb;
+	int ret = 0;
+
+	slave = get_slave_by_mac_and_vlan(parent, parent->dev->dev_addr, vlan);
+	if (unlikely(!slave)) {
+		pr_info("%s: Failed to find parent slave !!! %pM\n",
+			__func__, parent->dev->dev_addr);
+		return -ENODEV;
+	}
+
+	emac_info = get_mac_ip_info_by_mac_and_vlan(parent, emac, vlan);
+
+	if (!emac_info)
+		return 0;
+
+	/* go over all ip's attached to that mac */
+	list_for_each_entry(ipm, &emac_info->ip_list, list) {
+		/* create and send arp request to that ip.*/
+		pr_info("%s: Sending arp For migrate_out event, to %pI4 "
+			"from 0.0.0.0\n", parent->dev->name, &(ipm->ip));
+
+		nskb = arp_create(ARPOP_REQUEST,
+				  ETH_P_ARP,
+				  ipm->ip,
+				  slave->dev,
+				  0,
+				  slave->dev->broadcast,
+				  slave->dev->broadcast,
+				  slave->dev->broadcast);
+		if (nskb)
+			arp_xmit(nskb);
+		else {
+			pr_err("%s: %s failed creating skb\n",
+			       __func__, slave->dev->name);
+			ret = -ENOMEM;
+		}
+	}
+	return ret;
+}
+
+static void migrate_out_work_task(struct work_struct *work)
+{
+	struct parent *parent = container_of(work, struct parent,
+					     migrate_out_work.work);
+	struct guest_emac_info *emac_info;
+	int is_reschedule = 0;
+	int ret;
+
+	write_lock_bh(&parent->lock);
+
+	if (parent->kill_timers)
+		goto out;
+
+	list_for_each_entry(emac_info, &parent->emac_ip_list, list) {
+		if (emac_info->rec_state == MIGRATED_OUT) {
+			if (emac_info->num_of_retries <
+			    MIG_OUT_MAX_ARP_RETRIES) {
+				ret = migrate_out_gen_arp_req(parent, emac_info->emac,
+							      emac_info->vlan);
+				if (ret)
+					pr_err("%s: migrate_out_gen_arp failed: %d\n",
+					       __func__, ret);
+
+				emac_info->num_of_retries =
+					emac_info->num_of_retries + 1;
+				is_reschedule = 1;
+			} else
+				emac_info->rec_state = INVALID;
+		}
+	}
+	/* issue arp request till the device removed that entry from list */
+	if (is_reschedule)
+		queue_delayed_work(parent->wq, &parent->migrate_out_work,
+				   MIG_OUT_ARP_REQ_ISSUE_TIME);
+out:
+	write_unlock_bh(&parent->lock);
+	return;
+}
+
+static inline int add_emac_ip_info(struct net_device *slave_dev, __be32 ip,
+				   u8 *mac, u16 vlan)
+{
+	struct net_device *parent_dev = slave_dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct guest_emac_info *emac_info = NULL;
+	struct ip_member *ipm;
+	int ret;
+	int is_just_alloc_emac_info = 0;
+
+	ret = is_mac_info_contain_ip(parent, mac, ip, emac_info, vlan);
+	if (ret)
+		return 0;
+
+	/* new ip add it to the emc_ip obj */
+	if (!emac_info) {
+		emac_info = kzalloc(sizeof *emac_info, GFP_ATOMIC);
+		if (!emac_info) {
+			pr_err("%s: Failed allocating emac_info\n",
+			       parent_dev->name);
+			return -ENOMEM;
+		}
+		memcpy(emac_info->emac, mac, ETH_ALEN);
+		INIT_LIST_HEAD(&emac_info->ip_list);
+		emac_info->rec_state = VALID;
+		emac_info->vlan = vlan;
+		emac_info->num_of_retries = 0;
+		list_add_tail(&emac_info->list, &parent->emac_ip_list);
+		is_just_alloc_emac_info = 1;
+	}
+
+	ipm = kzalloc(sizeof *ipm, GFP_ATOMIC);
+	if (!ipm) {
+		pr_err(" %s Failed allocating emac_info (ipm)\n",
+		       parent_dev->name);
+		if (is_just_alloc_emac_info)
+			kfree(emac_info);
+		return -ENOMEM;
+	}
+
+	ipm->ip = ip;
+	list_add_tail(&ipm->list, &emac_info->ip_list);
+
+	return 0;
+}
+
+/* build ipoib arp/rarp request/reply packet */
+static struct sk_buff *get_slave_skb_arp(struct slave *slave,
+					 struct sk_buff *skb,
+					 u8 *rimac, int *ret)
+{
+	struct sk_buff *nskb;
+	struct arphdr *arphdr = (struct arphdr *)
+				(skb->data + sizeof(struct ethhdr));
+	struct eth_arp_data *arp_data = (struct eth_arp_data *)
+					(skb->data + sizeof(struct ethhdr) +
+					 sizeof(struct arphdr));
+	u8 t_addr[ETH_ALEN] = {0};
+	int err = 0;
+	/* mark regular packet handling */
+	*ret = 0;
+
+	/*
+	 * live-migration support: keeps the new mac/ip address:
+	 * In that way each driver knows which mac/vlan - IP's where on the
+	 * guests above, whenever migrate_out event comes it will send
+	 * arp request for all these IP's.
+	 */
+	if (skb->protocol == htons(ETH_P_ARP))
+		err = add_emac_ip_info(slave->dev, arp_data->arp_sip,
+				       arp_data->arp_sha, slave->vlan);
+	if (err)
+		pr_warn("%s: Failed creating: emac_ip_info for ip: %pI4",
+			__func__, &arp_data->arp_sip);
+	/*
+	 * live migration support:
+	 * 1.checck if we are in live migration process
+	 * 2.check if the arp response is for the parent
+	 * 3.ignore local-administrated bit, which was set to make sure
+	 *   that the bridge will not drop it.
+	 */
+	arp_data->arp_dha[0] = arp_data->arp_dha[0] & 0xFD;
+	if (htons(ARPOP_REPLY) == (arphdr->ar_op) &&
+	    !memcmp(arp_data->arp_dha, slave->dev->master->dev_addr, ETH_ALEN)) {
+		/*
+		 * when the source is the parent interface, assumes
+		 * that we are in the middle of live migration process,
+		 * so, we will send gratuitous arp.
+		 */
+		pr_info("%s: Arp packet for parent: %s",
+			__func__, slave->dev->master->name);
+		/* create gratuitous ARP on behalf of the guest */
+		nskb = arp_create(ARPOP_REQUEST,
+				  be16_to_cpu(skb->protocol),
+				  arp_data->arp_sip,
+				  slave->dev,
+				  arp_data->arp_sip,
+				  NULL,
+				  slave->dev->dev_addr,
+				  t_addr);
+		if (unlikely(!nskb))
+			pr_err("%s: %s live migration: failed creating skb\n",
+			       __func__, slave->dev->name);
+	} else {
+		nskb = arp_create(be16_to_cpu(arphdr->ar_op),
+				  be16_to_cpu(skb->protocol),
+				  arp_data->arp_dip,
+				  slave->dev,
+				  arp_data->arp_sip,
+				  rimac,
+				  slave->dev->dev_addr,
+				  NULL);
+	}
+
+	return nskb;
+}
+
+/*
+ * build ipoib arp request packet according to ip header.
+ * uses for live-migration, or missing neigh for new vif.
+ */
+static void get_slave_skb_arp_by_ip(struct slave *slave,
+				    struct sk_buff *skb)
+{
+	struct sk_buff *nskb = NULL;
+	struct iphdr *iph = ip_hdr(skb);
+
+	pr_info("Sending arp on behalf of slave %s, from %pI4"
+		" to %pI4" , slave->dev->name, &(iph->saddr),
+		&(iph->daddr));
+
+	nskb = arp_create(ARPOP_REQUEST,
+			  ETH_P_ARP,
+			  iph->daddr,
+			  slave->dev,
+			  iph->saddr,
+			  slave->dev->broadcast,
+			  slave->dev->dev_addr,
+			  NULL);
+	if (nskb)
+		arp_xmit(nskb);
+	else
+		pr_err("%s: %s failed creating skb\n",
+		       __func__, slave->dev->name);
+}
+
+/* build ipoib ipv4/ipv6 packet */
+static struct sk_buff *get_slave_skb_ip(struct slave *slave,
+					struct sk_buff *skb)
+{
+
+	skb_pull(skb, ETH_HLEN);
+	skb_reset_network_header(skb);
+
+	return skb;
+}
+
+/*
+ * get_slave_skb -- called in TX flow
+ * get skb that can be sent thru slave xmit func,
+ * if skb was adjusted (cloned, pulled, etc..) successfully
+ * the old skb (if any) is freed here.
+ */
+static struct sk_buff *get_slave_skb(struct slave *slave, struct sk_buff *skb)
+{
+	struct net_device *dev = slave->dev;
+	struct net_device *parent_dev = dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct sk_buff *nskb = NULL;
+	struct ethhdr *ethh = (struct ethhdr *)(skb->data);
+	struct neigh *neigh = NULL;
+	u8 rimac[INFINIBAND_ALEN];
+	int ret = 0;
+
+	/* set neigh mac */
+	if (is_multicast_ether_addr(ethh->h_dest)) {
+		memcpy(rimac, dev->broadcast, INFINIBAND_ALEN);
+	} else {
+		neigh = neigh_find_by_mac(slave, ethh->h_dest);
+		if (neigh) {
+			memcpy(rimac, neigh->imac, INFINIBAND_ALEN);
+		} else {
+			++parent->port_stats.tx_neigh_miss;
+			/*
+			 * assume VIF migration, tries to get the neigh by
+			 * issue arp request on behalf of the vif.
+			 */
+			if (skb->protocol == htons(ETH_P_IP)) {
+				pr_info("Missed neigh for slave: %s,"
+					"issue ARP request\n",
+					slave->dev->name);
+				get_slave_skb_arp_by_ip(slave, skb);
+				goto out_arp_sent_instead;
+			}
+		}
+	}
+
+	if (skb->protocol == htons(ETH_P_ARP) ||
+	    skb->protocol == htons(ETH_P_RARP)) {
+		nskb = get_slave_skb_arp(slave, skb, rimac, &ret);
+		if (!nskb && LIVE_MIG_PACKET == ret) {
+			pr_info("%s: live migration packets\n", __func__);
+			goto err;
+		}
+	} else {
+		if (!neigh)
+			goto err;
+		/* pull ethernet header here */
+		nskb = get_slave_skb_ip(slave, skb);
+	}
+
+	/* if new skb could not be adjusted/allocated, abort */
+	if (!nskb) {
+		pr_err("%s get_slave_skb_ip/arp failed 0x%x\n",
+		       dev->name, skb->protocol);
+		goto err;
+	}
+
+	if (neigh && nskb == skb) { /* ucast & non-arp/rarp */
+		/* dev_hard_header only for ucast, for arp done already.*/
+		if (dev_hard_header(nskb, dev, ntohs(skb->protocol), rimac,
+				    dev->dev_addr, nskb->len) < 0) {
+			pr_warn("%s: dev_hard_header failed\n",
+				dev->name);
+			goto err;
+		}
+	}
+
+	/*
+	 * new skb is ready to be sent, clean old skb if we hold a clone
+	 * (old skb is not shared, already checked that.)
+	 */
+	if ((nskb != skb))
+		dev_kfree_skb(skb);
+
+	nskb->dev = slave->dev;
+	return nskb;
+
+out_arp_sent_instead:/* whenever sent arp instead of ip packet */
+err:
+	/* got error after nskb was adjusted/allocated */
+	if (nskb && (nskb != skb))
+		dev_kfree_skb(nskb);
+
+	return NULL;
+}
+
+static struct sk_buff *get_parent_skb_arp(struct slave *slave,
+					  struct sk_buff *skb,
+					  u8 *remac)
+{
+	struct net_device *dev = slave->dev->master;
+	struct sk_buff *nskb;
+	struct arphdr *arphdr = (struct arphdr *)(skb->data);
+	struct ipoib_arp_data *arp_data = (struct ipoib_arp_data *)
+					(skb->data + sizeof(struct arphdr));
+	u8 *target_hw = slave->emac;
+	u8 *dst_hw = slave->emac;
+	u8 local_eth_addr[ETH_ALEN];
+
+	/* live migration: gets arp with broadcast src and dst */
+	if (!memcmp(arp_data->arp_sha, slave->dev->broadcast, INFINIBAND_ALEN) &&
+	    !memcmp(arp_data->arp_dha, slave->dev->broadcast, INFINIBAND_ALEN)) {
+		pr_info("%s: ARP with bcast src and dest send from src_hw: %pM\n",
+			__func__, slave->dev->master->dev_addr);
+		/* replace the src with the parent src: */
+		memcpy(local_eth_addr, slave->dev->master->dev_addr, ETH_ALEN);
+		/*
+		 * set local administrated bit,
+		 * that way the bridge will not throws it
+		 */
+		local_eth_addr[0] = local_eth_addr[0] | 0x2;
+		memcpy(remac, local_eth_addr, ETH_ALEN);
+		target_hw = NULL;
+		dst_hw = NULL;
+	}
+
+	nskb = arp_create(be16_to_cpu(arphdr->ar_op),
+			  be16_to_cpu(skb->protocol),
+			  arp_data->arp_dip,
+			  dev,
+			  arp_data->arp_sip,
+			  dst_hw,
+			  remac,
+			  target_hw);
+
+	/* prepare place for the headers. */
+	if (nskb)
+		skb_reserve(nskb, ETH_HLEN);
+
+	return nskb;
+}
+
+static struct sk_buff *get_parent_skb_ip(struct slave *slave,
+					 struct sk_buff *skb)
+{
+	/* nop */
+	return skb;
+}
+
+/* get_parent_skb -- called in RX flow */
+static struct sk_buff *get_parent_skb(struct slave *slave,
+				      struct sk_buff *skb, u8 *remac)
+{
+	struct net_device *dev = slave->dev->master;
+	struct sk_buff *nskb = NULL;
+	struct ethhdr *ethh;
+
+	if (skb->protocol == htons(ETH_P_ARP) ||
+	    skb->protocol == htons(ETH_P_RARP))
+		nskb = get_parent_skb_arp(slave, skb, remac);
+	else
+		nskb = get_parent_skb_ip(slave, skb);
+
+	/* if new skb could not be adjusted/allocated, abort */
+	if (!nskb)
+		goto err;
+
+	/* at this point, we can free old skb if it was cloned */
+	if (nskb && (nskb != skb))
+		dev_kfree_skb(skb);
+
+	skb = nskb;
+
+	/* build ethernet header */
+	ethh = (struct ethhdr *)skb_push(skb, ETH_HLEN);
+	ethh->h_proto = skb->protocol;
+	memcpy(ethh->h_source, remac, ETH_ALEN);
+	memcpy(ethh->h_dest, slave->emac, ETH_ALEN);
+
+	/* zero padding whenever is needed (arp for example).to ETH_ZLEN size */
+	if (unlikely((skb->len < ETH_ZLEN))) {
+		if ((skb->tail + (ETH_ZLEN - skb->len) > skb->end) ||
+		    skb_is_nonlinear(skb))
+			/* nothing */;
+		else
+			memset(skb_put(skb, ETH_ZLEN - skb->len), 0,
+			       ETH_ZLEN - skb->len);
+	}
+
+	/* set new skb fields */
+	skb->pkt_type = PACKET_HOST;
+	/*
+	 * use master dev, to allow netpoll_receive_skb()
+	 * in netif_receive_skb()
+	 */
+	skb->dev = dev;
+
+	/* pull the Ethernet header and update other fields */
+	skb->protocol = eth_type_trans(skb, skb->dev);
+
+	return skb;
+
+err:
+	/* got error after nskb was adjusted/allocated */
+	if (nskb && (nskb != skb))
+		dev_kfree_skb(nskb);
+
+	return NULL;
+}
+
+static int parent_rx(struct sk_buff *skb, struct slave *slave)
+{
+	struct net_device *slave_dev = skb->dev;
+	struct net_device *parent_dev = slave_dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+	struct eipoib_cb_data *data = IPOIB_HANDLER_CB(skb);
+	struct napi_struct *napi =  data->rx.napi;
+	struct sk_buff *nskb;
+	int rc = 0;
+	u8 remac[ETH_ALEN];
+	int vlan_tag;
+
+	build_neigh_mac(remac, data->rx.sqpn, data->rx.slid);
+
+	read_lock_bh(&parent->lock);
+
+	if (unlikely(skb_headroom(skb) < ETH_HLEN)) {
+		pr_warn("%s: small headroom %d < %d\n",
+			skb->dev->name, skb_headroom(skb), ETH_HLEN);
+		++parent->port_stats.rx_skb_errors;
+		goto drop;
+	}
+
+	/* learn neighs based on ARP snooping */
+	if (unlikely(ntohs(skb->protocol) == ETH_P_ARP)) {
+		read_unlock_bh(&parent->lock);
+		write_lock_bh(&parent->lock);
+		neigh_learn(slave, skb, remac);
+		write_unlock_bh(&parent->lock);
+		read_lock_bh(&parent->lock);
+	}
+
+	nskb = get_parent_skb(slave, skb, remac);
+	if (unlikely(!nskb)) {
+		++parent->port_stats.rx_skb_errors;
+		pr_warn("%s: failed to create parent_skb\n",
+			skb->dev->name);
+		goto drop;
+	} else
+		skb = nskb;
+
+	vlan_tag = slave->vlan & 0xfff;
+	if (vlan_tag) {
+		skb = __vlan_hwaccel_put_tag(skb, vlan_tag);
+		if (!skb) {
+			pr_err("%s failed to insert VLAN tag\n",
+			       skb->dev->name);
+			goto drop;
+		}
+		++parent->port_stats.rx_vlan;
+	}
+
+	if (napi)
+		rc = napi_gro_receive(napi, skb);
+	else
+		rc = netif_receive_skb(skb);
+
+	read_unlock_bh(&parent->lock);
+
+	return rc;
+
+drop:
+	dev_kfree_skb_any(skb);
+	read_unlock_bh(&parent->lock);
+
+	return NET_RX_DROP;
+}
+
+static rx_handler_result_t eipoib_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct slave *slave;
+
+	slave = eipoib_slave_get_rcu(skb->dev);
+
+	parent_rx(skb, slave);
+
+	return RX_HANDLER_CONSUMED;
+}
+
+static netdev_tx_t parent_tx(struct sk_buff *skb, struct net_device *dev)
+{
+	struct parent *parent = netdev_priv(dev);
+	struct slave *slave = NULL;
+	struct ethhdr *ethh = (struct ethhdr *)(skb->data);
+	struct sk_buff *nskb;
+	int rc;
+	u16 vlan;
+	u8 mac_no_admin_bit[ETH_ALEN];
+
+	read_lock_bh(&parent->lock);
+
+	if (unlikely(!IS_E_IPOIB_PROTO(ethh->h_proto))) {
+		++parent->port_stats.tx_proto_errors;
+		goto drop;
+	}
+	/* assume: only orphan skb's */
+	if (unlikely(skb_shared(skb))) {
+		++parent->port_stats.tx_shared;
+		goto drop;
+	}
+
+	/* obtain VLAN information if present */
+	if (vlan_tx_tag_present(skb)) {
+		vlan = vlan_tx_tag_get(skb) & 0xfff;
+		++parent->port_stats.tx_vlan;
+	} else {
+		vlan = VLAN_N_VID;
+	}
+
+	/*
+	 * for live migration: mask the admin bit if exists.
+	 * only in ARP packets that came from parent's VIF interface.
+	 */
+	if (unlikely((htons(ETH_P_ARP) == ethh->h_proto) &&
+	    !memcmp(parent->dev->dev_addr + 1, ethh->h_source + 1, ETH_ALEN - 1))) {
+		/* parent's VIF: */
+		memcpy(mac_no_admin_bit, ethh->h_source, ETH_ALEN);
+		mac_no_admin_bit[0] = mac_no_admin_bit[0] & 0xFD;
+		/* get slave, and queue packet */
+		slave = get_slave_by_mac_and_vlan(parent, mac_no_admin_bit, vlan);
+	}
+	/* get slave, and queue packet */
+	if (!slave)
+		slave = get_slave_by_mac_and_vlan(parent, ethh->h_source, vlan);
+	if (unlikely(!slave)) {
+		pr_info("vif: %pM with vlan: %d miss for parent: %s\n",
+			ethh->h_source, vlan, parent->ipoib_main_interface);
+		++parent->port_stats.tx_vif_miss;
+		goto drop;
+	}
+
+	nskb = get_slave_skb(slave, skb);
+	if (unlikely(!nskb)) {
+		++parent->port_stats.tx_skb_errors;
+		goto drop;
+	} else
+		skb = nskb;
+
+	/*
+	 * VST mode: removes the vlan tag in the tx (will add it in the rx)
+	 * the slave is from IPoIB and it is NETIF_F_VLAN_CHALLENGED,
+	 * so must remove the vlan tag.
+	 */
+	if (vlan != VLAN_N_VID)
+		skb->vlan_tci = 0;
+
+	/* arp packets: */
+	if (skb->protocol == htons(ETH_P_ARP) ||
+	    skb->protocol == htons(ETH_P_RARP)) {
+		arp_xmit(skb);
+		goto out;
+	}
+
+	/* ip packets */
+	skb_record_rx_queue(skb, skb_get_queue_mapping(skb));
+
+	rc = dev_queue_xmit(skb);
+	if (unlikely(rc)) {
+		pr_err("slave tx method failed %d\n", rc);
+		++parent->port_stats.tx_slave_err;
+		dev_kfree_skb(skb);
+	}
+
+	goto out;
+
+drop:
+	++parent->port_stats.tx_parent_dropped;
+	dev_kfree_skb(skb);
+
+out:
+	read_unlock_bh(&parent->lock);
+	return NETDEV_TX_OK;
+}
+
+static int parent_open(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	parent->kill_timers = 0;
+	INIT_DELAYED_WORK(&parent->neigh_learn_work, neigh_learn_task);
+	INIT_DELAYED_WORK(&parent->emac_ip_work, emac_info_clean_task);
+	INIT_DELAYED_WORK(&parent->migrate_out_work, migrate_out_work_task);
+	return 0;
+}
+
+static int parent_close(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	write_lock_bh(&parent->lock);
+	parent->kill_timers = 1;
+	write_unlock_bh(&parent->lock);
+
+	cancel_delayed_work(&parent->neigh_learn_work);
+	cancel_delayed_work(&parent->emac_ip_work);
+	cancel_delayed_work(&parent->migrate_out_work);
+
+	return 0;
+}
+
+
+static void parent_deinit(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	list_del(&parent->parent_list);
+
+	parent_work_cancel_all(parent);
+}
+
+static void parent_uninit(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	parent_deinit(parent_dev);
+	parent_destroy_sysfs_entry(parent);
+
+	if (parent->wq)
+		destroy_workqueue(parent->wq);
+}
+
+static struct lock_class_key parent_netdev_xmit_lock_key;
+static struct lock_class_key parent_netdev_addr_lock_key;
+
+static void parent_set_lockdep_class_one(struct net_device *dev,
+					 struct netdev_queue *txq,
+					 void *_unused)
+{
+	lockdep_set_class(&txq->_xmit_lock,
+			  &parent_netdev_xmit_lock_key);
+}
+
+static void parent_set_lockdep_class(struct net_device *dev)
+{
+	lockdep_set_class(&dev->addr_list_lock,
+			  &parent_netdev_addr_lock_key);
+	netdev_for_each_tx_queue(dev, parent_set_lockdep_class_one, NULL);
+}
+
+static int parent_init(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	parent->wq = create_singlethread_workqueue(parent_dev->name);
+	if (!parent->wq)
+		return -ENOMEM;
+
+	parent_set_lockdep_class(parent_dev);
+
+	list_add_tail(&parent->parent_list, &parent_dev_list);
+
+	return 0;
+}
+
+static u16 parent_select_q(struct net_device *dev, struct sk_buff *skb)
+{
+	return skb_tx_hash(dev, skb);
+}
+
+static int parent_add_vif_param(struct net_device *parent_dev,
+				struct net_device *new_vif_dev,
+				u16 vlan, u8 *mac)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	struct slave *new_slave, *slave_tmp;
+	int ret = 0;
+
+	if (!is_valid_ether_addr(mac)) {
+		pr_err("Invalid mac input for slave:%pM \n", mac);
+		return -EINVAL;
+	}
+
+	write_lock_bh(&parent->lock);
+
+	new_slave = get_slave_by_dev(parent, new_vif_dev);
+	if (!new_slave) {
+		pr_err("%s: ERROR no slave:%s.!!!! \n",
+		       __func__, new_vif_dev->name);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!is_zero_ether_addr(new_slave->emac)) {
+		pr_err("slave %s mac already set to %pM\n",
+		       new_slave->dev->name, new_slave->emac);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* check another slave has this mac/vlan */
+	parent_for_each_slave(parent, slave_tmp) {
+		if (!memcmp(slave_tmp->emac, mac, ETH_ALEN) &&
+		    slave_tmp->vlan == new_slave->vlan) {
+			pr_err("cannot update %s, slave %s already has"
+			       " vlan 0x%x mac %pM\n",
+			       parent->dev->name, new_slave->dev->name,
+			       slave_tmp->vlan,
+			       mac);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	/* ready to go */
+	pr_info("slave %s mac is set to %pM, vlan set to: %d\n",
+		new_slave->dev->name, mac, vlan);
+
+	memcpy(new_slave->emac, mac, ETH_ALEN);
+
+	new_slave->vlan = vlan;
+
+out:
+	write_unlock_bh(&parent->lock);
+
+	return ret;
+}
+
+static const struct net_device_ops parent_netdev_ops = {
+	.ndo_init		= parent_init,
+	.ndo_uninit		= parent_uninit,
+	.ndo_open		= parent_open,
+	.ndo_stop		= parent_close,
+	.ndo_start_xmit		= parent_tx,
+	.ndo_select_queue	= parent_select_q,
+	/* parnt mtu is min(slaves_mtus) */
+	.ndo_change_mtu		= NULL,
+	.ndo_fix_features	= parent_fix_features,
+	/*
+	 * initial mac address is randomized, can be changed
+	 * thru this func later
+	 */
+	.ndo_set_mac_address = eth_mac_addr,
+	.ndo_get_stats64 = parent_get_stats,
+	.ndo_add_slave = parent_enslave,
+	.ndo_del_slave = parent_release_slave,
+	.ndo_set_vif_param = parent_add_vif_param,
+};
+
+static void parent_setup(struct net_device *parent_dev)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	/* initialize rwlocks */
+	rwlock_init(&parent->lock);
+
+	/* Initialize pointers */
+	parent->dev = parent_dev;
+	INIT_LIST_HEAD(&parent->neigh_add_list);
+	INIT_LIST_HEAD(&parent->slave_list);
+	INIT_LIST_HEAD(&parent->emac_ip_list);
+	/* Initialize the device entry points */
+	ether_setup(parent_dev);
+	/* parent_dev->hard_header_len is adjusted later */
+	parent_dev->netdev_ops = &parent_netdev_ops;
+	parent_set_ethtool_ops(parent_dev);
+
+	/* Initialize the device options */
+	parent_dev->tx_queue_len = 0;
+	/* mark the parent intf as pif (master of other vifs.) */
+	parent_dev->priv_flags = IFF_EIPOIB_PIF;
+
+	parent_dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM |
+		NETIF_F_RXCSUM | NETIF_F_GRO | NETIF_F_TSO;
+
+	parent_dev->features = parent_dev->hw_features;
+	parent_dev->vlan_features = parent_dev->hw_features;
+
+	parent_dev->features |= PARENT_VLAN_FEATURES;
+}
+
+/*
+ * Create a new parent based on the specified name and parent parameters.
+ * Caller must NOT hold rtnl_lock; we need to release it here before we
+ * set up our sysfs entries.
+ */
+static struct parent *parent_create(struct net_device *ibd)
+{
+	struct net_device *parent_dev;
+	u32 num_queues;
+	int rc;
+	union ib_gid gid;
+	struct parent *parent = NULL;
+	int i, j;
+
+	memcpy(&gid, ibd->dev_addr + 4, sizeof(union ib_gid));
+	num_queues = num_online_cpus();
+	num_queues = roundup_pow_of_two(num_queues);
+
+	parent_dev = alloc_netdev_mq(sizeof(struct parent), "",
+				     parent_setup, num_queues);
+	if (!parent_dev) {
+		pr_err("%s failed to alloc netdev!\n", ibd->name);
+		rc = -ENOMEM;
+		goto out_rtnl;
+	}
+
+	rc = dev_alloc_name(parent_dev, "eth%d");
+	if (rc < 0)
+		goto out_netdev;
+
+	/* eIPoIB interface mac format. */
+	for (i = 0, j = 0; i < 8; i++) {
+		if ((PARENT_MAC_MASK >> i) & 0x1) {
+			if (j < 6) /* only 6 bytes eth address */
+				parent_dev->dev_addr[j] =
+					gid.raw[GUID_LEN + i];
+			j++;
+		}
+	}
+
+	/* assuming that the ibd->dev.parent was alreadey been set. */
+	SET_NETDEV_DEV(parent_dev, ibd->dev.parent);
+
+	rc = register_netdevice(parent_dev);
+	if (rc < 0)
+		goto out_parent;
+
+	dev_net_set(parent_dev, &init_net);
+
+	rc = parent_create_sysfs_entry(netdev_priv(parent_dev));
+	if (rc < 0)
+		goto out_unreg;
+
+	parent = netdev_priv(parent_dev);
+	memcpy(parent->gid.raw, gid.raw, GID_LEN);
+	strncpy(parent->ipoib_main_interface, ibd->name, IFNAMSIZ);
+	parent_dev->dev_id = ibd->dev_id;
+
+	return parent;
+
+out_unreg:
+	unregister_netdevice(parent_dev);
+out_parent:
+	parent_deinit(parent_dev);
+out_netdev:
+	free_netdev(parent_dev);
+out_rtnl:
+	return ERR_PTR(rc);
+}
+
+
+static void parent_free(struct parent *parent)
+{
+	struct net_device *parent_dev = parent->dev;
+
+	parent_work_cancel_all(parent);
+
+	parent_release_all(parent_dev);
+
+	unregister_netdevice(parent_dev);
+}
+
+static void parent_free_all(void)
+{
+	struct parent *parent, *nxt;
+
+	list_for_each_entry_safe(parent, nxt, &parent_dev_list, parent_list)
+		parent_free(parent);
+}
+
+/* netdev events handlers */
+static inline int is_ipoib_pif_intf(struct net_device *dev)
+{
+	if (ARPHRD_INFINIBAND == dev->type && dev->priv_flags & IFF_EIPOIB_PIF)
+		return 1;
+	return 0;
+}
+
+static int parent_event_changename(struct parent *parent)
+{
+	parent_destroy_sysfs_entry(parent);
+
+	parent_create_sysfs_entry(parent);
+
+	return NOTIFY_DONE;
+}
+
+static int parent_master_netdev_event(unsigned long event,
+				      struct net_device *parent_dev)
+{
+	struct parent *event_parent = netdev_priv(parent_dev);
+
+	switch (event) {
+	case NETDEV_CHANGENAME:
+		pr_info("%s: got NETDEV_CHANGENAME event", parent_dev->name);
+		return parent_event_changename(event_parent);
+	default:
+		break;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static int parent_slave_netdev_event(unsigned long event,
+				     struct net_device *slave_dev)
+{
+	struct net_device *parent_dev = slave_dev->master;
+	struct parent *parent = netdev_priv(parent_dev);
+
+	if (!parent_dev) {
+		pr_err("slave:%s has no parent.\n", slave_dev->name);
+		return NOTIFY_DONE;
+	}
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		parent_release_slave(parent_dev, slave_dev);
+		break;
+	case NETDEV_CHANGE:
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+		parent_set_carrier(parent);
+		break;
+	case NETDEV_CHANGEMTU:
+		parent_set_mtu(parent);
+		break;
+	case NETDEV_CHANGENAME:
+		break;
+	case NETDEV_FEAT_CHANGE:
+		parent_compute_features(parent);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static int eipoib_netdev_event(struct notifier_block *this,
+			       unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = (struct net_device *)ptr;
+
+	if (dev_net(event_dev) != &init_net)
+		return NOTIFY_DONE;
+
+	if (is_parent(event_dev))
+		return parent_master_netdev_event(event, event_dev);
+
+	if (is_slave(event_dev))
+		return parent_slave_netdev_event(event, event_dev);
+	/*
+	 * general network device triggers event, check if it is new
+	 * ib interface that we want to enslave.
+	 */
+	return eipoib_device_event(this, event, ptr);
+}
+
+static struct notifier_block parent_netdev_notifier = {
+	.notifier_call = eipoib_netdev_event,
+};
+
+static int eipoib_device_event(struct notifier_block *unused,
+			       unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct parent *parent;
+
+	if (!is_ipoib_pif_intf(dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		parent = parent_create(dev);
+		if (IS_ERR(parent)) {
+			pr_warn("failed to create parent for %s\n",
+				dev->name);
+			break;
+		}
+		break;
+	case NETDEV_UNREGISTER:
+		parent = get_parent_by_pif_name(dev->name);
+		if (parent)
+			parent_free(parent);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static int __init mod_init(void)
+{
+	int rc;
+
+	pr_info(DRV_NAME": %s", version);
+
+	rc = register_pernet_subsys(&eipoib_net_ops);
+	if (rc)
+		goto out;
+
+	rc = register_netdevice_notifier(&parent_netdev_notifier);
+	if (rc) {
+		pr_err("%s failed to register_netdevice_notifier, rc: 0x%x\n",
+		       __func__, rc);
+		goto unreg_subsys;
+	}
+
+	goto out;
+
+unreg_subsys:
+	unregister_pernet_subsys(&eipoib_net_ops);
+out:
+	return rc;
+
+}
+
+static void __exit mod_exit(void)
+{
+	unregister_netdevice_notifier(&parent_netdev_notifier);
+
+	unregister_pernet_subsys(&eipoib_net_ops);
+
+	rtnl_lock();
+	parent_free_all();
+	rtnl_unlock();
+}
+
+module_init(mod_init);
+module_exit(mod_exit);
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION(DRV_VERSION);
+MODULE_DESCRIPTION(DRV_DESCRIPTION ", v" DRV_VERSION);
+MODULE_AUTHOR("Ali Ayoub && Erez Shitrit");
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 10/12] net/eipoib: Add sysfs support
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
                   ` (8 preceding siblings ...)
  2012-08-01 17:09 ` [PATCH V2 09/12] net/eipoib: Add main driver functionality Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 11/12] net/eipoib: Add Makefile, Kconfig and MAINTAINERS entries Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 12/12] IB/ipoib: Add support for transmission of skbs w.o dst/neighbour Or Gerlitz
  11 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

The management interface for the driver was using sysfs. 

While most of this usage (all the "set" operations) was ported to use 
rtnetlink, still some sysfs "show" entries were left in this posting, 
till its clear if/how they can be implemented otherwise.

Here are few sysfs commands that are used in order to get 
information from the driver,

1. get parent's slaves:

	$ cat /sys/class/net/ethZ/eth/slaves

where ethZ is the driver interface

2. see the list of ipoib interfaces enslaved under eipoib interface,

	$ cat /sys/class/net/ethX/eth/vifs

for example:

	$ cat /sys/class/net/eth4/eth/vifs

	SLAVE=ib0.1      MAC=9a:c2:1f:d7:3b:63 VLAN=N/A
	SLAVE=ib0.2      MAC=52:54:00:60:55:88 VLAN=N/A
	SLAVE=ib0.3      MAC=52:54:00:60:55:89 VLAN=N/A

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/eipoib/eth_ipoib_sysfs.c |  435 ++++++++++++++++++++++++++++++++++
 1 files changed, 435 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib_sysfs.c

diff --git a/drivers/net/eipoib/eth_ipoib_sysfs.c b/drivers/net/eipoib/eth_ipoib_sysfs.c
new file mode 100644
index 0000000..8217e32
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib_sysfs.c
@@ -0,0 +1,435 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/in.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/inet.h>
+#include <linux/rtnetlink.h>
+#include <linux/etherdevice.h>
+#include <net/net_namespace.h>
+
+#include "eth_ipoib.h"
+
+#define to_dev(obj)	container_of(obj, struct device, kobj)
+#define to_parent(cd)	((struct parent *)(netdev_priv(to_net_dev(cd))))
+#define MOD_NA_STRING		"N/A"
+
+#define _sprintf(p, buf, format, arg...)				\
+((PAGE_SIZE - (int)(p - buf)) <= 0 ? 0 :				\
+	scnprintf(p, PAGE_SIZE - (int)(p - buf), format, ## arg))\
+
+#define _end_of_line(_p, _buf)					\
+do { if (_p - _buf) /* eat the leftover space */			\
+		buf[_p - _buf - 1] = '\n';				\
+} while (0)
+
+/* helper functions */
+static int get_emac(u8 *mac, char *s)
+{
+	if (sscanf(s, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
+		   mac + 0, mac + 1, mac + 2, mac + 3, mac + 4,
+		   mac + 5) != 6)
+		return -1;
+
+	return 0;
+}
+
+static int get_imac(u8 *mac, char *s)
+{
+	if (sscanf(s, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:"
+		   "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:"
+		   "%hhx:%hhx:%hhx:%hhx",
+		   mac + 0, mac + 1, mac + 2, mac + 3, mac + 4,
+		   mac + 5, mac + 6, mac + 7, mac + 8, mac + 9,
+		   mac + 10, mac + 11, mac + 12, mac + 13,
+		   mac + 14, mac + 15, mac + 16, mac + 17,
+		   mac + 18, mac + 19) != 20)
+		return -1;
+
+	return 0;
+}
+
+/* show/store functions per module (CLASS_ATTR) */
+static ssize_t show_parents(struct class *cls, struct class_attribute *attr,
+			    char *buf)
+{
+	char *p = buf;
+	struct parent *parent;
+
+	rtnl_lock(); /* because of parent_dev_list */
+
+	list_for_each_entry(parent, &parent_dev_list, parent_list) {
+		p += _sprintf(p, buf, "%s over IB port: %s\n",
+			      parent->dev->name,
+			      parent->ipoib_main_interface);
+	}
+	_end_of_line(p, buf);
+
+	rtnl_unlock();
+	return (ssize_t)(p - buf);
+}
+
+/* show/store functions per parent (DEVICE_ATTR) */
+static ssize_t parent_show_neighs(struct device *d,
+				  struct device_attribute *attr, char *buf)
+{
+	struct slave *slave;
+	struct neigh *neigh;
+	struct parent *parent = to_parent(d);
+	char *p = buf;
+
+	read_lock_bh(&parent->lock);
+	parent_for_each_slave(parent, slave) {
+		list_for_each_entry(neigh, &slave->neigh_list, list) {
+			p += _sprintf(p, buf, "SLAVE=%-10s EMAC=%pM IMAC=%pM:%pM:%pM:%.2x:%.2x\n",
+				      slave->dev->name,
+				      neigh->emac,
+				      neigh->imac, neigh->imac + 6, neigh->imac + 12,
+				      neigh->imac[18], neigh->imac[19]);
+		}
+	}
+
+	read_unlock_bh(&parent->lock);
+
+	_end_of_line(p, buf);
+
+	return (ssize_t)(p - buf);
+}
+
+struct neigh *parent_get_neigh_cmd(char op,
+				   char *ifname, u8 *remac, u8 *rimac)
+{
+	struct neigh *neigh_cmd;
+
+	neigh_cmd = kzalloc(sizeof *neigh_cmd, GFP_ATOMIC);
+	if (!neigh_cmd) {
+		pr_err("%s cannot allocate neigh struct\n", ifname);
+		goto out;
+	}
+
+	/*
+	 * populate emac field so it can be used easily
+	 * in neigh_cmd_find_by_mac()
+	 */
+	memcpy(neigh_cmd->emac, remac, ETH_ALEN);
+	memcpy(neigh_cmd->imac, rimac, INFINIBAND_ALEN);
+
+	/* prepare the command as a string */
+	sprintf(neigh_cmd->cmd, "%c%s %pM %pM:%pM:%pM:%.2x:%.2x",
+		op, ifname, remac, rimac, rimac + 6, rimac + 12, rimac[18], rimac[19]);
+out:
+	return neigh_cmd;
+}
+
+/* write_lock_bh(&parent->lock) must be held */
+ssize_t __parent_store_neighs(struct device *d,
+			      struct device_attribute *attr,
+			      const char *buffer, size_t count)
+{
+	char command[IFNAMSIZ + 1] = { 0, };
+	char emac_str[ETH_ALEN * 3] = { 0, };
+	u8 emac[ETH_ALEN];
+	char imac_str[INFINIBAND_ALEN * 3] = { 0, };
+	u8 imac[INFINIBAND_ALEN];
+	char *ifname;
+	int found = 0, ret = count;
+	struct slave *slave = NULL, *slave_tmp;
+	struct neigh *neigh;
+	struct parent *parent = to_parent(d);
+
+	sscanf(buffer, "%s %s %s", command, emac_str, imac_str);
+
+	/* check ifname */
+	ifname = command + 1;
+	if ((strlen(command) <= 1) || !dev_valid_name(ifname) ||
+	    (command[0] != '+' && command[0] != '-'))
+		goto err_no_cmd;
+
+	/* check if ifname exist */
+	parent_for_each_slave(parent, slave_tmp) {
+		if (!strcmp(slave_tmp->dev->name, ifname)) {
+			found = 1;
+			slave = slave_tmp;
+		}
+	}
+
+	if (!found) {
+		pr_err("%s could not find slave\n", ifname);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (get_emac(emac, emac_str)) {
+		pr_err("%s bad emac %s\n", ifname, emac_str);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (get_imac(imac, imac_str)) {
+		pr_err("%s bad imac %s\n", ifname, imac_str);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* process command */
+	if (command[0] == '+') {
+		found = 0;
+		list_for_each_entry(neigh, &slave->neigh_list, list) {
+			if (!memcmp(neigh->emac, emac, ETH_ALEN))
+				found = 1;
+		}
+
+		if (found) {
+			pr_err("%s: cannot update neigh, slave already has "
+			       "this neigh mac %pM\n",
+			       slave->dev->name, emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		neigh = kzalloc(sizeof *neigh, GFP_ATOMIC);
+		if (!neigh) {
+			pr_err("%s cannot allocate neigh struct\n",
+			       slave->dev->name);
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		/* ready to go */
+		pr_info("%s: slave %s neigh mac is set to %pM\n",
+			ifname, parent->dev->name, emac);
+		memcpy(neigh->emac, emac, ETH_ALEN);
+		memcpy(neigh->imac, imac, INFINIBAND_ALEN);
+
+		list_add_tail(&neigh->list, &slave->neigh_list);
+
+		goto out;
+	}
+
+	if (command[0] == '-') {
+		found = 0;
+		list_for_each_entry(neigh, &slave->neigh_list, list) {
+			if (!memcmp(neigh->emac, emac, ETH_ALEN))
+				found = 1;
+		}
+
+		if (!found) {
+			pr_err("%s cannot delete neigh mac %pM\n",
+			       ifname, emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		list_del(&neigh->list);
+		kfree(neigh);
+
+		goto out;
+	}
+
+err_no_cmd:
+	pr_err("%s USAGE: (-|+)ifname emac imac\n", DRV_NAME);
+	ret = -EPERM;
+
+out:
+	return ret;
+}
+
+static DEVICE_ATTR(neighs, S_IRUGO, parent_show_neighs,
+		   NULL);
+
+static ssize_t parent_show_vifs(struct device *d,
+				struct device_attribute *attr, char *buf)
+{
+	struct slave *slave;
+	struct parent *parent = to_parent(d);
+	char *p = buf;
+
+	read_lock_bh(&parent->lock);
+	parent_for_each_slave(parent, slave) {
+		if (is_zero_ether_addr(slave->emac)) {
+			p += _sprintf(p, buf, "SLAVE=%-10s MAC=%-17s "
+				      "VLAN=%s\n", slave->dev->name,
+				      MOD_NA_STRING, MOD_NA_STRING);
+		} else if (slave->vlan == VLAN_N_VID) {
+			p += _sprintf(p, buf, "SLAVE=%-10s MAC=%pM VLAN=%s\n",
+				      slave->dev->name,
+				      slave->emac,
+				      MOD_NA_STRING);
+		} else {
+			p += _sprintf(p, buf, "SLAVE=%-10s MAC=%pM VLAN=%d\n",
+				      slave->dev->name,
+				      slave->emac,
+				      slave->vlan);
+		}
+	}
+	read_unlock_bh(&parent->lock);
+
+	_end_of_line(p, buf);
+
+	return (ssize_t)(p - buf);
+}
+
+static DEVICE_ATTR(vifs, S_IRUGO, parent_show_vifs,
+		   NULL);
+
+static ssize_t parent_show_slaves(struct device *d,
+				  struct device_attribute *attr, char *buf)
+{
+	struct slave *slave;
+	struct parent *parent = to_parent(d);
+	char *p = buf;
+
+	read_lock_bh(&parent->lock);
+	parent_for_each_slave(parent, slave)
+		p += _sprintf(p, buf, "%s\n", slave->dev->name);
+	read_unlock_bh(&parent->lock);
+
+	_end_of_line(p, buf);
+
+	return (ssize_t)(p - buf);
+}
+
+static DEVICE_ATTR(slaves, S_IRUGO, parent_show_slaves,
+		   NULL);
+
+/* sysfs create/destroy functions */
+static struct attribute *per_parent_attrs[] = {
+	&dev_attr_slaves.attr, /* DEVICE_ATTR(slaves..) */
+	&dev_attr_vifs.attr,
+	&dev_attr_neighs.attr,
+	NULL,
+};
+
+/* name spcase  support */
+static const void *eipoib_namespace(struct class *cls,
+				    const struct class_attribute *attr)
+{
+	const struct eipoib_net *eipoib_n =
+		container_of(attr,
+			     struct eipoib_net, class_attr_eipoib_interfaces);
+	return eipoib_n->net;
+}
+
+static struct attribute_group parent_group = {
+	/* per parent sysfs files under: /sys/class/net/<IF>/eth/.. */
+	.name = "eth",
+	.attrs = per_parent_attrs
+};
+
+int create_slave_symlinks(struct net_device *master,
+			  struct net_device *slave)
+{
+	char linkname[IFNAMSIZ+7];
+	int ret = 0;
+
+	ret = sysfs_create_link(&(slave->dev.kobj), &(master->dev.kobj),
+				"eth_parent");
+	if (ret)
+		return ret;
+
+	sprintf(linkname, "slave_%s", slave->name);
+	ret = sysfs_create_link(&(master->dev.kobj), &(slave->dev.kobj),
+				linkname);
+	return ret;
+
+}
+
+void destroy_slave_symlinks(struct net_device *master,
+			    struct net_device *slave)
+{
+	char linkname[IFNAMSIZ+7];
+
+	sysfs_remove_link(&(slave->dev.kobj), "eth_parent");
+	sprintf(linkname, "slave_%s", slave->name);
+	sysfs_remove_link(&(master->dev.kobj), linkname);
+}
+
+static struct class_attribute class_attr_eth_ipoib_interfaces = {
+	.attr = {
+		.name = "eth_ipoib_interfaces",
+		.mode = S_IWUSR | S_IRUGO,
+	},
+	.show = show_parents,
+	.namespace = eipoib_namespace,
+};
+
+/* per module sysfs file under: /sys/class/net/eth_ipoib_interfaces */
+int mod_create_sysfs(struct eipoib_net *eipoib_n)
+{
+	int rc;
+	/* defined in CLASS_ATTR(eth_ipoib_interfaces..) */
+	eipoib_n->class_attr_eipoib_interfaces =
+		class_attr_eth_ipoib_interfaces;
+
+	sysfs_attr_init(&eipoib_n->class_attr_eipoib_interfaces.attr);
+
+	rc = netdev_class_create_file(&eipoib_n->class_attr_eipoib_interfaces);
+	if (rc)
+		pr_err("%s failed to create sysfs (rc %d)\n",
+		       eipoib_n->class_attr_eipoib_interfaces.attr.name, rc);
+
+	return rc;
+}
+
+void mod_destroy_sysfs(struct eipoib_net *eipoib_n)
+{
+	netdev_class_remove_file(&eipoib_n->class_attr_eipoib_interfaces);
+}
+
+int parent_create_sysfs_entry(struct parent *parent)
+{
+	struct net_device *dev = parent->dev;
+	int rc;
+
+	rc = sysfs_create_group(&(dev->dev.kobj), &parent_group);
+	if (rc)
+		pr_info("failed to create sysfs group\n");
+
+	return rc;
+}
+
+void parent_destroy_sysfs_entry(struct parent *parent)
+{
+	struct net_device *dev = parent->dev;
+
+	sysfs_remove_group(&(dev->dev.kobj), &parent_group);
+}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 11/12] net/eipoib: Add Makefile, Kconfig and MAINTAINERS entries
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
                   ` (9 preceding siblings ...)
  2012-08-01 17:09 ` [PATCH V2 10/12] net/eipoib: Add sysfs support Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  2012-08-01 17:09 ` [PATCH V2 12/12] IB/ipoib: Add support for transmission of skbs w.o dst/neighbour Or Gerlitz
  11 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

Add Kconfig entry under drivers/net and MAINTAINERS entry for eIPoIB, also
add the driver makefile.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 MAINTAINERS                 |    6 ++++++
 drivers/net/Kconfig         |   15 +++++++++++++++
 drivers/net/Makefile        |    1 +
 drivers/net/eipoib/Makefile |    4 ++++
 4 files changed, 26 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/Makefile

diff --git a/MAINTAINERS b/MAINTAINERS
index bd45164..09cc33a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2645,6 +2645,12 @@ L:	netdev@vger.kernel.org
 S:	Maintained
 F:	drivers/net/ethernet/ibm/ehea/
 
+EIPoIB (Ethernet services over IPoIB) DRIVER
+M:	Erez Shitrit <erezsh@mellanox.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	drivers/net/eipoib/
+
 EMBEDDED LINUX
 M:	Paul Gortmaker <paul.gortmaker@windriver.com>
 M:	Matt Mackall <mpm@selenic.com>
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 0c2bd80..ba98f61 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -68,6 +68,21 @@ config DUMMY
 	  To compile this driver as a module, choose M here: the module
 	  will be called dummy.
 
+config E_IPOIB
+	tristate "Ethernet Services over IPoIB"
+	depends on INFINIBAND_IPOIB
+	---help---
+	  This driver supports Ethernet protocol over InfiniBand IPoIB devices.
+	  Some services can run only on top of Ethernet L2 interfaces, and
+	  cannot be bound to an IPoIB interface.
+	  With this new driver, these services can run seamlessly.
+
+	  Main use case of the driver is the Ethernet Virtual Switching used in
+	  virtualized environments, where an eipoib netdevice can be used as a
+	  Physical Interface (PIF) in the hypervisor domain, and allow other guests
+	  Virtual Interfaces (VIF) connected to the same Virtual Switch to run over
+	  the InfiniBand fabric.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 3d375ca..2c3409e 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -31,6 +31,7 @@ obj-$(CONFIG_CAIF) += caif/
 obj-$(CONFIG_CAN) += can/
 obj-$(CONFIG_ETRAX_ETHERNET) += cris/
 obj-$(CONFIG_NET_DSA) += dsa/
+obj-$(CONFIG_E_IPOIB) += eipoib/
 obj-$(CONFIG_ETHERNET) += ethernet/
 obj-$(CONFIG_FDDI) += fddi/
 obj-$(CONFIG_HIPPI) += hippi/
diff --git a/drivers/net/eipoib/Makefile b/drivers/net/eipoib/Makefile
new file mode 100644
index 0000000..b64e96e
--- /dev/null
+++ b/drivers/net/eipoib/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_E_IPOIB)                         := eth_ipoib.o
+eth_ipoib-y                                    := eth_ipoib_main.o \
+                                                  eth_ipoib_sysfs.o \
+                                                  eth_ipoib_ethtool.o
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V2 12/12] IB/ipoib: Add support for transmission of skbs w.o dst/neighbour
  2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
                   ` (10 preceding siblings ...)
  2012-08-01 17:09 ` [PATCH V2 11/12] net/eipoib: Add Makefile, Kconfig and MAINTAINERS entries Or Gerlitz
@ 2012-08-01 17:09 ` Or Gerlitz
  11 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-01 17:09 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh@mellanox.co.il>

Guest IP packets sent by eIPoIB over an IPoIB VIF interface do not point
to dst or neighbour. This patch modifies IPoIB such that trasnmission
of such packets is possible. It does so by extending an already existing
path in the driver which was used so far only for unicast ARP probes.

This patch was made such that the series works as is over net-next, in
parallel to this driver a patch to modify IPoIB such that it doesn't
assume dst/neighbour on the skb was pushed upstream, its commit b63b70d87741 
"IPoIB: Use a private hash table for path lookup in xmit path", once
present in net-next this patch can simply be dropped.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 47034b4..fc4a1f0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -732,7 +732,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	unsigned long flags;
 
 	rcu_read_lock();
-	if (likely(skb_dst(skb))) {
+	if (likely(skb_dst(skb)) && !(dev->priv_flags & IFF_EIPOIB_VIF)) {
 		n = dst_neigh_lookup_skb(skb_dst(skb), skb);
 		if (!n) {
 			++dev->stats.tx_dropped;
@@ -800,7 +800,8 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			/* unicast GID -- should be ARP or RARP reply */
 
 			if ((be16_to_cpup((__be16 *) skb->data) != ETH_P_ARP) &&
-			    (be16_to_cpup((__be16 *) skb->data) != ETH_P_RARP)) {
+			    (be16_to_cpup((__be16 *) skb->data) != ETH_P_RARP) &&
+				!(dev->priv_flags & IFF_EIPOIB_VIF)) {
 				ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x %pI6\n",
 					   skb_dst(skb) ? "neigh" : "dst",
 					   be16_to_cpup((__be16 *) skb->data),
@@ -850,7 +851,7 @@ static int ipoib_hard_header(struct sk_buff *skb,
 	 * destination address into skb->cb so we can figure out where
 	 * to send the packet later.
 	 */
-	if (!skb_dst(skb)) {
+	if (!skb_dst(skb) || dev->priv_flags & IFF_EIPOIB_VIF) {
 		struct ipoib_cb *cb = (struct ipoib_cb *) skb->cb;
 		memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 05/12] net: Add ndo_set_vif_param operation to serve eIPoIB VIFs
  2012-08-01 17:09 ` [PATCH V2 05/12] net: Add ndo_set_vif_param operation to serve eIPoIB VIFs Or Gerlitz
@ 2012-08-02  0:17   ` Ben Hutchings
  2012-08-02  8:25     ` Erez Shitrit
  0 siblings, 1 reply; 71+ messages in thread
From: Ben Hutchings @ 2012-08-02  0:17 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, roland, netdev, ali, sean.hefty, Erez Shitrit

On Wed, 2012-08-01 at 20:09 +0300, Or Gerlitz wrote:
> From: Erez Shitrit <erezsh@mellanox.co.il>
> 
> The Ethernet IPoIB driver enslaves IPoIB devices and uses them as
> VIFs (Virtual Interface) which serve an Ethernet NIC e.g present in a 
> guest OS. For each such slave that acts as a VIF, eIPoIB needs to know 
> the mac and optionally the vlan uses by that NIC, the new ndo opertaion 
> is used to associate the mac/vlan for that slave.
> 
> Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>  include/linux/netdevice.h |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eb06e58..2709551 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1013,7 +1013,10 @@ struct net_device_ops {
>  						    netdev_features_t features);
>  	int			(*ndo_neigh_construct)(struct neighbour *n);
>  	void			(*ndo_neigh_destroy)(struct neighbour *n);
> -
> +	int			(*ndo_set_vif_param) (struct net_device *dev,
> +						      struct net_device *vif,
> +						      u16 vlan,
> +						      u8 *mac);

The semantics of this operation should be documented in the comment
above the structure definition.  One detail worth covering is whether
'vlan' is just a VID or can also include priority+CFI bits.

If this is specific to eIPoIB, why not put that in the name of the
operation?  If not, this *really* needs explaining because so far I have
no whether it is something I should consider implementing on a real
Ethernet device.

Ben.

>  	int			(*ndo_fdb_add)(struct ndmsg *ndm,
>  					       struct net_device *dev,
>  					       unsigned char *addr,

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 06/12] net/core: Add rtnetlink support to vif parameters
  2012-08-01 17:09 ` [PATCH V2 06/12] net/core: Add rtnetlink support to vif parameters Or Gerlitz
@ 2012-08-02  0:20   ` Ben Hutchings
  2012-08-02 15:29     ` Erez Shitrit
  0 siblings, 1 reply; 71+ messages in thread
From: Ben Hutchings @ 2012-08-02  0:20 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, roland, netdev, ali, sean.hefty, Erez Shitrit

On Wed, 2012-08-01 at 20:09 +0300, Or Gerlitz wrote:
> From: Erez Shitrit <erezsh@mellanox.co.il>
> 
> Add IFLA_VIF_INFO which allows user space to configure mac/vlan for a VIF
[...]
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -780,7 +780,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
>  	       + rtnl_vfinfo_size(dev, ext_filter_mask) /* IFLA_VFINFO_LIST */
>  	       + rtnl_port_size(dev) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
>  	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
> -	       + rtnl_link_get_af_size(dev); /* IFLA_AF_SPEC */
> +	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
> +	       + nla_total_size(sizeof(struct ifla_vif_param)); /* IFLA_VIF_INFO */
>  }
[...]

if_nlmsg_size() returns the size of a message describing the interface.
But IFLA_VIF_INFO is write-only (why?) and therefore shouldn't be
included.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 08/12] net/eipoib: Add ethtool file support
  2012-08-01 17:09 ` [PATCH V2 08/12] net/eipoib: Add ethtool file support Or Gerlitz
@ 2012-08-02  0:22   ` Ben Hutchings
  2012-08-02  8:35     ` Erez Shitrit
  0 siblings, 1 reply; 71+ messages in thread
From: Ben Hutchings @ 2012-08-02  0:22 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, roland, netdev, ali, sean.hefty, Erez Shitrit

On Wed, 2012-08-01 at 20:09 +0300, Or Gerlitz wrote:
> From: Erez Shitrit <erezsh@mellanox.co.il>
> 
> Via ethtool the driver describes its version, ABI version, on what PIF
> interface it runs and various statistics.
[...]
> --- /dev/null
> +++ b/drivers/net/eipoib/eth_ipoib_ethtool.c
[...]
> +static void parent_ethtool_get_drvinfo(struct net_device *parent_dev,
> +				       struct ethtool_drvinfo *drvinfo)
> +{
> +	struct parent *parent = netdev_priv(parent_dev);
> +
> +	strncpy(drvinfo->driver, DRV_NAME, 32);
> +
> +	strncpy(drvinfo->version, DRV_VERSION, 32);
> +
> +	strncpy(drvinfo->bus_info, parent->ipoib_main_interface,
> +		ETHTOOL_BUSINFO_LEN);

These must be null-terminated; therefore use strlcpy().

> +	/* indicates ABI version */
> +	snprintf(drvinfo->fw_version, 32, "%d", EIPOIB_ABI_VER);
[...]

This is an abuse of fw_version.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 05/12] net: Add ndo_set_vif_param operation to serve eIPoIB VIFs
  2012-08-02  0:17   ` Ben Hutchings
@ 2012-08-02  8:25     ` Erez Shitrit
  0 siblings, 0 replies; 71+ messages in thread
From: Erez Shitrit @ 2012-08-02  8:25 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Or Gerlitz, davem, roland, netdev, ali, sean.hefty, Erez Shitrit

On 8/2/2012 3:17 AM, Ben Hutchings wrote:
> On Wed, 2012-08-01 at 20:09 +0300, Or Gerlitz wrote:
>> From: Erez Shitrit <erezsh@mellanox.co.il>
>>
>> The Ethernet IPoIB driver enslaves IPoIB devices and uses them as
>> VIFs (Virtual Interface) which serve an Ethernet NIC e.g present in a
>> guest OS. For each such slave that acts as a VIF, eIPoIB needs to know
>> the mac and optionally the vlan uses by that NIC, the new ndo opertaion
>> is used to associate the mac/vlan for that slave.
>>
>> Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
>> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>> ---
>>   include/linux/netdevice.h |    5 ++++-
>>   1 files changed, 4 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index eb06e58..2709551 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1013,7 +1013,10 @@ struct net_device_ops {
>>   						    netdev_features_t features);
>>   	int			(*ndo_neigh_construct)(struct neighbour *n);
>>   	void			(*ndo_neigh_destroy)(struct neighbour *n);
>> -
>> +	int			(*ndo_set_vif_param) (struct net_device *dev,
>> +						      struct net_device *vif,
>> +						      u16 vlan,
>> +						      u8 *mac);
> The semantics of this operation should be documented in the comment
> above the structure definition.  One detail worth covering is whether
> 'vlan' is just a VID or can also include priority+CFI bits.
We will add more documentation for that.

The idea was just for the VID, (in our driver at least)

>
> If this is specific to eIPoIB, why not put that in the name of the
> operation?  If not, this *really* needs explaining because so far I have
> no whether it is something I should consider implementing on a real
> Ethernet device.
>
> Ben.
Will add more documentation here, perhaps other drivers can use it as 
well for visualization uses and more.
Thanks.

>>   	int			(*ndo_fdb_add)(struct ndmsg *ndm,
>>   					       struct net_device *dev,
>>   					       unsigned char *addr,

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 08/12] net/eipoib: Add ethtool file support
  2012-08-02  0:22   ` Ben Hutchings
@ 2012-08-02  8:35     ` Erez Shitrit
  2012-08-02 15:42       ` Ben Hutchings
  0 siblings, 1 reply; 71+ messages in thread
From: Erez Shitrit @ 2012-08-02  8:35 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Or Gerlitz, davem, roland, netdev, ali, sean.hefty, Erez Shitrit

On 8/2/2012 3:22 AM, Ben Hutchings wrote:
> On Wed, 2012-08-01 at 20:09 +0300, Or Gerlitz wrote:
>> From: Erez Shitrit <erezsh@mellanox.co.il>
>>
>> Via ethtool the driver describes its version, ABI version, on what PIF
>> interface it runs and various statistics.
> [...]
>> --- /dev/null
>> +++ b/drivers/net/eipoib/eth_ipoib_ethtool.c
> [...]
>> +static void parent_ethtool_get_drvinfo(struct net_device *parent_dev,
>> +				       struct ethtool_drvinfo *drvinfo)
>> +{
>> +	struct parent *parent = netdev_priv(parent_dev);
>> +
>> +	strncpy(drvinfo->driver, DRV_NAME, 32);
>> +
>> +	strncpy(drvinfo->version, DRV_VERSION, 32);
>> +
>> +	strncpy(drvinfo->bus_info, parent->ipoib_main_interface,
>> +		ETHTOOL_BUSINFO_LEN);
> These must be null-terminated; therefore use strlcpy().

ok, will fix.

>> +	/* indicates ABI version */
>> +	snprintf(drvinfo->fw_version, 32, "%d", EIPOIB_ABI_VER);
> [...]
>
> This is an abuse of fw_version.
>
> Ben.
we took the idea from the bonding driver,
(snprintf(drvinfo->fw_version, 32, "%d", BOND_ABI_VERSION);)

Do you have any idea where can we keep the abi version?

Thanks, Erez

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 06/12] net/core: Add rtnetlink support to vif parameters
  2012-08-02  0:20   ` Ben Hutchings
@ 2012-08-02 15:29     ` Erez Shitrit
  0 siblings, 0 replies; 71+ messages in thread
From: Erez Shitrit @ 2012-08-02 15:29 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Or Gerlitz, davem, roland, netdev, ali, sean.hefty, Erez Shitrit

On 8/2/2012 3:20 AM, Ben Hutchings wrote:
> On Wed, 2012-08-01 at 20:09 +0300, Or Gerlitz wrote:
>> From: Erez Shitrit <erezsh@mellanox.co.il>
>>
>> Add IFLA_VIF_INFO which allows user space to configure mac/vlan for a VIF
> [...]
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -780,7 +780,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
>>   	       + rtnl_vfinfo_size(dev, ext_filter_mask) /* IFLA_VFINFO_LIST */
>>   	       + rtnl_port_size(dev) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
>>   	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
>> -	       + rtnl_link_get_af_size(dev); /* IFLA_AF_SPEC */
>> +	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
>> +	       + nla_total_size(sizeof(struct ifla_vif_param)); /* IFLA_VIF_INFO */
>>   }
> [...]
>
> if_nlmsg_size() returns the size of a message describing the interface.
> But IFLA_VIF_INFO is write-only (why?) and therefore shouldn't be
> included.
>
> Ben.

thanks, we missed that.
currently, we don't expose the vifs on its parent interface, we need to 
think if we want to expose it, if yes we will add "get" option, 
otherwise we will take it from if_nlmsg_size() function.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 08/12] net/eipoib: Add ethtool file support
  2012-08-02  8:35     ` Erez Shitrit
@ 2012-08-02 15:42       ` Ben Hutchings
  0 siblings, 0 replies; 71+ messages in thread
From: Ben Hutchings @ 2012-08-02 15:42 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: Or Gerlitz, davem, roland, netdev, ali, sean.hefty, Erez Shitrit

On Thu, 2012-08-02 at 11:35 +0300, Erez Shitrit wrote:
> On 8/2/2012 3:22 AM, Ben Hutchings wrote:
> > On Wed, 2012-08-01 at 20:09 +0300, Or Gerlitz wrote:
> >> From: Erez Shitrit <erezsh@mellanox.co.il>
> >>
> >> Via ethtool the driver describes its version, ABI version, on what PIF
> >> interface it runs and various statistics.
> > [...]
> >> --- /dev/null
> >> +++ b/drivers/net/eipoib/eth_ipoib_ethtool.c
> > [...]
> >> +static void parent_ethtool_get_drvinfo(struct net_device *parent_dev,
> >> +				       struct ethtool_drvinfo *drvinfo)
> >> +{
> >> +	struct parent *parent = netdev_priv(parent_dev);
> >> +
> >> +	strncpy(drvinfo->driver, DRV_NAME, 32);
> >> +
> >> +	strncpy(drvinfo->version, DRV_VERSION, 32);
> >> +
> >> +	strncpy(drvinfo->bus_info, parent->ipoib_main_interface,
> >> +		ETHTOOL_BUSINFO_LEN);
> > These must be null-terminated; therefore use strlcpy().
> 
> ok, will fix.
> 
> >> +	/* indicates ABI version */
> >> +	snprintf(drvinfo->fw_version, 32, "%d", EIPOIB_ABI_VER);
> > [...]
> >
> > This is an abuse of fw_version.
> >
> > Ben.
> we took the idea from the bonding driver,
> (snprintf(drvinfo->fw_version, 32, "%d", BOND_ABI_VERSION);)

The bonding driver has lots of warts.

> Do you have any idea where can we keep the abi version?

You don't need to, because David will insist that you will only change
the ABI in a backward-compatible way. :-)

(The bonding ABI version hasn't changed since Linux 2.6.3, and even then
it provided backward compatibility.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-01 17:09 ` [PATCH V2 09/12] net/eipoib: Add main driver functionality Or Gerlitz
@ 2012-08-02 17:15   ` Eric W. Biederman
  2012-08-03 20:31     ` Ali Ayoub
  2012-08-05 18:50     ` Michael S. Tsirkin
  0 siblings, 2 replies; 71+ messages in thread
From: Eric W. Biederman @ 2012-08-02 17:15 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, roland, netdev, ali, sean.hefty, Erez Shitrit

Or Gerlitz <ogerlitz@mellanox.com> writes:

> From: Erez Shitrit <erezsh@mellanox.co.il>
>
> The eipoib driver provides a standard Ethernet netdevice over
> the InfiniBand IPoIB interface .
>
> Some services can run only on top of Ethernet L2 interfaces, and cannot be
> bound to an IPoIB interface. With this new driver, these services can run
> seamlessly.

Do I read this code correctly that what you are doing is not tunneling
ethernet over IB but instead you are removing an ethernet header and
replacing it with an IB header?

Do I also read this code correctly if you can't find your destination
mac address in your ""neighbor table"" you do a normal IPoIB arp
for the infiniband GUID?

Do I read this right that if presented with a non-IPv4 or ARP packet
this code will do something undefined and unpredictable?

Maybe this makes some sense but just skimming it looks like you
are trying to force a square peg into a round hole resulting in
some weird code and some very weird maintainability issues.

I am honestly surprised at this approach.  I would think it would be
faster and simpler to run an IB queue pair directly to the hypervisor or
possibly even the guest operating system bypassing the kernel and doing
all of this translation in userspace.

Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-02 17:15   ` Eric W. Biederman
@ 2012-08-03 20:31     ` Ali Ayoub
  2012-08-03 21:33       ` David Miller
  2012-08-05 18:50     ` Michael S. Tsirkin
  1 sibling, 1 reply; 71+ messages in thread
From: Ali Ayoub @ 2012-08-03 20:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Or Gerlitz, davem, roland, netdev, sean.hefty, Erez Shitrit

On 8/2/2012 10:15 AM, Eric W. Biederman wrote:
> Or Gerlitz <ogerlitz@mellanox.com> writes:
> 
>> From: Erez Shitrit <erezsh@mellanox.co.il>
>>
>> The eipoib driver provides a standard Ethernet netdevice over
>> the InfiniBand IPoIB interface .
>>
>> Some services can run only on top of Ethernet L2 interfaces, and cannot be
>> bound to an IPoIB interface. With this new driver, these services can run
>> seamlessly.
> 
> Do I read this code correctly that what you are doing is not tunneling
> ethernet over IB but instead you are removing an ethernet header and
> replacing it with an IB header?
Correct.
eIPoIB runs standard IPoIB on the wire, thus it doesn't encapsulate the
Ethernet frame on top of IPoIB, but rather translates it to an IPoIB
packet, this allows us to expose an Ethernet L2 network device, and
still keep interoperability with existing IPoIB endpoints. Running full
encapsulation (i.e. EoIPoIB) will break interoperability.

> Do I also read this code correctly if you can't find your destination
> mac address in your ""neighbor table"" you do a normal IPoIB arp
> for the infiniband GUID?
Correct.
Wire protocol remains IPoIB.

> Do I read this right that if presented with a non-IPv4 or ARP packet
> this code will do something undefined and unpredictable?
The current code drops IPv6 packets (see IS_E_IPOIB_PROTO), IPv6 support
will be added later on.

> Maybe this makes some sense but just skimming it looks like you
> are trying to force a square peg into a round hole resulting in
> some weird code and some very weird maintainability issues.
> 
> I am honestly surprised at this approach.  I would think it would be
> faster and simpler to run an IB queue pair directly to the hypervisor or
> possibly even the guest operating system bypassing the kernel and doing
> all of this translation in userspace.
With eIPoIB architecture, the VM sees standard Ethernet emulator,
allowing the administrator to enslave eIPoIB PIF to the vSwitch/vBridge
as if it was standard Ethernet. Other approaches that exposes IB QP to
the VM (with w/o bypassing the kernel) won't be possible with the
current emulators and management tools.

--
Ali;

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-03 20:31     ` Ali Ayoub
@ 2012-08-03 21:33       ` David Miller
  2012-08-03 22:39         ` Ali Ayoub
  0 siblings, 1 reply; 71+ messages in thread
From: David Miller @ 2012-08-03 21:33 UTC (permalink / raw)
  To: ali; +Cc: ebiederm, ogerlitz, roland, netdev, sean.hefty, erezsh

From: Ali Ayoub <ali@mellanox.com>
Date: Fri, 03 Aug 2012 13:31:35 -0700

> With eIPoIB architecture, the VM sees standard Ethernet emulator,
> allowing the administrator to enslave eIPoIB PIF to the vSwitch/vBridge
> as if it was standard Ethernet. Other approaches that exposes IB QP to
> the VM (with w/o bypassing the kernel) won't be possible with the
> current emulators and management tools.

So then fix the emulators and management tools to handle IB instead
of adding this bogus new protocol?

This new protocol seems to exist only because you don't want to have
to enhance the emulators and tools, and I'm sorry that isn't a valid
reason to do something like this.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-03 21:33       ` David Miller
@ 2012-08-03 22:39         ` Ali Ayoub
  2012-08-03 23:36           ` David Miller
  2012-08-04  0:02           ` Ali Ayoub
  0 siblings, 2 replies; 71+ messages in thread
From: Ali Ayoub @ 2012-08-03 22:39 UTC (permalink / raw)
  To: David Miller; +Cc: ebiederm, ogerlitz, roland, netdev, sean.hefty, erezsh

On 8/3/2012 2:33 PM, David Miller wrote:
> From: Ali Ayoub <ali@mellanox.com>
> Date: Fri, 03 Aug 2012 13:31:35 -0700
> 
>> With eIPoIB architecture, the VM sees standard Ethernet emulator,
>> allowing the administrator to enslave eIPoIB PIF to the vSwitch/vBridge
>> as if it was standard Ethernet. Other approaches that exposes IB QP to
>> the VM (with w/o bypassing the kernel) won't be possible with the
>> current emulators and management tools.
> 
> So then fix the emulators and management tools to handle IB instead
> of adding this bogus new protocol?
> 
> This new protocol seems to exist only because you don't want to have
> to enhance the emulators and tools, and I'm sorry that isn't a valid
> reason to do something like this.

This driver exists to allow the user to have an Ethernet interface on
top of a high-speed InfiniBand (IB) interconnect.
Users would like to use sockets API from the VM without re-writing their
applications on top of IB verbs, this driver meant to allow such a user
to do so.

Exposing IB emulators and having IB support in the management tools for
the VM/Hypervisor won't address the usecases that this driver meant for.

With this driver, existing VMs, and their existing IP applications, can
run as-is on InfiniBand network.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-03 22:39         ` Ali Ayoub
@ 2012-08-03 23:36           ` David Miller
  2012-08-04 21:23             ` Or Gerlitz
  2012-08-07  0:14             ` Ali Ayoub
  2012-08-04  0:02           ` Ali Ayoub
  1 sibling, 2 replies; 71+ messages in thread
From: David Miller @ 2012-08-03 23:36 UTC (permalink / raw)
  To: ali; +Cc: ebiederm, ogerlitz, roland, netdev, sean.hefty, erezsh

From: Ali Ayoub <ali@mellanox.com>
Date: Fri, 03 Aug 2012 15:39:36 -0700

> Users would like to use sockets API from the VM without re-writing their
> applications on top of IB verbs, this driver meant to allow such a user
> to do so.

That's what IPoIB was for, the application writers who don't want to have
to be knowledgable about IB verbs.

You're messing with the link layer here, and that's what is upsetting me.

It's a complete cop-out to changing the VM tools and emulators properly to
handle a new link layer.

The applications writers already have a way to use IB whilst using
something familiar, like IPv4, via IPoIB.  You're doing something
completely different here, and it stinks.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-03 22:39         ` Ali Ayoub
  2012-08-03 23:36           ` David Miller
@ 2012-08-04  0:02           ` Ali Ayoub
  2012-08-04  0:05             ` David Miller
  2012-08-04  1:34             ` Eric W. Biederman
  1 sibling, 2 replies; 71+ messages in thread
From: Ali Ayoub @ 2012-08-04  0:02 UTC (permalink / raw)
  To: David Miller; +Cc: ebiederm, ogerlitz, roland, netdev, sean.hefty, erezsh

On 8/3/2012 3:39 PM, Ali Ayoub wrote:
> On 8/3/2012 2:33 PM, David Miller wrote:
>> From: Ali Ayoub <ali@mellanox.com>
>> Date: Fri, 03 Aug 2012 13:31:35 -0700
>>
>>> With eIPoIB architecture, the VM sees standard Ethernet emulator,
>>> allowing the administrator to enslave eIPoIB PIF to the vSwitch/vBridge
>>> as if it was standard Ethernet. Other approaches that exposes IB QP to
>>> the VM (with w/o bypassing the kernel) won't be possible with the
>>> current emulators and management tools.
>>
>> So then fix the emulators and management tools to handle IB instead
>> of adding this bogus new protocol?
>>
>> This new protocol seems to exist only because you don't want to have
>> to enhance the emulators and tools, and I'm sorry that isn't a valid
>> reason to do something like this.
> 
> This driver exists to allow the user to have an Ethernet interface on
> top of a high-speed InfiniBand (IB) interconnect.
> Users would like to use sockets API from the VM without re-writing their
> applications on top of IB verbs, this driver meant to allow such a user
> to do so.
> 
> Exposing IB emulators and having IB support in the management tools for
> the VM/Hypervisor won't address the usecases that this driver meant for.
> 
> With this driver, existing VMs, and their existing IP applications, can
> run as-is on InfiniBand network.

This driver exists to allow the user to have an Ethernet interface on
top of a high-speed InfiniBand (IB) interconnect.
Users would like to use sockets API from the VM without re-writing their
applications on top of IB verbs, this driver meant to allow such a user
to do so.

Exposing IB emulators and having IB support in the management tools for
the VM/Hypervisor won't address the usecases that this driver meant for.

With this driver, existing VMs, and their existing IP applications, can
run as-is on InfiniBand network.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-04  0:02           ` Ali Ayoub
@ 2012-08-04  0:05             ` David Miller
  2012-08-04  1:34             ` Eric W. Biederman
  1 sibling, 0 replies; 71+ messages in thread
From: David Miller @ 2012-08-04  0:05 UTC (permalink / raw)
  To: ali; +Cc: ebiederm, ogerlitz, roland, netdev, sean.hefty, erezsh

From: Ali Ayoub <ali@mellanox.com>
Date: Fri, 03 Aug 2012 17:02:35 -0700

> On 8/3/2012 3:39 PM, Ali Ayoub wrote:
>> On 8/3/2012 2:33 PM, David Miller wrote:
>>> From: Ali Ayoub <ali@mellanox.com>
>>> Date: Fri, 03 Aug 2012 13:31:35 -0700
>>>
>>>> With eIPoIB architecture, the VM sees standard Ethernet emulator,
>>>> allowing the administrator to enslave eIPoIB PIF to the vSwitch/vBridge
>>>> as if it was standard Ethernet. Other approaches that exposes IB QP to
>>>> the VM (with w/o bypassing the kernel) won't be possible with the
>>>> current emulators and management tools.
>>>
>>> So then fix the emulators and management tools to handle IB instead
>>> of adding this bogus new protocol?
>>>
>>> This new protocol seems to exist only because you don't want to have
>>> to enhance the emulators and tools, and I'm sorry that isn't a valid
>>> reason to do something like this.
>> 
>> This driver exists to allow the user to have an Ethernet interface on
>> top of a high-speed InfiniBand (IB) interconnect.
>> Users would like to use sockets API from the VM without re-writing their
>> applications on top of IB verbs, this driver meant to allow such a user
>> to do so.
>> 
>> Exposing IB emulators and having IB support in the management tools for
>> the VM/Hypervisor won't address the usecases that this driver meant for.
>> 
>> With this driver, existing VMs, and their existing IP applications, can
>> run as-is on InfiniBand network.
> 
> This driver exists to allow the user to have an Ethernet interface on
> top of a high-speed InfiniBand (IB) interconnect.
> Users would like to use sockets API from the VM without re-writing their
> applications on top of IB verbs, this driver meant to allow such a user
> to do so.
> 
> Exposing IB emulators and having IB support in the management tools for
> the VM/Hypervisor won't address the usecases that this driver meant for.
> 
> With this driver, existing VMs, and their existing IP applications, can
> run as-is on InfiniBand network.

Just saying the same thing twice doesn't make your argument stronger.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-04  0:02           ` Ali Ayoub
  2012-08-04  0:05             ` David Miller
@ 2012-08-04  1:34             ` Eric W. Biederman
  2012-08-04 21:33               ` Or Gerlitz
  1 sibling, 1 reply; 71+ messages in thread
From: Eric W. Biederman @ 2012-08-04  1:34 UTC (permalink / raw)
  To: Ali Ayoub, David Miller; +Cc: ogerlitz, roland, netdev, sean.hefty, erezsh

Ali Ayoub <ali@mellanox.com> wrote:

>
>With this driver, existing VMs, and their existing IP applications, can
>run as-is on InfiniBand network.

Actually it doesn't work like that.

If my application really needs ethernet it will not work with your driver.  I can not run decnet or appletalk or ATAoE or PPPoE or LACP or use VLANs or any of a thousand other things that require real live ethernet to function.

Most VMs will talk to linux with a tap interface, and the output of a tap interface can be routed just fine, and that works with existing tools, and existing already deployed  kernels.

Alternatively if you are silly you can implement a tap interface on top of IPoIB and have the same interface you do now, and it works with existing already deployed kernels.

So eIPoIB does not make sense from a time to market perspective.

Similarly eIPoIB does not make sense from a performance standpoint because the best performance requires the applications and hypervisors are infiniband and teaching the apps to cope.

eIPoIB also has considerable maintenance overhead as it is complex code doing some crazy things.

Now personally NAPT44 just about made the internet unusable for peer to peer appications.  NAT66 aka network prefix translation introduces much less breakage but it still requires someone to run STUN on IPv6 to keep applications working, ick.  NATEIB aka eIPoIB just looks complex and already breaks most of ethernet and history says that kind of breakage actually matters a lot.   I think everyone who suggests any kind of NAT is a good idea should have to implement RFC 5245 Interacive Connectivity Establishment.  It takes 2 100+ page rfcs to come up with a way that  allows applications to with the address munging over the open internet.   That way takes 3000+ lines of code in each application.   That is the kind of complexity you are asking people up stream of you to deal with your NATTing o
 f ethernet to infiniband.  So I totally think eIPoIB stinks, and that it does not at all let existing applications work without modification.

Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-03 23:36           ` David Miller
@ 2012-08-04 21:23             ` Or Gerlitz
  2012-08-04 21:44               ` Or Gerlitz
  2012-08-07  0:14             ` Ali Ayoub
  1 sibling, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-04 21:23 UTC (permalink / raw)
  To: David Miller; +Cc: ali, ebiederm, ogerlitz, roland, netdev, sean.hefty, erezsh

On Sat, Aug 4, 2012 at 2:36 AM, David Miller <davem@davemloft.net> wrote:
> From: Ali Ayoub <ali@mellanox.com>
> Date: Fri, 03 Aug 2012 15:39:36 -0700

>> Users would like to use sockets API from the VM without re-writing their
>> applications on top of IB verbs, this driver meant to allow such a user
>> to do so.

> That's what IPoIB was for, the application writers who don't want to have
> to be knowledgable about IB verbs.
>
> You're messing with the link layer here, and that's what is upsetting me.
>
> It's a complete cop-out to changing the VM tools and emulators properly to
> handle a new link layer.
>
> The applications writers already have a way to use IB whilst using
> something familiar, like IPv4, via IPoIB.  You're doing something
> completely different here, and it stinks.


Dave,

Just quick recap of things to make sure we're in sync on the facts
here and (me) trying to better follow your arguments:

Bringing IB to app usage is done through the IB verbs / RDMA CM, which
is supported
for quite a long while for native mode and BTW as for SRIOV, there's a
now running patch set @ linux-rdma submitted to Roland
http://marc.info/?l=linux-rdma&m=134398354428293&w=2

IPoIB is meant to allow IP apps to use IB, defined by IETF spec (RFC 4391/4392)

The idea in eIPoIB was to allow IP apps running on VMs under a
Para-Virtual set of mind, e.g when the Linux PV networking stack comes
into play, to use that stack w.o modifying it.

When looking on that, we thought so far so good, and went in the way
posted here. If reusing your last sentence... this driver provides a
way for apps to use the PV stack AND IB whilst using something
familiar, like IPv4.


I'd like to better understand your "messing with the link layer here,
and that's what is upsetting" claim --- since the service here is IP,
and this service has well defined mechanisms, eIPoIB is basically a
shim layer that allows to provide this service for VMs that interacts
with the emulators, drivers and tools used in Linux and the related
guest OS PV drivers (virtio and friends). For example, that messing
mostly goes down to translating ARP requests/replies from being
carried in Ethernet frame to be carried in IPoIB packet, is that
really too hacky? why and what?

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-04  1:34             ` Eric W. Biederman
@ 2012-08-04 21:33               ` Or Gerlitz
  0 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-04 21:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ali Ayoub, David Miller, ogerlitz, roland, netdev, sean.hefty, erezsh

On Sat, Aug 4, 2012 at 4:34 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Ali Ayoub <ali@mellanox.com> wrote:

>>With this driver, existing VMs, and their existing IP applications, can
>>run as-is on InfiniBand network.

> Actually it doesn't work like that.
> If my application really needs ethernet it will not work with your driver.

Eric,

Starting to address your email

If your application really needs Ethernet, it will not work with
IPoIB, since the latter only supports IP apps. So this isn't the
problem we're trying to solve here, but rather allow IP app running in
PV mode to use IPoIB.

> I can not run decnet or appletalk or ATAoE or PPPoE or LACP

[...]

> or use VLANs

YES you can, vlan devices can be set over eipoib devices, and the
eipoib driver maps the vlan ID to IB Partition using a mapping defined
by the admin.

> or any of a thousand other things that require real live ethernet to function.
[...]
> Similarly eIPoIB does not make sense from a performance standpoint because
> the best performance requires the applications and hypervisors are infiniband
> and teaching the apps to cope.

agree 100% re the "best performance" part of this sentence, if your
app looks for best perf, yes you can use IB in either native or SRIOV
mode, as I mentioned in my response to Dave.

Or.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-04 21:23             ` Or Gerlitz
@ 2012-08-04 21:44               ` Or Gerlitz
  2012-08-04 23:19                 ` Eric W. Biederman
  0 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-04 21:44 UTC (permalink / raw)
  To: David Miller; +Cc: ali, ebiederm, ogerlitz, roland, netdev, sean.hefty, erezsh

On Sun, Aug 5, 2012 at 12:23 AM, Or Gerlitz <or.gerlitz@gmail.com> wrote:

> The idea in eIPoIB was to allow IP apps running on VMs under a Para-Virtual set of
> mind, e.g when the Linux PV networking stack comes into play, to use that stack w.o > modifying it. When looking on that, we thought so far so good, and went in the way
> posted here. If reusing your last sentence... this driver provides a way for apps to use
> the PV stack AND IB whilst using something familiar, like IPv4.

OK, when I said the PV stack, I meant the portion of the PV stack
which assumes Ethernet link layer, ofcourse... If someone uses routing
they don't need this driver.

Again, since the app only uses IP,  which is well defined, etc. the
work done by the eipoib driver, didn't seem as hackish messing, so I'm
again with that WW (Why/What) question from 5m ago.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-04 21:44               ` Or Gerlitz
@ 2012-08-04 23:19                 ` Eric W. Biederman
  0 siblings, 0 replies; 71+ messages in thread
From: Eric W. Biederman @ 2012-08-04 23:19 UTC (permalink / raw)
  To: Or Gerlitz, David Miller
  Cc: ali, ogerlitz, roland, netdev, sean.hefty, erezsh

Or Gerlitz <or.gerlitz@gmail.com> wrote:

>On Sun, Aug 5, 2012 at 12:23 AM, Or Gerlitz <or.gerlitz@gmail.com>
>wrote:
>
>> The idea in eIPoIB was to allow IP apps running on VMs under a
>Para-Virtual set of
>> mind, e.g when the Linux PV networking stack comes into play, to use
>that stack w.o > modifying it. When looking on that, we thought so far
>so good, and went in the way
>> posted here. If reusing your last sentence... this driver provides a
>way for apps to use
>> the PV stack AND IB whilst using something familiar, like IPv4.
>
>OK, when I said the PV stack, I meant the portion of the PV stack
>which assumes Ethernet link layer, ofcourse... If someone uses routing
>they don't need this driver.
>
>Again, since the app only uses IP,  which is well defined, etc. the
>work done by the eipoib driver, didn't seem as hackish messing, so I'm
>again with that WW (Why/What) question from 5m ago.


Something is missing from your sentence. 

What is the alternative that you view as worse?

Juast as a point of information.  In general when bridging is desired but not possible people deploy proxy arp.

I completely fail to see how having the VM output to a tun interface and then routing that, would not be supported by current solutions.  I believe that is where VM solutions all started networking wise.  Outputting to an interface is needed to support interfaces like 802.11 where bridging frequently does not work.

Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-02 17:15   ` Eric W. Biederman
  2012-08-03 20:31     ` Ali Ayoub
@ 2012-08-05 18:50     ` Michael S. Tsirkin
  2012-08-08  5:23       ` Or Gerlitz
  2012-09-03 20:53       ` Or Gerlitz
  1 sibling, 2 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-08-05 18:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Or Gerlitz, davem, roland, netdev, ali, sean.hefty, Erez Shitrit

On Thu, Aug 02, 2012 at 10:15:23AM -0700, Eric W. Biederman wrote:
> Or Gerlitz <ogerlitz@mellanox.com> writes:
> 
> > From: Erez Shitrit <erezsh@mellanox.co.il>
> >
> > The eipoib driver provides a standard Ethernet netdevice over
> > the InfiniBand IPoIB interface .
> >
> > Some services can run only on top of Ethernet L2 interfaces, and cannot be
> > bound to an IPoIB interface. With this new driver, these services can run
> > seamlessly.
> 
> Do I read this code correctly that what you are doing is not tunneling
> ethernet over IB but instead you are removing an ethernet header and
> replacing it with an IB header?
> 
> Do I also read this code correctly if you can't find your destination
> mac address in your ""neighbor table"" you do a normal IPoIB arp
> for the infiniband GUID?
> 
> Do I read this right that if presented with a non-IPv4 or ARP packet
> this code will do something undefined and unpredictable?
> 
> Maybe this makes some sense but just skimming it looks like you
> are trying to force a square peg into a round hole resulting in
> some weird code and some very weird maintainability issues.
> 
> I am honestly surprised at this approach.  I would think it would be
> faster and simpler to run an IB queue pair directly to the hypervisor or
> possibly even the guest operating system bypassing the kernel and doing
> all of this translation in userspace.
> 
> Eric

I'm on vacation and I have not looked at the patches, at Erez' request,
just reacting to the presentation and the discussion.

Bypassing the kernel has its own set of issues, not the
least of which is the need to lock all of guest memory which breaks
overcommit. Running an IB queue pair directly to the hypervisor
will also break live migration.

Another problem with exposing IB to guests has to do with the fact that
IB addresses such as combinations of LIDs, GIDs and QPNs to best of my
knowledge do not support soft hardware address setting, which interferes
with live migration.

So it seems that a sane solution would involve an extra level of
indirection, with guest addresses being translated to host IB addresses.

As long as you do this, maybe using an ethernet frame format makes
sense.

So far the things that make sense. Here are some that don't, to me:

- Is a pdf presentation all you have in terms of documentation?
  We are talking communication protocols here - I would expect a
  proper spec, and some effort to standardize, otherwise where's the
  guarantee it won't change in an incompatible way?
  Other things that I would expect to be addressed in such a spec is
  interaction with other IPoIB features, such as connected
  mode, checksum offloading etc, and IB features such as multipath etc.

- The way you encode LID/QPN in the MAC seems questionable. IIRC there's
  more to IB addressing than just the LID.  Since everyone on the subnet
  need access to this translation, I think it makes sense to store it in
  the SM. I think this would also obviate some IPv4 specific hacks
  in kernel.

- IGMP/MAC snooping in a driver is just too hairy.
  As you point out, bridge currently needs the uplink in promisc mode.
  I don't think a driver should work around that limitation.
  For some setups, it might be interesting to remove the
  promisc mode requirement, failing that,
  I think you could use macvtap passthrough.

- Currently migration works without host kernel help, would be
  preferable to keep it that way.


Hope this helps,
MST

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-03 23:36           ` David Miller
  2012-08-04 21:23             ` Or Gerlitz
@ 2012-08-07  0:14             ` Ali Ayoub
  2012-08-07  0:44               ` Eric W. Biederman
  1 sibling, 1 reply; 71+ messages in thread
From: Ali Ayoub @ 2012-08-07  0:14 UTC (permalink / raw)
  To: David Miller
  Cc: ebiederm, ogerlitz, roland, netdev, sean.hefty, erezsh, dledford

On 8/3/2012 4:36 PM, David Miller wrote:
> From: Ali Ayoub <ali@mellanox.com>
> Date: Fri, 03 Aug 2012 15:39:36 -0700
> 
>> Users would like to use sockets API from the VM without re-writing their
>> applications on top of IB verbs, this driver meant to allow such a user
>> to do so.
> 
> That's what IPoIB was for, the application writers who don't want to have
> to be knowledgable about IB verbs.
> 
> You're messing with the link layer here, and that's what is upsetting me.
> 
> It's a complete cop-out to changing the VM tools and emulators properly to
> handle a new link layer.
> 
> The applications writers already have a way to use IB whilst using
> something familiar, like IPv4, via IPoIB.  You're doing something
> completely different here, and it stinks.

Indeed, IPoIB driver meant to allow IP applications to run over
InfiniBand, but IPoIB cannot serve IPoE traffic.

The goal of eIPoIB driver to show to the user an Ethernet L2 netdev; by
translating IPoE packets to IPoIB. It keeps the same IPoIB wire
protocol, and exposes to the host an ethX interface, all IPoIB packets
are then translated to IPoIB.

Among other things, the main benefit we're targeting is to allow IPoE
traffic within the VM to go through the (Ethernet) vBridge down to the
eIPoIB PIF, and eventually to IPoIB and to the IB network.

In Para virtualized environment, the VM emulator sends/receives packets
with Ethernet header, and the vBridge also performs L2 switching based
on the Ethernet header, in addition to other tools that expect an
Ethernet link layer. We'd like to support them on top of IPoIB.

I see your point to change the the tools and VM emulations to handle
IPoIB link layer, but this involves not only changing many
components/layers such netfront/vbridge/vconfig/etc.. but also -unlike
Ethernet- having IPoIB-aware network device emulation in the VM domain
requires giving access from the VM to the IB Hardware, and this
association will break upon VM migration, as Tsirkin indicated in the
other Email.

Another issue with IPoIB-aware emulator, is that the IPoIB frame doesn't
include the destination link layer per data packet (RFC 4391), therefor
the neighbor address needs to be passed from the VM domain down to the
ipoib-aware vBridge.

I don't see in other alternatives a solution for the problem we're
trying to solve. If there are changes/suggestions to improve eIPoIB
netdev driver to avoid "messing with the link layer" and make it
acceptable, we can discuss and apply them.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-07  0:14             ` Ali Ayoub
@ 2012-08-07  0:44               ` Eric W. Biederman
  2012-08-07  1:21                 ` Re[2]: " Naoto MATSUMOTO
                                   ` (3 more replies)
  0 siblings, 4 replies; 71+ messages in thread
From: Eric W. Biederman @ 2012-08-07  0:44 UTC (permalink / raw)
  To: Ali Ayoub
  Cc: David Miller, ogerlitz, roland, netdev, sean.hefty, erezsh, dledford

Ali Ayoub <ali@mellanox.com> writes:

> Among other things, the main benefit we're targeting is to allow IPoE
> traffic within the VM to go through the (Ethernet) vBridge down to the
> eIPoIB PIF, and eventually to IPoIB and to the IB network.

That works today without code changes.  It is called routing.

> In Para virtualized environment, the VM emulator sends/receives packets
> with Ethernet header, and the vBridge also performs L2 switching based
> on the Ethernet header, in addition to other tools that expect an
> Ethernet link layer. We'd like to support them on top of IPoIB.

See routing.  The code is already done.

> I don't see in other alternatives a solution for the problem we're
> trying to solve. If there are changes/suggestions to improve eIPoIB
> netdev driver to avoid "messing with the link layer" and make it
> acceptable, we can discuss and apply them.

Nothing needs to be applied the code is done.  Routing from
IPoE to IPoIB works.

There is nothing in what anyone has posted as requirements that needs
work to implement.

I totally fail to see how getting packets of of the VM as ethernet
frames, and then  IP layer routing those packets over IP is not an
option.  What requirement am I missing.

All VMs should suport that mode of operation, and certainly the kernel
does.

Implementations involving bridges like macvlan and macvtap are
performance optimizations, and the optimizations don't even apply in
areas like 802.11, where only one mac address is supported per adapter.

Bridging can ocassionally also be an administrative simplification as
well, but you should be able to achieve the a similar simplification
with a dhcprelay and proxy arp.

Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re[2]: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-07  0:44               ` Eric W. Biederman
@ 2012-08-07  1:21                 ` Naoto MATSUMOTO
  2012-08-15  9:10                   ` Re[3]: " Naoto MATSUMOTO
  2012-08-07  3:33                 ` Eric W. Biederman
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 71+ messages in thread
From: Naoto MATSUMOTO @ 2012-08-07  1:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ali Ayoub, David Miller, ogerlitz, roland, netdev, sean.hefty,
	erezsh, dledford


Hi Eric, Ali

On Mon, 06 Aug 2012 17:44:06 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Implementations involving bridges like macvlan and macvtap are
> performance optimizations, and the optimizations don't even apply in
> areas like 802.11, where only one mac address is supported per adapter.

We had nearly equal situation perfomance result. share for you.

-----------------------------------------------------------------------
Ethernet over EoIB using gretap with "Raid STP" performance result.
http://twitpic.com/agd1pp

How to use EoIB with gretap. 
http://twitpic.com/agd4io


ifconfig ib0 10.1.1.1/24 up 
ip link add eoib0 type gretap remote 10.1.1.2 local 10.1.1.1
ip link add eoib1 type gretap remote 10.1.1.3 local 10.1.1.1
ip link add type veth
ifconfig eoib0 up up
ifconfig eoib1 up up
ifconfig veth0 up up
ifconfig veth1 up up

brctl addbr br0
brctl addif br0 eoib0
brctl addif br0 eoib1
bcrtl addif br0 veth1

rstpd
rstpctl rstp br0 on
rstpctl setbridgeprio br0 40960
rstpctl setportpathcost br0 vpn1 40000
rstpctl sethello br0 1
rstpctl setmaxage br0 6
rstpctl setfdelay br0 4
brctl stp br0 on
brctl sethello br0 1
brctl setfd br0 4
brctl setmaxage br0 6
-----------------------------------------------------------------------

regards,

-- 
SAKURA Internet Inc. / Senior Researcher
Naoto MATSUMOTO <n-matsumoto@sakura.ad.jp>
SAKURA Research Center <http://research.sakura.ad.jp/>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-07  0:44               ` Eric W. Biederman
  2012-08-07  1:21                 ` Re[2]: " Naoto MATSUMOTO
@ 2012-08-07  3:33                 ` Eric W. Biederman
  2012-08-08  6:04                   ` Or Gerlitz
  2012-08-07  3:37                 ` Joseph Glanville
  2012-08-08  7:32                 ` Or Gerlitz
  3 siblings, 1 reply; 71+ messages in thread
From: Eric W. Biederman @ 2012-08-07  3:33 UTC (permalink / raw)
  To: Ali Ayoub
  Cc: David Miller, ogerlitz, roland, netdev, sean.hefty, erezsh, dledford

ebiederm@xmission.com (Eric W. Biederman) writes:

> Ali Ayoub <ali@mellanox.com> writes:
>
>> Among other things, the main benefit we're targeting is to allow IPoE
>> traffic within the VM to go through the (Ethernet) vBridge down to the
>> eIPoIB PIF, and eventually to IPoIB and to the IB network.

Oh yes.  It just occurred to me there is huge problem with eIPoIB as
currently presented in these patches.  It breaks DHCPv4 the same way
it breaks ARP, but DHCPv4 is not fixed up.

I am stunned to to realize there has been so much push for a solution
that doesn't even support dhcp.

How can anyone possibly argue that eIPoIB works or is useful?  I just
can't imagine.

Totally stunned,
Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-07  0:44               ` Eric W. Biederman
  2012-08-07  1:21                 ` Re[2]: " Naoto MATSUMOTO
  2012-08-07  3:33                 ` Eric W. Biederman
@ 2012-08-07  3:37                 ` Joseph Glanville
  2012-08-08  7:32                 ` Or Gerlitz
  3 siblings, 0 replies; 71+ messages in thread
From: Joseph Glanville @ 2012-08-07  3:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ali Ayoub, David Miller, ogerlitz, roland, netdev, sean.hefty,
	erezsh, dledford

On 7 August 2012 10:44, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Ali Ayoub <ali@mellanox.com> writes:
>
>> Among other things, the main benefit we're targeting is to allow IPoE
>> traffic within the VM to go through the (Ethernet) vBridge down to the
>> eIPoIB PIF, and eventually to IPoIB and to the IB network.
>
> That works today without code changes.  It is called routing.
>
>> In Para virtualized environment, the VM emulator sends/receives packets
>> with Ethernet header, and the vBridge also performs L2 switching based
>> on the Ethernet header, in addition to other tools that expect an
>> Ethernet link layer. We'd like to support them on top of IPoIB.
>
> See routing.  The code is already done.
>
>> I don't see in other alternatives a solution for the problem we're
>> trying to solve. If there are changes/suggestions to improve eIPoIB
>> netdev driver to avoid "messing with the link layer" and make it
>> acceptable, we can discuss and apply them.
>
> Nothing needs to be applied the code is done.  Routing from
> IPoE to IPoIB works.
>
> There is nothing in what anyone has posted as requirements that needs
> work to implement.
>
> I totally fail to see how getting packets of of the VM as ethernet
> frames, and then  IP layer routing those packets over IP is not an
> option.  What requirement am I missing.
>
> All VMs should suport that mode of operation, and certainly the kernel
> does.
>
> Implementations involving bridges like macvlan and macvtap are
> performance optimizations, and the optimizations don't even apply in
> areas like 802.11, where only one mac address is supported per adapter.
>
> Bridging can ocassionally also be an administrative simplification as
> well, but you should be able to achieve the a similar simplification
> with a dhcprelay and proxy arp.
>
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hi,

While I agree this driver doesn't offer an adequate solution to the
problem I think the problem still very much exists.

Ultimately if you have an Infiniband fabric and you want to use it for
inter-guest communication across hosts you are left with 2 options,
tunneling and routing.
Routing can be ok sometimes but it's not a great solution for the vast
majority of people, it requires considerable edge intelligence in the
hypervisor and ISIS routing setups.
The biggest issue is that it's just not L2, guest applications on
guest operating systems expect to have access to fully fledged
Ethernet devices.
This is especially an issue for hosting providers etc where they have
little to no control over what applications their customers want to
use.

Tunnelling solves some of these problems, generally done over L3 and
using existing IP fabric to transport Ethernet L2 frames to the
destination hypervisors.
Better because now you have true L2 encapsulation but to be honest
it's slow and chews alot of cycles and generally has poor PPS
performance.

IMO this driver doesn't make sense compared to a true Ethernet over IB
encapsulation, which isn't actually all that much extra work. (If I am
right you already have a driver that does just this)
The point about retaining compatibility with existing IPoIB boggles my
mind a little.
It means little if no benefit is really added because only IP and ARP
will work.. no other L2 protocol will work correctly as others have
noted.

With full encapsulation you can make use of all the existing
infrastructure, linux bridge, OpenvSwitch, ebtables/netfilter etc.

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-05 18:50     ` Michael S. Tsirkin
@ 2012-08-08  5:23       ` Or Gerlitz
  2012-08-12 10:22         ` Michael S. Tsirkin
                           ` (2 more replies)
  2012-09-03 20:53       ` Or Gerlitz
  1 sibling, 3 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-08  5:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, Or Gerlitz, davem, roland, netdev, ali,
	sean.hefty, Erez Shitrit, Doug Ledford

On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:

[...]
> So it seems that a sane solution would involve an extra level of
> indirection, with guest addresses being translated to host IB addresses.
> As long as you do this, maybe using an ethernet frame format makes sense.
[...]

Yep, that's among the points we're trying to make, the way you've put
it makes it clearer.

> So far the things that make sense. Here are some that don't, to me:

> - Is a pdf presentation all you have in terms of documentation?
>   We are talking communication protocols here - I would expect a
>   proper spec, and some effort to standardize, otherwise where's the
>   guarantee it won't change in an incompatible way?

To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
see a need
for any spec change / standardization effort. This might go to the 1st
point you've
brought... improve the documentation, will do that. The pdf you looked
at was presented
in a conference.

>   Other things that I would expect to be addressed in such a spec is
>   interaction with other IPoIB features, such as connected
>   mode, checksum offloading etc, and IB features such as multipath etc.

For the eipoib interface, it doesn't really matters if the underlyind
ipoib clones used by it (we call them VIFs) use connected or datagram
mode, what does matter is the MTU and offload features supported by
these VIFs, for which the eipoib interface will have the min among all
these VIFs. Since for a given eipoib nic, all its VIFs must originated
from the same IPoIB PIF (e.g ib0) its easy admin job to make sure they
all have the same mtu / features which are needed for that eipoib nic,
e.g by using the same mode (connected/datagram for all of them), hope
this is clear.


> - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
>   more to IB addressing than just the LID.  Since everyone on the subnet
>   need access to this translation, I think it makes sense to store it in
>   the SM. I think this would also obviate some IPv4 specific hacks in kernel.

The idead beyond the encoding was uniqueness, LID/QPN is unique per IB
HCA end-node. I wasn't sure to understand the comment re the IPv4 hacks.

> - IGMP/MAC snooping in a driver is just too hairy.

mmm, any rough idea/direction how to do that otherwise?

>   As you point out, bridge currently needs the uplink in promisc mode.
>   I don't think a driver should work around that limitation.
>   For some setups, it might be interesting to remove the
>   promisc mode requirement, failing that,
>   I think you could use macvtap passthrough.

That's in the plans, the current code doesn't assume that the eipoib
has bridge on top, for VM networking it works with bridge + tap,
bridge + macvtap, but it would easily work with passthrough when we
allow to create multiple eipoib interfaces on the same ipoib PIF (e.g
today for the ib0 PIF we create eipoib eth0, and then two VIFs ib0.1
and ib0.2 that are enslaved by eth0, but next we will create eth1 and
eth2 which will use ib0.1 and ib0.2
respectively.

> - Currently migration works without host kernel help, would be
>   preferable to keep it that way.

OK

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-07  3:33                 ` Eric W. Biederman
@ 2012-08-08  6:04                   ` Or Gerlitz
  2012-08-08  8:36                     ` Eric W. Biederman
  0 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-08  6:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ali Ayoub, David Miller, ogerlitz, roland, netdev, sean.hefty,
	erezsh, dledford

Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Ali Ayoub <ali@mellanox.com> writes:

>> Among other things, the main benefit we're targeting is to allow IPoE
>> traffic within the VM to go through the (Ethernet) vBridge down to the
>> eIPoIB PIF, and eventually to IPoIB and to the IB network.

> Oh yes.  It just occurred to me there is huge problem with eIPoIB as
> currently presented in these patches.  It breaks DHCPv4 the same way
> it breaks ARP, but DHCPv4 is not fixed up.

To put things in place, DHCPv4 is supported with eIPoIB, the DHCP
UDP/IP payload  isn't touched, only need to set the BOOTP broadcast
flag in the dhcp server config file.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-07  0:44               ` Eric W. Biederman
                                   ` (2 preceding siblings ...)
  2012-08-07  3:37                 ` Joseph Glanville
@ 2012-08-08  7:32                 ` Or Gerlitz
  2012-08-08  9:17                   ` Eric W. Biederman
  3 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-08  7:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ali Ayoub, David Miller, ogerlitz, roland, netdev, sean.hefty,
	erezsh, dledford, Michael S. Tsirkin

Eric W. Biederman <ebiederm@xmission.com> wrote:
> Ali Ayoub <ali@mellanox.com> writes:
[...]
>> I don't see in other alternatives a solution for the problem we're
>> trying to solve. If there are changes/suggestions to improve eIPoIB
>> netdev driver to avoid "messing with the link layer" and make it
>> acceptable, we can discuss and apply them.

> Nothing needs to be applied the code is done.  Routing from
> IPoE to IPoIB works. There is nothing in what anyone has posted as requirements
>  that needs work to implement.

> I totally fail to see how getting packets of of the VM as ethernet
> frames, and then  IP layer routing those packets over IP is not an
> option.  What requirement am I missing.


As you've indicated routing w/w.o using proxy-arp is an option, however,

> All VMs should suport that mode of operation, and certainly the kernel does.
> Implementations involving bridges like macvlan and macvtap are
> performance optimizations, and the optimizations don't even apply in
> areas like 802.11, where only one mac address is supported per adapter.
> Bridging can ocassionally also be an administrative simplification as
> well, but you should be able to achieve the a similar simplification
> with a dhcprelay and proxy arp.

as you wrote here, when performance and ease-of-use is under the spot,
VM deployments tend to not to use routing.

This is b/c it involves more over-head on the packet forwarding, and
more administration work, for example, for setting routing rules that
involve the VM IP address, something which AFAIK the hypervisor have
no clue on, also its unclear to me if/how live migration can work in
such setting.

>From this exact reason, there's a bunch of use-cases by tools and
cloud stacks (such as open stack, ovirt, more) which do use bridged
mode and the rest of the Ethernet envelope, such as using virtual L2
vlan domains, ebtables based rules, etc etc. Where they and are not
application to ipoib, but are working file ith eipoib.

You mentioned that bridging mode doesn't apply to environment such as
802.11, and hence routing mode is used, we are trying to make a point
here that bridging mode applies to ipoib with the approach suggested
by eipoib.

Also, if we extend the discussion a bit, there are two more aspects to throw in:

The first is the performance thing we have already started to mention
-- specifically, the approach for RX zero copy (into the VM buffer),
use designs such as vhost + macvtap NIC in passthrough mode which is
likey to be set over a per VM hypervisor NIC, e.g such as the ones
provided by VMDQ patches John Fastabend started to post (see
http://marc.info/?l=linux-netdev&m=134264998405581&w=2) -- the ib0.N
clone child are IPoIB VMDQ NICs if you like, and setting an eipoib NIC
on top of each they can be plugged to that design.

The 2nd aspect, is NON VM environments where a NIC with Ethernet look
and feel is required for IP traffic, but this have to live within an
echo-system that fully uses IPoIB.
In other words, a use case where IPoIB has to be below the cover for
set of some specific apps, or nodes but do IP interaction with other
apps/nodes and gateways who use IPoIB, the eIPoIB driver provides that
functionality.

So, to sum up, routing / proxy-arp seems to be off where we are targeting.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-08  6:04                   ` Or Gerlitz
@ 2012-08-08  8:36                     ` Eric W. Biederman
  2012-08-09  4:06                       ` Or Gerlitz
  0 siblings, 1 reply; 71+ messages in thread
From: Eric W. Biederman @ 2012-08-08  8:36 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Ali Ayoub, David Miller, ogerlitz, roland, netdev, sean.hefty,
	erezsh, dledford

Or Gerlitz <or.gerlitz@gmail.com> writes:

> Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> Ali Ayoub <ali@mellanox.com> writes:
>
>>> Among other things, the main benefit we're targeting is to allow IPoE
>>> traffic within the VM to go through the (Ethernet) vBridge down to the
>>> eIPoIB PIF, and eventually to IPoIB and to the IB network.
>
>> Oh yes.  It just occurred to me there is huge problem with eIPoIB as
>> currently presented in these patches.  It breaks DHCPv4 the same way
>> it breaks ARP, but DHCPv4 is not fixed up.
>
> To put things in place, DHCPv4 is supported with eIPoIB, the DHCP
> UDP/IP payload  isn't touched, only need to set the BOOTP broadcast
> flag in the dhcp server config file.

Wrong.  DHCPv4 is broken over eIPoIB. 

Coming from ethernet
htype == 1 not 32 as required by RFC4390
hlen == 6 not 0 as required by RFC4390
The chaddr field is has 6 bytes of the ethernet mac address not the
required 16 bytes of 0.

The client-identifier field is optional over ethernet.

An ethernet DHCPv4 client simply does not generate a dhcp packet that
conforms to RFC4390.

Therefore DHCPv4 over eIPoIB is broken, and a dhcp server or relay
may reasonably look at the DHCP packet and drop it because it is
garbage.

You might find a forgiving dhcp server that doesn't drop insane packets
on the floor and tries to make things work.

I am sorry.  eIPoIB is broken as designed.  eIPoIB most assuredly is not
compatible with ethernet.  eIPoIB most definitely does not work even for
the general case of transporing IP traffic.  Claiming that eIPoIB is any
else is a lie.

Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-08  7:32                 ` Or Gerlitz
@ 2012-08-08  9:17                   ` Eric W. Biederman
  2012-08-09  4:34                     ` Or Gerlitz
  0 siblings, 1 reply; 71+ messages in thread
From: Eric W. Biederman @ 2012-08-08  9:17 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Ali Ayoub, David Miller, ogerlitz, roland, netdev, sean.hefty,
	erezsh, dledford, Michael S. Tsirkin

Or Gerlitz <or.gerlitz@gmail.com> writes:

> Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Ali Ayoub <ali@mellanox.com> writes:
> [...]
>>> I don't see in other alternatives a solution for the problem we're
>>> trying to solve. If there are changes/suggestions to improve eIPoIB
>>> netdev driver to avoid "messing with the link layer" and make it
>>> acceptable, we can discuss and apply them.
>
>> Nothing needs to be applied the code is done.  Routing from
>> IPoE to IPoIB works. There is nothing in what anyone has posted as requirements
>>  that needs work to implement.
>
>> I totally fail to see how getting packets of of the VM as ethernet
>> frames, and then  IP layer routing those packets over IP is not an
>> option.  What requirement am I missing.
>
>
> As you've indicated routing w/w.o using proxy-arp is an option, however,
>
>> All VMs should suport that mode of operation, and certainly the kernel does.
>> Implementations involving bridges like macvlan and macvtap are
>> performance optimizations, and the optimizations don't even apply in
>> areas like 802.11, where only one mac address is supported per adapter.
>> Bridging can ocassionally also be an administrative simplification as
>> well, but you should be able to achieve the a similar simplification
>> with a dhcprelay and proxy arp.
>
> as you wrote here, when performance and ease-of-use is under the spot,
> VM deployments tend to not to use routing.
>
> This is b/c it involves more over-head on the packet forwarding, and
> more administration work, for example, for setting routing rules that
> involve the VM IP address, something which AFAIK the hypervisor have
> no clue on, also its unclear to me if/how live migration can work in
> such setting.

All you need to make proxy-arp essentially pain free is a smart dhcp
relay, that sets up the routes.

> From this exact reason, there's a bunch of use-cases by tools and
> cloud stacks (such as open stack, ovirt, more) which do use bridged
> mode and the rest of the Ethernet envelope, such as using virtual L2
> vlan domains, ebtables based rules, etc etc. Where they and are not
> application to ipoib, but are working file ith eipoib.

Yes I am certain all of their IPv6 traffic works fine.

Regardless those are open source projects and can be modified to add
support to cleanly support inifinibnad.

> You mentioned that bridging mode doesn't apply to environment such as
> 802.11, and hence routing mode is used, we are trying to make a pointn
> here that bridging mode applies to ipoib with the approach suggested
> by eipoib.

You are completely failing.  Every time I look I see something about
eIPoIB that is even more broken.  Given that eIPoIB is a NAT
implementation that isn't really a surprise but still.

eIPoIB imposes enough overhead that I expect that routing is cheaper,
so your performance advantges go right out the window.

eIPoIB is seriously incompatible with ethernet breaking almost
everything and barely allowing IPv4 to work.

> Also, if we extend the discussion a bit, there are two more aspects to throw in:
>
> The first is the performance thing we have already started to mention
> -- specifically, the approach for RX zero copy (into the VM buffer),
> use designs such as vhost + macvtap NIC in passthrough mode which is
> likey to be set over a per VM hypervisor NIC, e.g such as the ones
> provided by VMDQ patches John Fastabend started to post (see
> http://marc.info/?l=linux-netdev&m=134264998405581&w=2) -- the ib0.N
> clone child are IPoIB VMDQ NICs if you like, and setting an eipoib NIC
> on top of each they can be plugged to that design.

If you care about performance link-layer NAT is not the way to go.
Teach the pieces you care about how to talk infiniband.

> The 2nd aspect, is NON VM environments where a NIC with Ethernet look
> and feel is required for IP traffic, but this have to live within an
> echo-system that fully uses IPoIB.
> In other words, a use case where IPoIB has to be below the cover for
> set of some specific apps, or nodes but do IP interaction with other
> apps/nodes and gateways who use IPoIB, the eIPoIB driver provides that
> functionality.

ip link add type dummy.

There now you have an interface with ethernet look and feel, and
routing can happily avoid it.

> So, to sum up, routing / proxy-arp seems to be off where we are
> targeting.

My condolences.

The existence of router / proxy-arp means that solutions do exist
(unlike your previous claim) you just don't like the idea of deploying
them.

Infiniband is standard enough you could quite easily implement virtual
infiniband bridging as an alternative to ethernet bridging.


At this stage of the game eIPoIB is interesting the same way a zombie is
interesting.  It is fascinating to see it still moving as chunks of
flesh fall to the floor removing any doubt that it is dead.

Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-08  8:36                     ` Eric W. Biederman
@ 2012-08-09  4:06                       ` Or Gerlitz
  2012-08-12 14:05                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-09  4:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ali Ayoub, David Miller, ogerlitz, roland, netdev, sean.hefty,
	erezsh, dledford, Michael S. Tsirkin

Eric W. Biederman <ebiederm@xmission.com> wrote:
> Or Gerlitz <or.gerlitz@gmail.com> writes:

>> To put things in place, DHCPv4 is supported with eIPoIB, the DHCP
>> UDP/IP payload  isn't touched, only need to set the BOOTP broadcast
>> flag in the dhcp server config file.

> Wrong.  DHCPv4 is broken over eIPoIB. Coming from ethernet
> htype == 1 not 32 as required by RFC4390
> hlen == 6 not 0 as required by RFC4390
> The chaddr field is has 6 bytes of the ethernet mac address not the
> required 16 bytes of 0. The client-identifier field is optional over ethernet.
> An ethernet DHCPv4 client simply does not generate a dhcp packet that
> conforms to RFC4390.
>
> Therefore DHCPv4 over eIPoIB is broken, and a dhcp server or relay
> may reasonably look at the DHCP packet and drop it because it is garbage.
>
> You might find a forgiving dhcp server that doesn't drop insane packets
> on the floor and tries to make things work.

Under the eIPoIB design, the VM DHCP interaction follows
Ethernet DHCP, and not the IPoIB DHCP (RFC 4390).

The DHCP server has no reason to drop such packets.

DHCP is a L7 (L5 to be precise) construct, I don't see
why that the fact IPoIB DHCP RFC exists, means/mandates
the DHCP server to care on the link layer type.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-08  9:17                   ` Eric W. Biederman
@ 2012-08-09  4:34                     ` Or Gerlitz
  2012-08-12 10:36                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-09  4:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ali Ayoub, David Miller, ogerlitz, roland, netdev, sean.hefty,
	erezsh, dledford, Michael S. Tsirkin

Eric W. Biederman <ebiederm@xmission.com> wrote:
> Or Gerlitz <or.gerlitz@gmail.com> writes:

>> as you wrote here, when performance and ease-of-use is under the spot,
>> VM deployments tend to not to use routing.
>> This is b/c it involves more over-head on the packet forwarding, and
>> more administration work, for example, for setting routing rules that
>> involve the VM IP address, something which AFAIK the hypervisor have
>> no clue on, also its unclear to me if/how live migration can work in such setting.

> All you need to make proxy-arp essentially pain free is a smart dhcp
> relay, that sets up the routes.

So dhcp relay would help to maybe avoid some of the pain, however,
can you elaborate if/how live migration is supported under this scheme?

>> From this exact reason, there's a bunch of use-cases by tools and
>> cloud stacks (such as open stack, ovirt, more) which do use bridged
>> mode and the rest of the Ethernet envelope, such as using virtual L2
>> vlan domains, ebtables based rules, etc etc. Where they and are not
>> application to ipoib, but are working file ith eipoib.

> Yes I am certain all of their IPv6 traffic works fine.

I'm not sure to follow this comment.

> Regardless those are open source projects and can be modified to add
> support to cleanly support inifiniband.

open source/code can be modified indeed, but since exposing IPoIB link
layer to tools/emulators and VMs doesn't really make sense (see below),
we brought that this approach as a way to go for allowing people to use
bridging mode when the fabric is IB.

>> You mentioned that bridging mode doesn't apply to environment such as
>> 802.11, and hence routing mode is used, we are trying to make a pointn
>> here that bridging mode applies to ipoib with the approach suggested by eipoib.

> You are completely failing.  Every time I look I see something about
> eIPoIB that is even more broken.  Given that eIPoIB is a NAT
> implementation that isn't really a surprise but still.

> eIPoIB imposes enough overhead that I expect that routing is cheaper,
> so your performance advantges go right out the window.
>
> eIPoIB is seriously incompatible with ethernet breaking almost
> everything and barely allowing IPv4 to work.

I don't agree on this incompatiblity statement, you had a claim
on DHCP and I addressed it, beyond that, you don't like the eIPoIB
basic idea/design but this can't base an incompatiblity argument.


>> Also, if we extend the discussion a bit, there are two more aspects to throw in:
>> The first is the performance thing we have already started to mention
>> -- specifically, the approach for RX zero copy (into the VM buffer),
>> use designs such as vhost + macvtap NIC in passthrough mode which is
>> likey to be set over a per VM hypervisor NIC, e.g such as the ones
>> provided by VMDQ patches John Fastabend started to post (see
>> http://marc.info/?l=linux-netdev&m=134264998405581&w=2) -- the ib0.N
>> clone child are IPoIB VMDQ NICs if you like, and setting an eipoib NIC
>> on top of each they can be plugged to that design.

> If you care about performance link-layer NAT is not the way to go.
> Teach the pieces you care about how to talk infiniband.
>
>> The 2nd aspect, is NON VM environments where a NIC with Ethernet look
>> and feel is required for IP traffic, but this have to live within an
>> echo-system that fully uses IPoIB.
>> In other words, a use case where IPoIB has to be below the cover for
>> set of some specific apps, or nodes but do IP interaction with other
>> apps/nodes and gateways who use IPoIB, the eIPoIB driver provides that
>> functionality.


> ip link add type dummy.

> There now you have an interface with ethernet look and feel, and
> routing can happily avoid it.

again, not sure to follow, you mean "routing can happily use it", correct? that
is do routing between the dummy interface to IPoIB interface?


>> So, to sum up, routing / proxy-arp seems to be off where we are targeting.

> My condolences.
> The existence of router / proxy-arp means that solutions do exist
> (unlike your previous claim) you just don't like the idea of deploying them.

Don't like them from set of arguments, which we are covering here, re
manageability it still needs to clarified if/how live migration work and what
does it mean to always mandate dhcp relay.

> Infiniband is standard enough you could quite easily implement virtual
> infiniband bridging as an alternative to ethernet bridging.

Not really, as Michael indicated in his response over this thread
http://marc.info/?l=linux-netdev&m=134419288218373&w=2
IPoIB link layer addresses use IB HW constructs for which soft
hardware address setting isn't supported, and this interferes
with live migration.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-08  5:23       ` Or Gerlitz
@ 2012-08-12 10:22         ` Michael S. Tsirkin
  2012-08-12 13:09           ` Or Gerlitz
  2012-08-12 13:15           ` Or Gerlitz
  2012-08-12 10:54         ` Michael S. Tsirkin
  2012-08-12 15:40         ` Eric W. Biederman
  2 siblings, 2 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-08-12 10:22 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, Or Gerlitz, davem, roland, netdev, ali,
	sean.hefty, Erez Shitrit, Doug Ledford

On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
> On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> [...]
> > So it seems that a sane solution would involve an extra level of
> > indirection, with guest addresses being translated to host IB addresses.
> > As long as you do this, maybe using an ethernet frame format makes sense.
> [...]
> 
> Yep, that's among the points we're trying to make, the way you've put
> it makes it clearer.
> 
> > So far the things that make sense. Here are some that don't, to me:
> 
> > - Is a pdf presentation all you have in terms of documentation?
> >   We are talking communication protocols here - I would expect a
> >   proper spec, and some effort to standardize, otherwise where's the
> >   guarantee it won't change in an incompatible way?
> 
> To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
> see a need
> for any spec change / standardization effort.

Yes, I am guessing this is the real reason you pack LID/QPN
in the MAC - to make it all local. But it's a hack really,
and if you start storing it all in the SM you will need
to document the format so others can inter-operate.

> This might go to the 1st
> point you've
> brought... improve the documentation, will do that. The pdf you looked
> at was presented
> in a conference.
> 
> >   Other things that I would expect to be addressed in such a spec is
> >   interaction with other IPoIB features, such as connected
> >   mode, checksum offloading etc, and IB features such as multipath etc.
> 
> For the eipoib interface, it doesn't really matters if the underlyind
> ipoib clones used by it (we call them VIFs) use connected or datagram
> mode, what does matter is the MTU and offload features supported by
> these VIFs, for which the eipoib interface will have the min among all
> these VIFs. Since for a given eipoib nic, all its VIFs must originated
> from the same IPoIB PIF (e.g ib0) its easy admin job to make sure they
> all have the same mtu / features which are needed for that eipoib nic,
> e.g by using the same mode (connected/datagram for all of them), hope
> this is clear.
> 

Just pointing out all this needs to be documented.

> > - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
> >   more to IB addressing than just the LID.  Since everyone on the subnet
> >   need access to this translation, I think it makes sense to store it in
> >   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
> 
> The idead beyond the encoding was uniqueness, LID/QPN is unique per IB
> HCA end-node.

But then it breaks with VM migration, IB failover, softmac setting in
guest, probably more?

> I wasn't sure to understand the comment re the IPv4 hacks.

This refers to the ARP hack that you use to fix
VM migration.

> > - IGMP/MAC snooping in a driver is just too hairy.
> 
> mmm, any rough idea/direction how to do that otherwise?

Sure, even two ways, ideally you'd do both :)
A. fix macvtap
1. Use netdev_for_each_mc_addr etc to get multicast addresses
2. teach macvtap to fill that in (it currently floods multicasts
   for guest to guest communication so we ned to fix it anyway)

B. fix bridge
   teach bridge to work for VMs without using promisc mode


> >   As you point out, bridge currently needs the uplink in promisc mode.
> >   I don't think a driver should work around that limitation.
> >   For some setups, it might be interesting to remove the
> >   promisc mode requirement, failing that,
> >   I think you could use macvtap passthrough.
> 
> That's in the plans, the current code doesn't assume that the eipoib
> has bridge on top, for VM networking it works with bridge + tap,
> bridge + macvtap, but it would easily work with passthrough when we
> allow to create multiple eipoib interfaces on the same ipoib PIF (e.g
> today for the ib0 PIF we create eipoib eth0, and then two VIFs ib0.1
> and ib0.2 that are enslaved by eth0, but next we will create eth1 and
> eth2 which will use ib0.1 and ib0.2
> respectively.

The whole promisc mode emulation is there for the bridge, no?
Since you don't support promisc, ideally we'd check a hardware
capability and fail gracefully, though naturally this is not top
priority.

> > - Currently migration works without host kernel help, would be
> >   preferable to keep it that way.
> 
> OK

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-09  4:34                     ` Or Gerlitz
@ 2012-08-12 10:36                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-08-12 10:36 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, Ali Ayoub, David Miller, ogerlitz, roland,
	netdev, sean.hefty, erezsh, dledford

On Thu, Aug 09, 2012 at 07:34:23AM +0300, Or Gerlitz wrote:
> > Infiniband is standard enough you could quite easily implement virtual
> > infiniband bridging as an alternative to ethernet bridging.
> 
> Not really, as Michael indicated in his response over this thread
> http://marc.info/?l=linux-netdev&m=134419288218373&w=2
> IPoIB link layer addresses use IB HW constructs for which soft
> hardware address setting isn't supported, and this interferes
> with live migration.
> 
> Or.

But I'd just like to point out your code does not support softmac
either.

-- 
MST

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-08  5:23       ` Or Gerlitz
  2012-08-12 10:22         ` Michael S. Tsirkin
@ 2012-08-12 10:54         ` Michael S. Tsirkin
  2012-08-12 13:19           ` Or Gerlitz
  2012-08-12 15:40         ` Eric W. Biederman
  2 siblings, 1 reply; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-08-12 10:54 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, Or Gerlitz, davem, roland, netdev, ali,
	sean.hefty, Erez Shitrit, Doug Ledford

On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
> > So far the things that make sense. Here are some that don't, to me:
> 
> > - Is a pdf presentation all you have in terms of documentation?
> >   We are talking communication protocols here - I would expect a
> >   proper spec, and some effort to standardize, otherwise where's the
> >   guarantee it won't change in an incompatible way?
> 
> To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
> see a need
> for any spec change / standardization effort.

Thinking about it some more, the solution is not really 100% local:
- ARP packets used at migration are visible on the network
- QPN/LID in MAC are guest visible, and could be exported on the network

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 10:22         ` Michael S. Tsirkin
@ 2012-08-12 13:09           ` Or Gerlitz
  2012-08-12 13:41             ` Michael S. Tsirkin
  2012-08-12 13:15           ` Or Gerlitz
  1 sibling, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-12 13:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

On 12/08/2012 13:22, Michael S. Tsirkin wrote:
> On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
>> >On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin<mst@redhat.com>  wrote:
>> >
>> >[...]
>>> > >So it seems that a sane solution would involve an extra level of
>>> > >indirection, with guest addresses being translated to host IB addresses.
>>> > >As long as you do this, maybe using an ethernet frame format makes sense.
>> >[...]
>> >
>> >Yep, that's among the points we're trying to make, the way you've put
>> >it makes it clearer.
>> >
>>> > >So far the things that make sense. Here are some that don't, to me:
>> >
>>> > >- Is a pdf presentation all you have in terms of documentation?
>>> > >   We are talking communication protocols here - I would expect a
>>> > >   proper spec, and some effort to standardize, otherwise where's the
>>> > >   guarantee it won't change in an incompatible way?
>> >
>> >To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
>> >see a need
>> >for any spec change / standardization effort.
> Yes, I am guessing this is the real reason you pack LID/QPN
> in the MAC - to make it all local. But it's a hack really,
> and if you start storing it all in the SM you will need
> to document the format so others can inter-operate.

I'd like to review the way we generate these MAC addresses, maybe it
can be done differently.


>
>
>> >This might go to the 1stpoint you've
>> >brought... improve the documentation, will do that. The pdf you looked
>> >at was presentedin a conference.
>>
>>> > >   Other things that I would expect to be addressed in such a spec is
>>> > >   interaction with other IPoIB features, such as connected
>>> > >   mode, checksum offloading etc, and IB features such as multipath etc.
>> >
>> >For the eipoib interface, it doesn't really matters if the underlyind
>> >ipoib clones used by it (we call them VIFs) use connected or datagram
>> >mode, what does matter is the MTU and offload features supported by
>> >these VIFs, for which the eipoib interface will have the min among all
>> >these VIFs. Since for a given eipoib nic, all its VIFs must originated
>> >from the same IPoIB PIF (e.g ib0) its easy admin job to make sure they
>> >all have the same mtu / features which are needed for that eipoib nic,
>> >e.g by using the same mode (connected/datagram for all of them), hope
>> >this is clear.
>> >
> Just pointing out all this needs to be documented.

OK, will do

>
>
>>> > >- The way you encode LID/QPN in the MAC seems questionable. IIRC there's
>>> > >   more to IB addressing than just the LID.  Since everyone on the subnet
>>> > >   need access to this translation, I think it makes sense to store it in
>>> > >   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
>> >
>> >The idead beyond the encoding was uniqueness, LID/QPN is unique per IB
>> >HCA end-node.
> But then it breaks with VM migration, IB failover, softmac setting in guest, probably more?

With the current design/code the remote mac of a VM changes, when that 
VM migrates or IB
LIDs are changed. As for softmac setting in the guest, we  don't send 
the guest MAC on the wire
anyway, since the Ethernet header is removed.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 10:22         ` Michael S. Tsirkin
  2012-08-12 13:09           ` Or Gerlitz
@ 2012-08-12 13:15           ` Or Gerlitz
  2012-08-12 13:55             ` Michael S. Tsirkin
  1 sibling, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-12 13:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Or Gerlitz, Eric W. Biederman, davem, roland, netdev, ali,
	sean.hefty, Erez Shitrit, Doug Ledford

On 12/08/2012 13:22, Michael S. Tsirkin wrote:
>>> >>- IGMP/MAC snooping in a driver is just too hairy.
>>
>> >mmm, any rough idea/direction how to do that otherwise?
> Sure, even two ways, ideally you'd do both:)
> A. fix macvtap
> 1. Use netdev_for_each_mc_addr etc to get multicast addresses
> 2. teach macvtap to fill that in (it currently floods multicasts
>     for guest to guest communication so we ned to fix it anyway)
>
> B. fix bridge
>     teach bridge to work for VMs without using promisc mode

I wasn't sure to fully follow... need some more bits of info, the 
macvtap fix
relates only to IGMP snooping, correct? as for the bridge fix, does this 
somehow
relates to ARP snooping we do in the driver? how?

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 10:54         ` Michael S. Tsirkin
@ 2012-08-12 13:19           ` Or Gerlitz
  0 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-12 13:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

On 12/08/2012 13:54, Michael S. Tsirkin wrote:
> On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
>>> So far the things that make sense. Here are some that don't, to me:
>>> - Is a pdf presentation all you have in terms of documentation?
>>>    We are talking communication protocols here - I would expect a
>>>    proper spec, and some effort to standardize, otherwise where's the
>>>    guarantee it won't change in an incompatible way?
>> To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
>> see a need for any spec change / standardization effort.
> Thinking about it some more, the solution is not really 100% local:
> - ARP packets used at migration are visible on the network

The ARP packets on the wire are IPoIB ARP packets -- indeed, when the 
guest moves, the IPoIB
link layer address which was associated with the VIF that used to serve 
this guest on the host
we migrated from, changes to be of a VIF on the host that we migrated to.


> - QPN/LID in MAC are guest visible, and could be exported on the network
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 13:09           ` Or Gerlitz
@ 2012-08-12 13:41             ` Michael S. Tsirkin
  0 siblings, 0 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-08-12 13:41 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

On Sun, Aug 12, 2012 at 04:09:35PM +0300, Or Gerlitz wrote:
> On 12/08/2012 13:22, Michael S. Tsirkin wrote:
> >On Wed, Aug 08, 2012 at 08:23:15AM +0300, Or Gerlitz wrote:
> >>>On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin<mst@redhat.com>  wrote:
> >>>
> >>>[...]
> >>>> >So it seems that a sane solution would involve an extra level of
> >>>> >indirection, with guest addresses being translated to host IB addresses.
> >>>> >As long as you do this, maybe using an ethernet frame format makes sense.
> >>>[...]
> >>>
> >>>Yep, that's among the points we're trying to make, the way you've put
> >>>it makes it clearer.
> >>>
> >>>> >So far the things that make sense. Here are some that don't, to me:
> >>>
> >>>> >- Is a pdf presentation all you have in terms of documentation?
> >>>> >   We are talking communication protocols here - I would expect a
> >>>> >   proper spec, and some effort to standardize, otherwise where's the
> >>>> >   guarantee it won't change in an incompatible way?
> >>>
> >>>To be precise, the solution uses 100% IPoIB wire-protocol, so we don't
> >>>see a need
> >>>for any spec change / standardization effort.
> >Yes, I am guessing this is the real reason you pack LID/QPN
> >in the MAC - to make it all local. But it's a hack really,
> >and if you start storing it all in the SM you will need
> >to document the format so others can inter-operate.
> 
> I'd like to review the way we generate these MAC addresses, maybe it
> can be done differently.
> 
> 
> >
> >
> >>>This might go to the 1stpoint you've
> >>>brought... improve the documentation, will do that. The pdf you looked
> >>>at was presentedin a conference.
> >>
> >>>> >   Other things that I would expect to be addressed in such a spec is
> >>>> >   interaction with other IPoIB features, such as connected
> >>>> >   mode, checksum offloading etc, and IB features such as multipath etc.
> >>>
> >>>For the eipoib interface, it doesn't really matters if the underlyind
> >>>ipoib clones used by it (we call them VIFs) use connected or datagram
> >>>mode, what does matter is the MTU and offload features supported by
> >>>these VIFs, for which the eipoib interface will have the min among all
> >>>these VIFs. Since for a given eipoib nic, all its VIFs must originated
> >>>from the same IPoIB PIF (e.g ib0) its easy admin job to make sure they
> >>>all have the same mtu / features which are needed for that eipoib nic,
> >>>e.g by using the same mode (connected/datagram for all of them), hope
> >>>this is clear.
> >>>
> >Just pointing out all this needs to be documented.
> 
> OK, will do
> 
> >
> >
> >>>> >- The way you encode LID/QPN in the MAC seems questionable. IIRC there's
> >>>> >   more to IB addressing than just the LID.  Since everyone on the subnet
> >>>> >   need access to this translation, I think it makes sense to store it in
> >>>> >   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
> >>>
> >>>The idead beyond the encoding was uniqueness, LID/QPN is unique per IB
> >>>HCA end-node.
> >But then it breaks with VM migration, IB failover, softmac setting in guest, probably more?
> 
> With the current design/code the remote mac of a VM changes, when
> that VM migrates or IB
> LIDs are changed.

Which is exactly the problem with IB and VM migration.

> As for softmac setting in the guest, we  don't
> send the guest MAC on the wire
> anyway, since the Ethernet header is removed.
> 
> Or.

Yes but you generate remote addresses automatically so admin can not
change the local address without risking conflicts.

-- 
MST

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 13:15           ` Or Gerlitz
@ 2012-08-12 13:55             ` Michael S. Tsirkin
  2012-08-12 14:13               ` Or Gerlitz
  2012-08-14  7:41               ` Or Gerlitz
  0 siblings, 2 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-08-12 13:55 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, Eric W. Biederman, davem, roland, netdev, ali,
	sean.hefty, Erez Shitrit, Doug Ledford

On Sun, Aug 12, 2012 at 04:15:52PM +0300, Or Gerlitz wrote:
> On 12/08/2012 13:22, Michael S. Tsirkin wrote:
> >>>>>- IGMP/MAC snooping in a driver is just too hairy.
> >>
> >>>mmm, any rough idea/direction how to do that otherwise?
> >Sure, even two ways, ideally you'd do both:)
> >A. fix macvtap
> >1. Use netdev_for_each_mc_addr etc to get multicast addresses
> >2. teach macvtap to fill that in (it currently floods multicasts
> >    for guest to guest communication so we ned to fix it anyway)
> >
> >B. fix bridge
> >    teach bridge to work for VMs without using promisc mode
> 
> I wasn't sure to fully follow... need some more bits of info, the
> macvtap fix
> relates only to IGMP snooping, correct? as for the bridge fix, does
> this somehow
> relates to ARP snooping we do in the driver? how?
> 
> Or.

I didn't realize you do ARP snooping. Why?

I know you mangle outgoing ARP packets, this will go
away if you maintain a mapping in SM accessible to all guests.

-- 
MST

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-09  4:06                       ` Or Gerlitz
@ 2012-08-12 14:05                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-08-12 14:05 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, Ali Ayoub, David Miller, ogerlitz, roland,
	netdev, sean.hefty, erezsh, dledford

On Thu, Aug 09, 2012 at 07:06:46AM +0300, Or Gerlitz wrote:
> Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Or Gerlitz <or.gerlitz@gmail.com> writes:
> 
> >> To put things in place, DHCPv4 is supported with eIPoIB, the DHCP
> >> UDP/IP payload  isn't touched, only need to set the BOOTP broadcast
> >> flag in the dhcp server config file.
> 
> > Wrong.  DHCPv4 is broken over eIPoIB. Coming from ethernet
> > htype == 1 not 32 as required by RFC4390
> > hlen == 6 not 0 as required by RFC4390
> > The chaddr field is has 6 bytes of the ethernet mac address not the
> > required 16 bytes of 0. The client-identifier field is optional over ethernet.
> > An ethernet DHCPv4 client simply does not generate a dhcp packet that
> > conforms to RFC4390.
> >
> > Therefore DHCPv4 over eIPoIB is broken, and a dhcp server or relay
> > may reasonably look at the DHCP packet and drop it because it is garbage.
> >
> > You might find a forgiving dhcp server that doesn't drop insane packets
> > on the floor and tries to make things work.
> 
> Under the eIPoIB design, the VM DHCP interaction follows
> Ethernet DHCP, and not the IPoIB DHCP (RFC 4390).
> 
> The DHCP server has no reason to drop such packets.
> 
> DHCP is a L7 (L5 to be precise) construct, I don't see
> why that the fact IPoIB DHCP RFC exists, means/mandates
> the DHCP server to care on the link layer type.
> 
> Or.

For example DHCP server could be configured with
HW address/IP address table.

-- 
MST

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 13:55             ` Michael S. Tsirkin
@ 2012-08-12 14:13               ` Or Gerlitz
  2012-08-12 20:54                 ` Michael S. Tsirkin
  2012-08-14  7:41               ` Or Gerlitz
  1 sibling, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-12 14:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

On 12/08/2012 16:55, Michael S. Tsirkin wrote:
> I didn't realize you do ARP snooping. Why? I know you mangle outgoing 
> ARP packets, 

Maybe I wasn't accurate/clear, we do mangle outgoing/incoming ARP 
packets, from/to Ethernet ARPs
to/from IPoIB ARPs.


> this will go away if you maintain a mapping in SM accessible to all 
> guests. 

guests don't interact with IB, I assume you referred to dom0 code, eIPoIB or
another driver in the host. But what mapping exactly? and remember that
this code (VM through eipoib) can talk to any IPoIB element on the 
fabric, native,
virtualized,  HW/SW gateways, etc etc.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-08  5:23       ` Or Gerlitz
  2012-08-12 10:22         ` Michael S. Tsirkin
  2012-08-12 10:54         ` Michael S. Tsirkin
@ 2012-08-12 15:40         ` Eric W. Biederman
  2012-08-13  8:33           ` Or Gerlitz
  2 siblings, 1 reply; 71+ messages in thread
From: Eric W. Biederman @ 2012-08-12 15:40 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Michael S. Tsirkin, Or Gerlitz, davem, roland, netdev, ali,
	sean.hefty, Erez Shitrit, Doug Ledford

Or Gerlitz <or.gerlitz@gmail.com> writes:

> On Sun, Aug 5, 2012 at 9:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> [...]
>> So it seems that a sane solution would involve an extra level of
>> indirection, with guest addresses being translated to host IB addresses.
>> As long as you do this, maybe using an ethernet frame format makes sense.
> [...]
>
> Yep, that's among the points we're trying to make, the way you've put
> it makes it clearer.
>
>> - IGMP/MAC snooping in a driver is just too hairy.
>
> mmm, any rough idea/direction how to do that otherwise?

Let me give you a non-hack recomendation.

- Give up on being wire compatible with IPoIB.

- Define and implement ethernet over inifiniband aka EoIB.

With EoIB:
- The SM would map ethernet address to inifiniband hardware addresses.
- You discover which multicast addresses are of interest from the
  IP layer above so no snooping is necessary.
- You could run queue pairs directly to hosts.

Shrug.  It is trivial and it will work.  It will probably run into the
same problems that have historically been a problem for using IPoIB
(lack of stateless offloads) but shrug that is mostly a NIC firmware
problem.  The switches will have no trouble and interoperability will
be assured.

If you want to map ethernet over infiniband please map ethernet over
infiniband.  Don't poorly NAT ethernet into infiniband.

Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 14:13               ` Or Gerlitz
@ 2012-08-12 20:54                 ` Michael S. Tsirkin
  2012-08-14  8:44                   ` Or Gerlitz
  2012-08-20 18:57                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-08-12 20:54 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

On Sun, Aug 12, 2012 at 05:13:43PM +0300, Or Gerlitz wrote:
> On 12/08/2012 16:55, Michael S. Tsirkin wrote:
> >I didn't realize you do ARP snooping. Why? I know you mangle
> >outgoing ARP packets,
> 
> Maybe I wasn't accurate/clear, we do mangle outgoing/incoming ARP
> packets, from/to Ethernet ARPs
> to/from IPoIB ARPs.
> 
> 
> >this will go away if you maintain a mapping in SM accessible to
> >all guests.
> 
> guests don't interact with IB, I assume you referred to dom0 code, eIPoIB or
> another driver in the host. But what mapping exactly?

Well we are getting into protocol design here.  So here's a sketch
showing how you could build a protocol that does work.  But note it is
not *exactly* IPoIB. It is I think close enough that you can
use existing NIC hardware/firmware, which is why it
differs slightly from what Eric described, and is
more complex.  But it still shares the same property of no hacks, no
packet snooping in driver, etc.


And if you want to go that route, you really should talk to some IB
protocol people to figure out what works, write a spec and try to
standardize. lkml/netdev is not the right place.

But since you asked, if I had to, I would probably try to
do it like this:

- Each device registers with the SA specifying the
  mac address (+ vlan?), SA stores the translation from that
  to IPoIB address.
- alternatively, SA admin configures the translation statically
- you get a packet with 6 byte mac address,
  query the SA for a mapping to IPoIB address, strip
  ethernet frame and send
- multicast GID addresses can be similar, filled either when registering
  for multicast or by SA admin

I think it's possible that you could also convert a mac address to
EUI-64 and prepend a prefix to get a legal GID. But maybe I'm missing
something.  This could be handy for multicast.


In both cases:
- SA could return GID that you then resolve to
  LID using another query, or it could return LID so you save a roundtrip
- results can be cached locally
- SA can send updates when translation changes to flush this cache


Now above means protocols such as ARP and DHCP use 48 bit addresses so
you can not mix this new protocol with IPoIB.  Maybe IPoIB could simply
ignore irrelevant packets, but it's best not to try, get a
different all-broadcast group and CM ID instead to avoid confusion.

One other interesting thing you can do is forward multicast
registration data from the router, translate to mgid
by the SA and do appropriate IB mcast registrations.

My memory of how IB works is a bit rusty so I probably made some
mistakes above but should roughly work, I think.

> and remember that
> this code (VM through eipoib) can talk to any IPoIB element on the
> fabric, native,
> virtualized,  HW/SW gateways, etc etc.
> 
> Or.

If you want this, then you really want a limited form of IPoIB bridging.

Alternatively, decide that what you do is not IPoIB, have a proper
protocol of your own.

-- 
MST

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 15:40         ` Eric W. Biederman
@ 2012-08-13  8:33           ` Or Gerlitz
  2012-08-13 16:08             ` Eric W. Biederman
  0 siblings, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-08-13  8:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael S. Tsirkin, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

On 12/08/2012 18:40, Eric W. Biederman wrote:
> Let me give you a non-hack recomendation.
>
> - Give up on being wire compatible with IPoIB.
>
> - Define and implement ethernet over inifiniband aka EoIB.
>
> With EoIB:
> - The SM would map ethernet address to inifiniband hardware addresses.
> - You discover which multicast addresses are of interest from the
>    IP layer above so no snooping is necessary.
> - You could run queue pairs directly to hosts.
>
> Shrug.  It is trivial and it will work.  It will probably run into the
> same problems that have historically been a problem for using IPoIB
> (lack of stateless offloads) but shrug that is mostly a NIC firmware
> problem.  The switches will have no trouble and interoperability will
> be assured.
>
> If you want to map ethernet over infiniband please map ethernet over
> infiniband.  Don't poorly NAT ethernet into infiniband.
>
>


EoIB is a valid suggestion and we will look into it as well, BUT:

Providing EoIB is a separate discussion, obviously defining and 
standardizing a new protocol takes what is takes (a lot of time, longish 
term effort), and will also take time to develop/debug/mature e.g as you 
mentioned, some of the features/offloads might require new NIC HW, etc 
-- compared to IPoIB which is here for many years

In practice there is already a huge install base for IPoIB software and 
hardware products, in different operating environments/OS. We can't just 
through away everything and tell people to replace it all with a new 
protocol, e.g. bridging devices, storage systems/appliances, VMware, 
Windows, .. systems in production environments --- so
the interoperability concern you've mentioned gonna hit very hard.

The eIPoIB driver comes to provide a way to work with IPoIB in 
virtualized environments, where still, the suggestions/concerns raised 
in this thread should be addressed.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-13  8:33           ` Or Gerlitz
@ 2012-08-13 16:08             ` Eric W. Biederman
  0 siblings, 0 replies; 71+ messages in thread
From: Eric W. Biederman @ 2012-08-13 16:08 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Michael S. Tsirkin, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

Or Gerlitz <ogerlitz@mellanox.com> wrote:

>On 12/08/2012 18:40, Eric W. Biederman wrote:
>> Let me give you a non-hack recomendation.
>>
>> - Give up on being wire compatible with IPoIB.
>>
>> - Define and implement ethernet over inifiniband aka EoIB.
>>
>> With EoIB:
>> - The SM would map ethernet address to inifiniband hardware
>addresses.
>> - You discover which multicast addresses are of interest from the
>>    IP layer above so no snooping is necessary.
>> - You could run queue pairs directly to hosts.
>>
>> Shrug.  It is trivial and it will work.  It will probably run into
>the
>> same problems that have historically been a problem for using IPoIB
>> (lack of stateless offloads) but shrug that is mostly a NIC firmware
>> problem.  The switches will have no trouble and interoperability will
>> be assured.
>>
>> If you want to map ethernet over infiniband please map ethernet over
>> infiniband.  Don't poorly NAT ethernet into infiniband.
>>
>>
>
>
>EoIB is a valid suggestion and we will look into it as well, BUT:
>
>Providing EoIB is a separate discussion, obviously defining and 
>standardizing a new protocol takes what is takes (a lot of time,
>longish 
>term effort), and will also take time to develop/debug/mature e.g as

If you follow Michael Tirskins suggestion and use the same wire encoding as IPoIB and infer the mac address from the lids and queue pair numbers as you are already doing with for eIPoIB, except for defining exactly how to get the subnet manager to store the mac address to lid/qpn mapping you are done.

If you don't involve a comitte and simply define a defacto standard it will take less effort than this conversation and less effort than implementing your eIPoIB driver.


>you 
>mentioned, some of the features/offloads might require new NIC HW, etc 
>-- compared to IPoIB which is here for many years

So deploy routing and proxy arp and you are done.

>In practice there is already a huge install base for IPoIB software and
>
>hardware products, in different operating environments/OS. We can't
>just 
>through away everything and tell people to replace it all with a new 
>protocol, e.g. bridging devices, storage systems/appliances, VMware, 
>Windows, .. systems in production environments --- so
>the interoperability concern you've mentioned gonna hit very hard.

There is no need to throw anything away.  Just put them on different IP subnets.

Shrug.

>The eIPoIB driver comes to provide a way to work with IPoIB in 
>virtualized environments, where still, the suggestions/concerns raised 
>in this thread should be addressed.

eIPoIB does not work.

I can't get an IP address with out a specially configured dhcp server, and special dhcp clients.

eIPoIB does not work with IPv6.

As David Miller already said this code has no chance of being merged.

Shrug.  I have been polite and pointed out implementation choices that actually work.  Solutions that are less effort and less code, and provide more interoperability.

If after patient explanation you can not appreciate why people consider eIPoIB to be totally unacceptable that is your problem.

Good luck in your future endeavours,

Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 13:55             ` Michael S. Tsirkin
  2012-08-12 14:13               ` Or Gerlitz
@ 2012-08-14  7:41               ` Or Gerlitz
  1 sibling, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-14  7:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

On 12/08/2012 13:22, Michael S. Tsirkin wrote:
>>>>>>> - IGMP/MAC snooping in a driver is just too hairy.
>>>>
>>>>> mmm, any rough idea/direction how to do that otherwise?
>>> Sure, even two ways, ideally you'd do both:)
>>> A. fix macvtap
>>> 1. Use netdev_for_each_mc_addr etc to get multicast addresses
>>> 2. teach macvtap to fill that in (it currently floods multicasts
>>>     for guest to guest communication so we ned to fix it anyway)
>>>
>>> B. fix bridge
>>>     teach bridge to work for VMs without using promisc mode
>>

OK, I think I'm with you now... you suggest to avoid our direction of 
implementing promiscuous multicast mode which is applied by today's 
bridge, macvtap and friends by fixing these elements to support non 
promisc multicast mode, yep, sure, sounds as win/win, which will 
eliminate the need to do IGMP snooping in the driver.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 20:54                 ` Michael S. Tsirkin
@ 2012-08-14  8:44                   ` Or Gerlitz
  2012-08-20 18:57                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-14  8:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

On 12/08/2012 23:54, Michael S. Tsirkin wrote:
> On Sun, Aug 12, 2012 at 05:13:43PM +0300, Or Gerlitz wrote:
>> On 12/08/2012 16:55, Michael S. Tsirkin wrote:
>>> I didn't realize you do ARP snooping. Why? I know you mangle
>>> outgoing ARP packets,
>>
>> Maybe I wasn't accurate/clear, we do mangle outgoing/incoming ARP
>> packets, from/to Ethernet ARPs to/from IPoIB ARPs.
>>
>>
>>> this will go away if you maintain a mapping in SM accessible to all guests.
>>
>> guests don't interact with IB, I assume you referred to dom0 code, eIPoIB or
>> another driver in the host. But what mapping exactly?
>
> Well we are getting into protocol design here.

wait... reading your responses again, I realized that we 1st and most 
have to (try and) agree on the
problem statement before going/jumping to solutions.

AFAIU your email/s you maybe think that we mandate the admin to set a 
specific MAC to the VM which is derived from the LID/QPN of the IPoIB 
VIF serving it, well this is wrong, we don't,  OTOH, indeed, the VM 
source mac isn't sent on the wire, since the Ethernet header is dropped, 
and on the receiving side is constructed from the LID/QPN
the IB packet arrived from, see next.

This reconstruction of what we call the REMAC (remote ethernet mac) is 
based in the current submission on
the LID/QPN, and as I said earlier on this thread, we are revisiting 
this approach -- where your idea below sounds
good: the eipoib driver can register with the SA an IB "service record" 
entry mapping from LID/QPN to the VM mac, when ever a VM is to be served 
by this eipoib instance, and remove the entry when the VM shouldn't be 
served any more. This will allow to preserve 1:1 the Ethernet MAC header 
sent by VMs on the receiving side.



> So here's a sketch showing how you could build a protocol that does work.  But note it is not *exactly* IPoIB.

HOWEVER, this doesn't touch the IPoIB wire protocol, and hence on the 
wire it IS exactly IPoIB. We only make use of your lovely suggestion to 
apply this SA assistance, so the change doesn't involve 
hardware/firmware nor the wire protocol.

Or.


> It is I think close enough that you can use existing NIC hardware/firmware, which is why it differs slightly from what Eric described, and is more complex.  But it still shares the same property of no hacks, no packet snooping in driver, etc.
>
>
> And if you want to go that route, you really should talk to some IB
> protocol people to figure out what works, write a spec and try to
> standardize. lkml/netdev is not the right place.
>
> But since you asked, if I had to, I would probably try to
> do it like this:
>
> - Each device registers with the SA specifying the
>    mac address (+ vlan?), SA stores the translation from that
>    to IPoIB address.
> - alternatively, SA admin configures the translation statically
> - you get a packet with 6 byte mac address,
>    query the SA for a mapping to IPoIB address, strip
>    ethernet frame and send
> - multicast GID addresses can be similar, filled either when registering
>    for multicast or by SA admin
>
> I think it's possible that you could also convert a mac address to
> EUI-64 and prepend a prefix to get a legal GID. But maybe I'm missing
> something.  This could be handy for multicast.
>
>
> In both cases:
> - SA could return GID that you then resolve to
>    LID using another query, or it could return LID so you save a roundtrip
> - results can be cached locally
> - SA can send updates when translation changes to flush this cache
>
>
> Now above means protocols such as ARP and DHCP use 48 bit addresses so
> you can not mix this new protocol with IPoIB.  Maybe IPoIB could simply
> ignore irrelevant packets, but it's best not to try, get a
> different all-broadcast group and CM ID instead to avoid confusion.
>
> One other interesting thing you can do is forward multicast
> registration data from the router, translate to mgid
> by the SA and do appropriate IB mcast registrations.
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re[3]: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-07  1:21                 ` Re[2]: " Naoto MATSUMOTO
@ 2012-08-15  9:10                   ` Naoto MATSUMOTO
  0 siblings, 0 replies; 71+ messages in thread
From: Naoto MATSUMOTO @ 2012-08-15  9:10 UTC (permalink / raw)
  To: Eric W. Biederman, Ali Ayoub, David Miller, ogerlitz, roland,
	netdev, sean.hefty, erezsh, dledford
  Cc: Naoto MATSUMOTO


Hi All.

Our lab's additional infomation(result) for all.

Ethernet over L2TPv3 over IPoIB Pseudo-wire Multiplexing (RFC4719)
using "ip l2tp" command on linux.   http://twitpic.com/ajn2nx

On Tue, 07 Aug 2012 10:21:33 +0900
Naoto MATSUMOTO <n-matsumoto@sakura.ad.jp> wrote:

> -----------------------------------------------------------------------
> Ethernet over EoIB using gretap with "Raid STP" performance result.
> http://twitpic.com/agd1pp
> 
> How to use EoIB with gretap. 
> http://twitpic.com/agd4io

regards,

-- 
SAKURA Internet Inc. / Senior Researcher
Naoto MATSUMOTO <n-matsumoto@sakura.ad.jp>
SAKURA Research Center <http://research.sakura.ad.jp/>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-12 20:54                 ` Michael S. Tsirkin
  2012-08-14  8:44                   ` Or Gerlitz
@ 2012-08-20 18:57                   ` Michael S. Tsirkin
  2012-08-23  6:45                     ` Or Gerlitz
  1 sibling, 1 reply; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-08-20 18:57 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, davem, roland, netdev, ali, sean.hefty,
	Erez Shitrit, Doug Ledford

On Sun, Aug 12, 2012 at 11:54:57PM +0300, Michael S. Tsirkin wrote:
> > and remember that
> > this code (VM through eipoib) can talk to any IPoIB element on the
> > fabric, native,
> > virtualized,  HW/SW gateways, etc etc.
> > 
> > Or.
> 
> If you want this, then you really want a limited form of IPoIB bridging.


And to clarify that statement, here is how I would make such
IPoIB "bridging" work:

Guest side:

- Implement virtio-ipoib. This would be a device like virtio-net,
  but instead of ethernet packets, it would pass packets
  that consist of:
	IPoIB destination address
	IP packet

- this is passed to/from host without modifications, possibly with addition
  of header such as virtio net header

- flags such as broadcast can also be added to header

- like virtio net get capabilities from host and expose
  as netdev capabilities

Host side:
- create macvtap -passthrough like device that can sit on top of an
  ipoib interface
- expose this device QPN and GID to guest as hardware address
- as we get packet forward it on UD QPN or CM as appropriate
  depending on size,checksum and admin preference
- expose capabilities such as TSO
- can expose capability such as max MTU to guest too

Above means hardware address changes with migration.
So we need to notify guest when this happens.

This can be addressed from host by notifying all
neighbours.

Alternatively guest can notify all neighbours.

Notification can be done by broadcast.
This second option seems preferable.

this ipoib-vtap can support two modes
	- bridge like mode:
	  guest to guest and guest to host packets
	  can be detected by macvtap and passed
	  to/from guest directly like macvlan bridge mode

	- vepa like mode
	  guest to guest and guest to host packets
	  are sent out and looped back by IB switch
	  like macvlan vepa mode

As compared to the custom protocol I sent, it has -
Advantages: interoperates cleanly with ipoib
Disadvantages: no support for legacy (ethernet-only) guest

-- 
MST

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-20 18:57                   ` Michael S. Tsirkin
@ 2012-08-23  6:45                     ` Or Gerlitz
  0 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-08-23  6:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, davem, roland, netdev, Ali Ayoub, sean.hefty,
	Erez Shitrit, Doug Ledford, Shlomo Pongratz

On 20/08/2012 21:57, Michael S. Tsirkin wrote:
> On Sun, Aug 12, 2012 at 11:54:57PM +0300, Michael S. Tsirkin wrote:
>
>> If you want this, then you really want a limited form of IPoIB bridging.
>
> And to clarify that statement, here is how I would make such IPoIB "bridging" work:
>
> Guest side:
>
> - Implement virtio-ipoib. This would be a device like virtio-net,
>    but instead of ethernet packets, it would pass packets
>    that consist of:
> 	IPoIB destination address
> 	IP packet
>
> - this is passed to/from host without modifications, possibly with addition
>    of header such as virtio net header
>
> - flags such as broadcast can also be added to header
>
> - like virtio net get capabilities from host and expose
>    as netdev capabilities
>
> Host side:
> - create macvtap -passthrough like device that can sit on top of an
>    ipoib interface
> - expose this device QPN and GID to guest as hardware address
> - as we get packet forward it on UD QPN or CM as appropriate
>    depending on size,checksum and admin preference
> - expose capabilities such as TSO
> - can expose capability such as max MTU to guest too
>
> Above means hardware address changes with migration.
> So we need to notify guest when this happens.
>
> This can be addressed from host by notifying all neighbours.
>
> Alternatively guest can notify all neighbours.
>
> Notification can be done by broadcast.
> This second option seems preferable.
>
> this ipoib-vtap can support two modes
> 	- bridge like mode:
> 	  guest to guest and guest to host packets
> 	  can be detected by macvtap and passed
> 	  to/from guest directly like macvlan bridge mode
>
> 	- vepa like mode
> 	  guest to guest and guest to host packets
> 	  are sent out and looped back by IB switch
> 	  like macvlan vepa mode
>
> As compared to the custom protocol I sent, it has -
> Advantages: interoperates cleanly with ipoib
> Disadvantages: no support for legacy (ethernet-only) guest
>

Hi Michael,

As you mentioned, the approach doesn't address legacy guests, who either 
don't have the virtio
driver, or don't have a virtio driver patched to support virtio-ipoib  
--  which doesn't go inline
with a strong requirement I got.

Other than this giant obstacle, I liked the suggestion and it seems 
valid and viable -- BTW IB HW has
loopback capability, so the VM/VM packets wouldn't actually go to the IB 
switch, but remain within the
HCA.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-08-05 18:50     ` Michael S. Tsirkin
  2012-08-08  5:23       ` Or Gerlitz
@ 2012-09-03 20:53       ` Or Gerlitz
  2012-09-03 21:22         ` Michael S. Tsirkin
  1 sibling, 1 reply; 71+ messages in thread
From: Or Gerlitz @ 2012-09-03 20:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, Or Gerlitz, davem, roland, netdev, sean.hefty,
	Erez Shitrit, Ali Ayoub, Doug Ledford

Michael S. Tsirkin <mst@redhat.com> wrote:

> [...] so it seems that a sane solution would involve an extra level of
> indirection, with guest addresses being translated to host IB addresses.
> As long as you do this, maybe using an ethernet frame format makes sense.

> So far the things that make sense. Here are some that don't, to me:

> - Is a pdf presentation all you have in terms of documentation?
>   We are talking communication protocols here - I would expect a
>   proper spec, and some effort to standardize, otherwise where's the
>   guarantee it won't change in an incompatible way?
>   Other things that I would expect to be addressed in such a spec is
>   interaction with other IPoIB features, such as connected
>   mode, checksum offloading etc, and IB features such as multipath etc.
>
> - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
>   more to IB addressing than just the LID.  Since everyone on the subnet
>   need access to this translation, I think it makes sense to store it in
>   the SM. I think this would also obviate some IPv4 specific hacks in kernel.

> - IGMP/MAC snooping in a driver is just too hairy.
>   As you point out, bridge currently needs the uplink in promisc mode.
>   I don't think a driver should work around that limitation.
>   For some setups, it might be interesting to remove the promisc
>  mode requirement, failing that, I think you could use macvtap passthrough.
>
> - Currently migration works without host kernel help, would be
>   preferable to keep it that way.

Hi Michael,

If we rewind to this point, basically, you had few concerns

0. not enough documentation

1. the sender VM MAC isn't preserved when the packet is received

2. the IGMP snooping we planned to do within netdevice - isn't good practice

3. mangling of ARPs within netdevice - isn't good practice as well.

For 0,1,2 we have a way to address  (see below)

So we are remained with #3 - the ARPs -- thinking on this a little
further, FWIW there --are-- components in the kernel which
mangle/generate ARPs and are exposing netdevice, such as openvswitch,
anyway:

does it make sense to forward ARPs received into / sent over the
eIPoIB netdevice (e.g using some sort of rule) to some outer entity
such as user-space
daemon  for interception and later re-injection into eIPoIB?

Or.

Documentation we will fix,

Preserving remote VM mac at the receiver we have few directions for
solution, e.g either along your suggestion with SA records and/or with
using "alias GUIDs" (details TBD when the submission resumes).

Multicast we accept the direction you suggested - implement  support
for multicast non promiscuous in the elements "above" eIPoIB (bridge,
macvtap, etc).

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-09-03 20:53       ` Or Gerlitz
@ 2012-09-03 21:22         ` Michael S. Tsirkin
  2012-09-04 18:50           ` Or Gerlitz
  2012-09-04 18:57           ` Or Gerlitz
  0 siblings, 2 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-09-03 21:22 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, Or Gerlitz, davem, roland, netdev, sean.hefty,
	Erez Shitrit, Ali Ayoub, Doug Ledford

On Mon, Sep 03, 2012 at 11:53:56PM +0300, Or Gerlitz wrote:
> Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> > [...] so it seems that a sane solution would involve an extra level of
> > indirection, with guest addresses being translated to host IB addresses.
> > As long as you do this, maybe using an ethernet frame format makes sense.
> 
> > So far the things that make sense. Here are some that don't, to me:
> 
> > - Is a pdf presentation all you have in terms of documentation?
> >   We are talking communication protocols here - I would expect a
> >   proper spec, and some effort to standardize, otherwise where's the
> >   guarantee it won't change in an incompatible way?
> >   Other things that I would expect to be addressed in such a spec is
> >   interaction with other IPoIB features, such as connected
> >   mode, checksum offloading etc, and IB features such as multipath etc.
> >
> > - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
> >   more to IB addressing than just the LID.  Since everyone on the subnet
> >   need access to this translation, I think it makes sense to store it in
> >   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
> 
> > - IGMP/MAC snooping in a driver is just too hairy.
> >   As you point out, bridge currently needs the uplink in promisc mode.
> >   I don't think a driver should work around that limitation.
> >   For some setups, it might be interesting to remove the promisc
> >  mode requirement, failing that, I think you could use macvtap passthrough.
> >
> > - Currently migration works without host kernel help, would be
> >   preferable to keep it that way.
> 
> Hi Michael,
> 
> If we rewind to this point, basically, you had few concerns

I think some other people gave feedback too, you need to address it in
the patch (as opposed to by mail - even if it's in documentation or
comments) don't just focus on what I wrote.

> 
> 0. not enough documentation
> 
> 1. the sender VM MAC isn't preserved when the packet is received
> 
> 2. the IGMP snooping we planned to do within netdevice - isn't good practice
> 
> 3. mangling of ARPs within netdevice - isn't good practice as well.
> 
> For 0,1,2 we have a way to address  (see below)
> 
> So we are remained with #3 - the ARPs -- thinking on this a little
> further, FWIW there --are-- components in the kernel which
> mangle/generate ARPs and are exposing netdevice, such as openvswitch,
> anyway:
> 
> does it make sense to forward ARPs received into / sent over the
> eIPoIB netdevice (e.g using some sort of rule) to some outer entity
> such as user-space
> daemon  for interception and later re-injection into eIPoIB?
> 
> Or.

Well if this is all you want to do, you can bind a packet socket to the
interface, and drop them at the nic.  It is harder to do for incoming
ARP requests though.

I would do something else: send ARPs out to some defined IB address.
This could be local host or queries from some SA property.  Said remote
side could send you the responses in ethernet format so you do not need
to mangle responses at all.  Similarly for incoming ARP requests.

The rule to do this can also just redirect non IP packets -
this is IPoIB after all.

> Documentation we will fix,

And just to stress the point, document the limitations as well.

> Preserving remote VM mac at the receiver we have few directions for
> solution, e.g either along your suggestion with SA records and/or with
> using "alias GUIDs" (details TBD when the submission resumes).
> 
> Multicast we accept the direction you suggested - implement  support
> for multicast non promiscuous in the elements "above" eIPoIB (bridge,
> macvtap, etc).

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-09-03 21:22         ` Michael S. Tsirkin
@ 2012-09-04 18:50           ` Or Gerlitz
  2012-09-04 19:31             ` Eric W. Biederman
  2012-09-04 21:21             ` Michael S. Tsirkin
  2012-09-04 18:57           ` Or Gerlitz
  1 sibling, 2 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-09-04 18:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, Or Gerlitz, davem, roland, netdev, sean.hefty,
	Erez Shitrit, Ali Ayoub, Doug Ledford

On Tue, Sep 4, 2012 at 12:22 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Sep 03, 2012 at 11:53:56PM +0300, Or Gerlitz wrote:

>> So we are remained with #3 - the ARPs -- thinking on this a little
>> further, FWIW there --are-- components in the kernel which
>> mangle/generate ARPs and are exposing netdevice, such as openvswitch, anyway:

>> does it make sense to forward ARPs received into / sent over the
>> eIPoIB netdevice (e.g using some sort of rule) to some outer entity
>> such as user-space daemon  for interception and later re-injection into eIPoIB?

> Well if this is all you want to do, you can bind a packet socket to the
> interface, and drop them at the nic.  It is harder to do for incoming
> ARP requests though.

> I would do something else: send ARPs out to some defined IB address.
> This could be local host or queries from some SA property.  Said remote
> side could send you the responses in ethernet format so you do not need
> to mangle responses at all.  Similarly for incoming ARP requests.

> The rule to do this can also just redirect non IP packets - this is IPoIB after all.

Thanks for the heads up on the possible implementation route, will
look into that.

>> Documentation we will fix,

> And just to stress the point, document the limitations as well.

sure, not that I see concrete limitations for the **user** at this point, but
if there are such, will put them clearly written.

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-09-03 21:22         ` Michael S. Tsirkin
  2012-09-04 18:50           ` Or Gerlitz
@ 2012-09-04 18:57           ` Or Gerlitz
  1 sibling, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-09-04 18:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eric W. Biederman, Or Gerlitz, davem, roland, netdev, sean.hefty,
	Erez Shitrit, Ali Ayoub, Doug Ledford

On Tue, Sep 4, 2012 at 12:22 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Sep 03, 2012 at 11:53:56PM +0300, Or Gerlitz wrote:
>> Michael S. Tsirkin <mst@redhat.com> wrote:
>> > [...] so it seems that a sane solution would involve an extra level of
>> > indirection, with guest addresses being translated to host IB addresses.
>> > As long as you do this, maybe using an ethernet frame format makes sense.
>>
>> > So far the things that make sense. Here are some that don't, to me:
>>
>> > - Is a pdf presentation all you have in terms of documentation?
>> >   We are talking communication protocols here - I would expect a
>> >   proper spec, and some effort to standardize, otherwise where's the
>> >   guarantee it won't change in an incompatible way?
>> >   Other things that I would expect to be addressed in such a spec is
>> >   interaction with other IPoIB features, such as connected
>> >   mode, checksum offloading etc, and IB features such as multipath etc.
>> >
>> > - The way you encode LID/QPN in the MAC seems questionable. IIRC there's
>> >   more to IB addressing than just the LID.  Since everyone on the subnet
>> >   need access to this translation, I think it makes sense to store it in
>> >   the SM. I think this would also obviate some IPv4 specific hacks in kernel.
>>
>> > - IGMP/MAC snooping in a driver is just too hairy.
>> >   As you point out, bridge currently needs the uplink in promisc mode.
>> >   I don't think a driver should work around that limitation.
>> >   For some setups, it might be interesting to remove the promisc
>> >  mode requirement, failing that, I think you could use macvtap passthrough.
>> >
>> > - Currently migration works without host kernel help, would be
>> >   preferable to keep it that way.

>> If we rewind to this point, basically, you had few concerns

> I think some other people gave feedback too, you need to address it in
> the patch (as opposed to by mail - even if it's in documentation or
> comments) don't just focus on what I wrote.

The other feedback was:

1. suggesting to introduce new link layer for the para-virtualized
network stack, a direction pointed by Dave and you, for which  I
responded that it doesn't address a hard requirement I got, which is
provide service  to an arbitrary VM which have any OS and a virtual
Ethernet NIC on, emulated or para-virtualized.

2. Suggestions to use solutions which involve routing and/or
proxt-arp, for which I responded that in most/practical cases this
will require for the host to know the VM IP, something which isn't
valid assumption in many cases AND that bunch of cloud stacks,
specifically the leading ones, don't even support that option, which
also violates a hard requirement we got, to support these stacks which
are in use by customers.

3. Suggestions to invent EoIB -- I said, OK but this is long termish
process that we're looking on, might be around at some future point,
but still we are required to support whole echo systems which use
IPoIB, and there's no point to "route" between EoIB segment to IPoIB
segments, its back to #2 which we didn't accept

4. Other feedback saying the eIPoIB driver is messy and "we don't like
it" -- hard to exactly address in code changes

----> all in all, we were suggested few directions which don't allow
us to address  the problem statement, so there's no way to change the
code so they are fullfilled, and some not concrete comments which are
also hard to address --> we took the route of design changes along
your **concrete** comments, doesn't it make sense?

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-09-04 18:50           ` Or Gerlitz
@ 2012-09-04 19:31             ` Eric W. Biederman
  2012-09-04 19:47               ` Or Gerlitz
  2012-09-04 21:21             ` Michael S. Tsirkin
  1 sibling, 1 reply; 71+ messages in thread
From: Eric W. Biederman @ 2012-09-04 19:31 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Michael S. Tsirkin, Or Gerlitz, davem, roland, netdev,
	sean.hefty, Erez Shitrit, Ali Ayoub, Doug Ledford

Or Gerlitz <or.gerlitz@gmail.com> writes:

> On Tue, Sep 4, 2012 at 12:22 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Mon, Sep 03, 2012 at 11:53:56PM +0300, Or Gerlitz wrote:
>
>>> So we are remained with #3 - the ARPs -- thinking on this a little
>>> further, FWIW there --are-- components in the kernel which
>>> mangle/generate ARPs and are exposing netdevice, such as openvswitch, anyway:
>
>>> does it make sense to forward ARPs received into / sent over the
>>> eIPoIB netdevice (e.g using some sort of rule) to some outer entity
>>> such as user-space daemon  for interception and later re-injection into eIPoIB?
>
>> Well if this is all you want to do, you can bind a packet socket to the
>> interface, and drop them at the nic.  It is harder to do for incoming
>> ARP requests though.
>
>> I would do something else: send ARPs out to some defined IB address.
>> This could be local host or queries from some SA property.  Said remote
>> side could send you the responses in ethernet format so you do not need
>> to mangle responses at all.  Similarly for incoming ARP requests.
>
>> The rule to do this can also just redirect non IP packets - this is IPoIB after all.
>
> Thanks for the heads up on the possible implementation route, will
> look into that.
>
>>> Documentation we will fix,
>
>> And just to stress the point, document the limitations as well.
>
> sure, not that I see concrete limitations for the **user** at this point, but
> if there are such, will put them clearly written.

So far you are still playing with a design that is strongly NOT
ethernet.  So calling it eIPoIB will continue to be a LIE.

You are still playing with an implementation that doesn't even dream
of supporting IPv6 which makes it so far from ethernet I can't imagine
anyone taking your code seriously.

All ethernet protocols not working except IPv4 is a huge concrete
limitation.

Any implementation that breaks a naive ARP implementation also breaks
IPv6.  Not to mention everything else that runs over ethernet.

If you are clever you can use the current IPoIB hardware accelleration
but you need to do something different so that you can either encode
or imply the MAC address so you won't have to munge ethernet protocols.

Just for fun you might want to consider what it takes to support 2 VMs
in the same VLAN that share the same IP address (but different MAC
addresses) for failover purposes.

Eric

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-09-04 19:31             ` Eric W. Biederman
@ 2012-09-04 19:47               ` Or Gerlitz
  0 siblings, 0 replies; 71+ messages in thread
From: Or Gerlitz @ 2012-09-04 19:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael S. Tsirkin, Or Gerlitz, davem, roland, netdev,
	sean.hefty, Erez Shitrit, Ali Ayoub, Doug Ledford

On Tue, Sep 4, 2012 at 10:31 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Or Gerlitz <or.gerlitz@gmail.com> writes:
>> On Tue, Sep 4, 2012 at 12:22 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> Documentation we will fix,
>>> And just to stress the point, document the limitations as well.
>> sure, not that I see concrete limitations for the **user** at this point, but
>> if there are such, will put them clearly written.

> All ethernet protocols not working except IPv4 is a huge concrete limitation.

Oh, sure, that WILL be documented, currently eIPoIB can deliver only what
IPoIB can which is IPv4, ARP/RARP, IPv6+ND, for IPv6 see next

> So far you are still playing with a design that is strongly NOT
> ethernet.  So calling it eIPoIB will continue to be a LIE.

> You are still playing with an implementation that doesn't even dream
> of supporting IPv6 which makes it so far from ethernet I can't imagine
> anyone taking your code seriously.

This design can and will support IPv6, the IPv6 ND handling will follow the path
we are talking now for the IPv4 ARPs, e.g not within the driver, etc.
Could you be
more specific?

> Any implementation that breaks a naive ARP implementation also breaks
> IPv6.  Not to mention everything else that runs over ethernet.

not sure to follow on the naive impl. comment,  eIPoIB solution will
include kernel driver along with supporting user-space portion, this
is to follow a comment made by the community on mangling ARPs in
network driver.

> If you are clever you can use the current IPoIB hardware accelleration
> but you need to do something different so that you can either encode
> or imply the MAC address so you won't have to munge ethernet protocols.

also here not sure to follow, we have a new design under which the
original VM MAC is preserved on the RX side, we will not generate MAC
for Ethernet frames sent by VMs which we reconstruct on the RX side
any more.

> Just for fun you might want to consider what it takes to support 2 VMs
> in the same VLAN that share the same IP address (but different MAC
> addresses) for failover purposes.

will look on that

Or.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
  2012-09-04 18:50           ` Or Gerlitz
  2012-09-04 19:31             ` Eric W. Biederman
@ 2012-09-04 21:21             ` Michael S. Tsirkin
  1 sibling, 0 replies; 71+ messages in thread
From: Michael S. Tsirkin @ 2012-09-04 21:21 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, Or Gerlitz, davem, roland, netdev, sean.hefty,
	Erez Shitrit, Ali Ayoub, Doug Ledford

On Tue, Sep 04, 2012 at 09:50:09PM +0300, Or Gerlitz wrote:
> > And just to stress the point, document the limitations as well.
> 
> sure, not that I see concrete limitations for the **user** at this point, but
> if there are such, will put them clearly written.

Hmm, I'm afraid you mistook a short list of some major
bugs that jumped out at me for an exhaustive list.
This was not intended as such.

Here's how to find some of the limitations in your design
1. look through list archives. Some where pointed out to you
2. list everything ethernet does that you dont, or do differently
3. list everything ipoib does that you don't or do differently
4. list any extra setup work required on behalf of the user
5. check various overheads, compare with native ipoib and alternatives
   such as routing
6. if you still have an empty list of limitations and disadvantages,
   look again :)

The point is to have documentation that is useful both
for reviewers - so they can know which bugs are known
and which need to be reported; and for users -
who should not be expected to be familiar with
internals of your implementation.

-- 
MST

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2012-09-04 21:20 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-01 17:09 [PATCH V2 00/12] Add Ethernet IPoIB driver Or Gerlitz
2012-08-01 17:09 ` [PATCH V2 01/12] IB/ipoib: Add rtnl_link_ops support Or Gerlitz
2012-08-01 17:09 ` [PATCH V2 02/12] IB/ipoib: Add support for clones / multiple childs on the same partition Or Gerlitz
2012-08-01 17:09 ` [PATCH V2 03/12] include/linux: Add private flags for IPoIB interfaces Or Gerlitz
2012-08-01 17:09 ` [PATCH V2 04/12] IB/ipoib: Add support for acting as VIF Or Gerlitz
2012-08-01 17:09 ` [PATCH V2 05/12] net: Add ndo_set_vif_param operation to serve eIPoIB VIFs Or Gerlitz
2012-08-02  0:17   ` Ben Hutchings
2012-08-02  8:25     ` Erez Shitrit
2012-08-01 17:09 ` [PATCH V2 06/12] net/core: Add rtnetlink support to vif parameters Or Gerlitz
2012-08-02  0:20   ` Ben Hutchings
2012-08-02 15:29     ` Erez Shitrit
2012-08-01 17:09 ` [PATCH V2 07/12] net/eipoib: Add private header file Or Gerlitz
2012-08-01 17:09 ` [PATCH V2 08/12] net/eipoib: Add ethtool file support Or Gerlitz
2012-08-02  0:22   ` Ben Hutchings
2012-08-02  8:35     ` Erez Shitrit
2012-08-02 15:42       ` Ben Hutchings
2012-08-01 17:09 ` [PATCH V2 09/12] net/eipoib: Add main driver functionality Or Gerlitz
2012-08-02 17:15   ` Eric W. Biederman
2012-08-03 20:31     ` Ali Ayoub
2012-08-03 21:33       ` David Miller
2012-08-03 22:39         ` Ali Ayoub
2012-08-03 23:36           ` David Miller
2012-08-04 21:23             ` Or Gerlitz
2012-08-04 21:44               ` Or Gerlitz
2012-08-04 23:19                 ` Eric W. Biederman
2012-08-07  0:14             ` Ali Ayoub
2012-08-07  0:44               ` Eric W. Biederman
2012-08-07  1:21                 ` Re[2]: " Naoto MATSUMOTO
2012-08-15  9:10                   ` Re[3]: " Naoto MATSUMOTO
2012-08-07  3:33                 ` Eric W. Biederman
2012-08-08  6:04                   ` Or Gerlitz
2012-08-08  8:36                     ` Eric W. Biederman
2012-08-09  4:06                       ` Or Gerlitz
2012-08-12 14:05                         ` Michael S. Tsirkin
2012-08-07  3:37                 ` Joseph Glanville
2012-08-08  7:32                 ` Or Gerlitz
2012-08-08  9:17                   ` Eric W. Biederman
2012-08-09  4:34                     ` Or Gerlitz
2012-08-12 10:36                       ` Michael S. Tsirkin
2012-08-04  0:02           ` Ali Ayoub
2012-08-04  0:05             ` David Miller
2012-08-04  1:34             ` Eric W. Biederman
2012-08-04 21:33               ` Or Gerlitz
2012-08-05 18:50     ` Michael S. Tsirkin
2012-08-08  5:23       ` Or Gerlitz
2012-08-12 10:22         ` Michael S. Tsirkin
2012-08-12 13:09           ` Or Gerlitz
2012-08-12 13:41             ` Michael S. Tsirkin
2012-08-12 13:15           ` Or Gerlitz
2012-08-12 13:55             ` Michael S. Tsirkin
2012-08-12 14:13               ` Or Gerlitz
2012-08-12 20:54                 ` Michael S. Tsirkin
2012-08-14  8:44                   ` Or Gerlitz
2012-08-20 18:57                   ` Michael S. Tsirkin
2012-08-23  6:45                     ` Or Gerlitz
2012-08-14  7:41               ` Or Gerlitz
2012-08-12 10:54         ` Michael S. Tsirkin
2012-08-12 13:19           ` Or Gerlitz
2012-08-12 15:40         ` Eric W. Biederman
2012-08-13  8:33           ` Or Gerlitz
2012-08-13 16:08             ` Eric W. Biederman
2012-09-03 20:53       ` Or Gerlitz
2012-09-03 21:22         ` Michael S. Tsirkin
2012-09-04 18:50           ` Or Gerlitz
2012-09-04 19:31             ` Eric W. Biederman
2012-09-04 19:47               ` Or Gerlitz
2012-09-04 21:21             ` Michael S. Tsirkin
2012-09-04 18:57           ` Or Gerlitz
2012-08-01 17:09 ` [PATCH V2 10/12] net/eipoib: Add sysfs support Or Gerlitz
2012-08-01 17:09 ` [PATCH V2 11/12] net/eipoib: Add Makefile, Kconfig and MAINTAINERS entries Or Gerlitz
2012-08-01 17:09 ` [PATCH V2 12/12] IB/ipoib: Add support for transmission of skbs w.o dst/neighbour Or Gerlitz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.