[PATCH v4 0/2] Enable virtio_net to act as a backup for a passthru device

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 0/2] Enable virtio_net to act as a backup for a passthru device
@ 2018-03-01 20:08 ` Sridhar Samudrala
  0 siblings, 0 replies; 102+ messages in thread
From: Sridhar Samudrala @ 2018-03-01 20:08 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, jiri, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, sridhar.samudrala

Based on the review comments on the v3 version of the RFC patch and
Jakub's suggestion for the naming issue with 3 netdev solution, we are now 
proposing 3 netdev in-driver bonding solution for virtio-net.

The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by hypervisor to indicate that virtio_net interface should act as
a backup for another device with the same MAC address.

Patch 2 extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'active' netdev gets registered. 

The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

Most of the code added is implemented as new functions that are isolated
from the existing virtio_net functionality and it can be moved to a new
driver to implement bypass functionality if we decide to make it common
with other para virtual drivers like netvsc.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Sridhar Samudrala (2):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  virtio_net: Extend virtio to use VF datapath when available

 drivers/net/virtio_net.c        | 682 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_net.h |   3 +
 2 files changed, 683 insertions(+), 2 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] [PATCH v4 0/2] Enable virtio_net to act as a backup for a passthru device
@ 2018-03-01 20:08 ` Sridhar Samudrala
  0 siblings, 0 replies; 102+ messages in thread
From: Sridhar Samudrala @ 2018-03-01 20:08 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, jiri, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, sridhar.samudrala

Based on the review comments on the v3 version of the RFC patch and
Jakub's suggestion for the naming issue with 3 netdev solution, we are now 
proposing 3 netdev in-driver bonding solution for virtio-net.

The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by hypervisor to indicate that virtio_net interface should act as
a backup for another device with the same MAC address.

Patch 2 extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'active' netdev gets registered. 

The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

Most of the code added is implemented as new functions that are isolated
from the existing virtio_net functionality and it can be moved to a new
driver to implement bypass functionality if we decide to make it common
with other para virtual drivers like netvsc.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Sridhar Samudrala (2):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  virtio_net: Extend virtio to use VF datapath when available

 drivers/net/virtio_net.c        | 682 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_net.h |   3 +
 2 files changed, 683 insertions(+), 2 deletions(-)

-- 
2.14.3

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v4 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  2018-03-01 20:08 ` [virtio-dev] " Sridhar Samudrala
@ 2018-03-01 20:08   ` Sridhar Samudrala
  -1 siblings, 0 replies; 102+ messages in thread
From: Sridhar Samudrala @ 2018-03-01 20:08 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, jiri, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, sridhar.samudrala

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a backup for another device with the same MAC address.

VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 626c27352ae2..bcd13fe906ca 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2920,7 +2920,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP	  62	/* Act as backup for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [virtio-dev] [PATCH v4 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
@ 2018-03-01 20:08   ` Sridhar Samudrala
  0 siblings, 0 replies; 102+ messages in thread
From: Sridhar Samudrala @ 2018-03-01 20:08 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, jiri, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, sridhar.samudrala

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a backup for another device with the same MAC address.

VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 626c27352ae2..bcd13fe906ca 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2920,7 +2920,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP	  62	/* Act as backup for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-01 20:08 ` [virtio-dev] " Sridhar Samudrala
@ 2018-03-01 20:08   ` Sridhar Samudrala
  -1 siblings, 0 replies; 102+ messages in thread
From: Sridhar Samudrala @ 2018-03-01 20:08 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, jiri, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, sridhar.samudrala

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

When BACKUP feature is enabled, an additional netdev(bypass netdev) is
created that acts as a master device and tracks the state of the 2 lower
netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
passthru device with the same MAC is registered as 'active' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
---
 drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 682 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bcd13fe906ca..f2860d86c952 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,6 +30,8 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
 #include <net/route.h>
 #include <net/xdp.h>
 
@@ -206,6 +208,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* upper netdev created when BACKUP feature enabled */
+	struct net_device *bypass_netdev;
 };
 
 struct padded_vnet_hdr {
@@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+				      size_t len)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int ret;
+
+	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
+		return -EOPNOTSUPP;
+
+	ret = snprintf(buf, len, "_bkup");
+	if (ret >= len)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
 	.ndo_xdp_flush		= virtnet_xdp_flush,
 	.ndo_features_check	= passthru_features_check,
+	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
 	return 0;
 }
 
+/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
+ * When BACKUP feature is enabled, an additional netdev(bypass netdev)
+ * is created that acts as a master device and tracks the state of the
+ * 2 lower netdevs. The original virtio_net netdev is registered as
+ * 'backup' netdev and a passthru device with the same MAC is registered
+ * as 'active' netdev.
+ */
+
+/* bypass state maintained when BACKUP feature is enabled */
+struct virtnet_bypass_info {
+	/* passthru netdev with same MAC */
+	struct net_device __rcu *active_netdev;
+
+	/* virtio_net netdev */
+	struct net_device __rcu *backup_netdev;
+
+	/* active netdev stats */
+	struct rtnl_link_stats64 active_stats;
+
+	/* backup netdev stats */
+	struct rtnl_link_stats64 backup_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 bypass_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
+static void virtnet_bypass_child_open(struct net_device *dev,
+				      struct net_device *child_netdev)
+{
+	int err = dev_open(child_netdev);
+
+	if (err)
+		netdev_warn(dev, "unable to open slave: %s: %d\n",
+			    child_netdev->name, err);
+}
+
+static int virtnet_bypass_open(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	netif_carrier_off(dev);
+	netif_tx_wake_all_queues(dev);
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (child_netdev)
+		virtnet_bypass_child_open(dev, child_netdev);
+
+	child_netdev = rtnl_dereference(vbi->backup_netdev);
+	if (child_netdev)
+		virtnet_bypass_child_open(dev, child_netdev);
+
+	return 0;
+}
+
+static int virtnet_bypass_close(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	netif_tx_disable(dev);
+
+	child_netdev = rtnl_dereference(vi->active_netdev);
+	if (child_netdev)
+		dev_close(child_netdev);
+
+	child_netdev = rtnl_dereference(vi->backup_netdev);
+	if (child_netdev)
+		dev_close(child_netdev);
+
+	return 0;
+}
+
+static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
+					    struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+static bool virtnet_bypass_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
+					     struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via active netdev followed by backup netdev */
+	xmit_dev = rcu_dereference_bh(vbi->active_netdev);
+	if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
+		if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
+			return virtnet_bypass_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+
+static u16 virtnet_bypass_select_queue(struct net_device *dev,
+				       struct sk_buff *skb, void *accel_priv,
+				       select_queue_fallback_t fallback)
+{
+	/* This helper function exists to help dev_pick_tx get the correct
+	 * destination queue.  Using a helper function skips a call to
+	 * skb_tx_hash and will put the skbs in the queue we expect on their
+	 * way down to the bonding driver.
+	 */
+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
+				      const struct rtnl_link_stats64 *_new,
+				      const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+static void virtnet_bypass_get_stats(struct net_device *dev,
+				     struct rtnl_link_stats64 *stats)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *child_netdev;
+
+	spin_lock(&vbi->stats_lock);
+	memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		new = dev_get_stats(child_netdev, &temp);
+		virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
+		memcpy(&vbi->active_stats, new, sizeof(*new));
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		new = dev_get_stats(child_netdev, &temp);
+		virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
+		memcpy(&vbi->backup_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
+	spin_unlock(&vbi->stats_lock);
+}
+
+static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+	int ret = 0;
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		ret = dev_set_mtu(child_netdev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		ret = dev_set_mtu(child_netdev, new_mtu);
+		if (ret)
+			netdev_err(child_netdev,
+				   "Unexpected failure to set mtu to %d\n",
+				   new_mtu);
+	}
+
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static void virtnet_bypass_set_rx_mode(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	rcu_read_lock();
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		dev_uc_sync_multiple(child_netdev, dev);
+		dev_mc_sync_multiple(child_netdev, dev);
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		dev_uc_sync_multiple(child_netdev, dev);
+		dev_mc_sync_multiple(child_netdev, dev);
+	}
+
+	rcu_read_unlock();
+}
+
+static const struct net_device_ops virtnet_bypass_netdev_ops = {
+	.ndo_open		= virtnet_bypass_open,
+	.ndo_stop		= virtnet_bypass_close,
+	.ndo_start_xmit		= virtnet_bypass_start_xmit,
+	.ndo_select_queue	= virtnet_bypass_select_queue,
+	.ndo_get_stats64	= virtnet_bypass_get_stats,
+	.ndo_change_mtu		= virtnet_bypass_change_mtu,
+	.ndo_set_rx_mode	= virtnet_bypass_set_rx_mode,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+static int
+virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
+					  struct ethtool_link_ksettings *cmd)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+		child_netdev = rtnl_dereference(vbi->backup_netdev);
+		if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(child_netdev, cmd);
+}
+
+#define BYPASS_DRV_NAME "virtnet_bypass"
+#define BYPASS_DRV_VERSION "0.1"
+
+static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
+					       struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
+	.get_drvinfo            = virtnet_bypass_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = virtnet_bypass_ethtool_get_link_ksettings,
+};
+
+static struct net_device *get_virtnet_bypass_bymac(struct net *net,
+						   const u8 *mac)
+{
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(net, dev) {
+		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+			continue;       /* not a virtnet_bypass device */
+
+		if (ether_addr_equal(mac, dev->perm_addr))
+			return dev;
+	}
+
+	return NULL;
+}
+
+static struct net_device *
+get_virtnet_bypass_byref(struct net_device *child_netdev)
+{
+	struct net *net = dev_net(child_netdev);
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(net, dev) {
+		struct virtnet_bypass_info *vbi;
+
+		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+			continue;       /* not a virtnet_bypass device */
+
+		vbi = netdev_priv(dev);
+
+		if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
+		    (rtnl_dereference(vbi->backup_netdev) == child_netdev))
+			return dev;	/* a match */
+	}
+
+	return NULL;
+}
+
+/* Called when child dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+	skb->dev = ndev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+static int virtnet_bypass_register_child(struct net_device *child_netdev)
+{
+	struct virtnet_bypass_info *vbi;
+	struct net_device *dev;
+	bool backup;
+	int ret;
+
+	if (child_netdev->addr_len != ETH_ALEN)
+		return NOTIFY_DONE;
+
+	/* We will use the MAC address to locate the virtnet_bypass netdev
+	 * to associate with the child netdev. If we don't find a matching
+	 * bypass netdev, move on.
+	 */
+	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
+				       child_netdev->perm_addr);
+	if (!dev)
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+	backup = (child_netdev->dev.parent == dev->dev.parent);
+	if (backup ? rtnl_dereference(vbi->backup_netdev) :
+			rtnl_dereference(vbi->active_netdev)) {
+		netdev_info(dev,
+			    "%s attempting to join bypass dev when %s already present\n",
+			    child_netdev->name, backup ? "backup" : "active");
+		return NOTIFY_DONE;
+	}
+
+	/* Avoid non pci devices as active netdev */
+	if (!backup && (!child_netdev->dev.parent ||
+			!dev_is_pci(child_netdev->dev.parent)))
+		return NOTIFY_DONE;
+
+	ret = netdev_rx_handler_register(child_netdev,
+					 virtnet_bypass_handle_frame, dev);
+	if (ret != 0) {
+		netdev_err(child_netdev,
+			   "can not register bypass receive handler (err = %d)\n",
+			   ret);
+		goto rx_handler_failed;
+	}
+
+	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
+	if (ret != 0) {
+		netdev_err(child_netdev,
+			   "can not set master device %s (err = %d)\n",
+			   dev->name, ret);
+		goto upper_link_failed;
+	}
+
+	child_netdev->flags |= IFF_SLAVE;
+
+	if (netif_running(dev)) {
+		ret = dev_open(child_netdev);
+		if (ret && (ret != -EBUSY)) {
+			netdev_err(dev, "Opening child %s failed ret:%d\n",
+				   child_netdev->name, ret);
+			goto err_interface_up;
+		}
+	}
+
+	/* Align MTU of child with master */
+	ret = dev_set_mtu(child_netdev, dev->mtu);
+	if (ret) {
+		netdev_err(dev,
+			   "unable to change mtu of %s to %u register failed\n",
+			   child_netdev->name, dev->mtu);
+		goto err_set_mtu;
+	}
+
+	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
+
+	netdev_info(dev, "registering %s\n", child_netdev->name);
+
+	dev_hold(child_netdev);
+	if (backup) {
+		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
+		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
+	} else {
+		rcu_assign_pointer(vbi->active_netdev, child_netdev);
+		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
+		dev->min_mtu = child_netdev->min_mtu;
+		dev->max_mtu = child_netdev->max_mtu;
+	}
+
+	return NOTIFY_OK;
+
+err_set_mtu:
+	dev_close(child_netdev);
+err_interface_up:
+	netdev_upper_dev_unlink(child_netdev, dev);
+	child_netdev->flags &= ~IFF_SLAVE;
+upper_link_failed:
+	netdev_rx_handler_unregister(child_netdev);
+rx_handler_failed:
+	return NOTIFY_DONE;
+}
+
+static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
+{
+	struct virtnet_bypass_info *vbi;
+	struct net_device *dev, *backup;
+
+	dev = get_virtnet_bypass_byref(child_netdev);
+	if (!dev)
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+
+	netdev_info(dev, "unregistering %s\n", child_netdev->name);
+
+	netdev_rx_handler_unregister(child_netdev);
+	netdev_upper_dev_unlink(child_netdev, dev);
+	child_netdev->flags &= ~IFF_SLAVE;
+
+	if (child_netdev->dev.parent == dev->dev.parent) {
+		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
+	} else {
+		RCU_INIT_POINTER(vbi->active_netdev, NULL);
+		backup = rtnl_dereference(vbi->backup_netdev);
+		if (backup) {
+			dev->min_mtu = backup->min_mtu;
+			dev->max_mtu = backup->max_mtu;
+		}
+	}
+
+	dev_put(child_netdev);
+
+	return NOTIFY_OK;
+}
+
+static int virtnet_bypass_update_link(struct net_device *child_netdev)
+{
+	struct net_device *dev, *active, *backup;
+	struct virtnet_bypass_info *vbi;
+
+	dev = get_virtnet_bypass_byref(child_netdev);
+	if (!dev || !netif_running(dev))
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+
+	active = rtnl_dereference(vbi->active_netdev);
+	backup = rtnl_dereference(vbi->backup_netdev);
+
+	if ((active && virtnet_bypass_xmit_ready(active)) ||
+	    (backup && virtnet_bypass_xmit_ready(backup))) {
+		netif_carrier_on(dev);
+		netif_tx_wake_all_queues(dev);
+	} else {
+		netif_carrier_off(dev);
+		netif_tx_stop_all_queues(dev);
+	}
+
+	return NOTIFY_OK;
+}
+
+static int virtnet_bypass_event(struct notifier_block *this,
+				unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip our own events */
+	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
+		return NOTIFY_DONE;
+
+	/* Avoid non-Ethernet type devices */
+	if (event_dev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	/* Avoid Vlan dev with same MAC registering as child dev */
+	if (is_vlan_dev(event_dev))
+		return NOTIFY_DONE;
+
+	/* Avoid Bonding master dev with same MAC registering as child dev */
+	if ((event_dev->priv_flags & IFF_BONDING) &&
+	    (event_dev->flags & IFF_MASTER))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return virtnet_bypass_register_child(event_dev);
+	case NETDEV_UNREGISTER:
+		return virtnet_bypass_unregister_child(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return virtnet_bypass_update_link(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block virtnet_bypass_notifier = {
+	.notifier_call = virtnet_bypass_event,
+};
+
+static int virtnet_bypass_create(struct virtnet_info *vi)
+{
+	struct net_device *backup_netdev = vi->dev;
+	struct device *dev = &vi->vdev->dev;
+	struct net_device *bypass_netdev;
+	int res;
+
+	/* Alloc at least 2 queues, for now we are going with 16 assuming
+	 * that most devices being bonded won't have too many queues.
+	 */
+	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
+					  16);
+	if (!bypass_netdev) {
+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
+		return -ENOMEM;
+	}
+
+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
+	SET_NETDEV_DEV(bypass_netdev, dev);
+
+	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
+	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
+
+	/* Initialize the device options */
+	bypass_netdev->flags |= IFF_MASTER;
+	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
+				     IFF_NO_QUEUE;
+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
+	bypass_netdev->features |= NETIF_F_LLTX;
+
+	/* Don't allow bypass devices to change network namespaces. */
+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
+
+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	bypass_netdev->features |= bypass_netdev->hw_features;
+
+	/* For now treat bypass netdev as VLAN challenged since we
+	 * cannot assume VLAN functionality with a VF
+	 */
+	bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
+
+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
+	       bypass_netdev->addr_len);
+
+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
+
+	res = register_netdev(bypass_netdev);
+	if (res < 0) {
+		dev_err(dev, "Unable to register bypass_netdev!\n");
+		free_netdev(bypass_netdev);
+		return res;
+	}
+
+	netif_carrier_off(bypass_netdev);
+
+	vi->bypass_netdev = bypass_netdev;
+
+	return 0;
+}
+
+static void virtnet_bypass_destroy(struct virtnet_info *vi)
+{
+	struct net_device *bypass_netdev = vi->bypass_netdev;
+	struct virtnet_bypass_info *vbi;
+	struct net_device *child_netdev;
+
+	/* no device found, nothing to free */
+	if (!bypass_netdev)
+		return;
+
+	vbi = netdev_priv(bypass_netdev);
+
+	netif_device_detach(bypass_netdev);
+
+	rtnl_lock();
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (child_netdev)
+		virtnet_bypass_unregister_child(child_netdev);
+
+	child_netdev = rtnl_dereference(vbi->backup_netdev);
+	if (child_netdev)
+		virtnet_bypass_unregister_child(child_netdev);
+
+	unregister_netdevice(bypass_netdev);
+
+	rtnl_unlock();
+
+	free_netdev(bypass_netdev);
+}
+
+/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err = -ENOMEM;
@@ -2797,10 +3466,15 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
+		if (virtnet_bypass_create(vi) != 0)
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_bypass;
 	}
 
 	virtio_device_ready(vdev);
@@ -2837,6 +3511,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_bypass:
+	virtnet_bypass_destroy(vi);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2871,6 +3547,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	virtnet_bypass_destroy(vi);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
@@ -2968,6 +3646,8 @@ static __init int virtio_net_driver_init(void)
         ret = register_virtio_driver(&virtio_net_driver);
 	if (ret)
 		goto err_virtio;
+
+	register_netdevice_notifier(&virtnet_bypass_notifier);
 	return 0;
 err_virtio:
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
@@ -2980,6 +3660,7 @@ module_init(virtio_net_driver_init);
 
 static __exit void virtio_net_driver_exit(void)
 {
+	unregister_netdevice_notifier(&virtnet_bypass_notifier);
 	unregister_virtio_driver(&virtio_net_driver);
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
 	cpuhp_remove_multi_state(virtionet_online);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [virtio-dev] [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-01 20:08   ` Sridhar Samudrala
  0 siblings, 0 replies; 102+ messages in thread
From: Sridhar Samudrala @ 2018-03-01 20:08 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, jiri, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, sridhar.samudrala

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

When BACKUP feature is enabled, an additional netdev(bypass netdev) is
created that acts as a master device and tracks the state of the 2 lower
netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
passthru device with the same MAC is registered as 'active' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
---
 drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 682 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bcd13fe906ca..f2860d86c952 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,6 +30,8 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
 #include <net/route.h>
 #include <net/xdp.h>
 
@@ -206,6 +208,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* upper netdev created when BACKUP feature enabled */
+	struct net_device *bypass_netdev;
 };
 
 struct padded_vnet_hdr {
@@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+				      size_t len)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int ret;
+
+	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
+		return -EOPNOTSUPP;
+
+	ret = snprintf(buf, len, "_bkup");
+	if (ret >= len)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
 	.ndo_xdp_flush		= virtnet_xdp_flush,
 	.ndo_features_check	= passthru_features_check,
+	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
 	return 0;
 }
 
+/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
+ * When BACKUP feature is enabled, an additional netdev(bypass netdev)
+ * is created that acts as a master device and tracks the state of the
+ * 2 lower netdevs. The original virtio_net netdev is registered as
+ * 'backup' netdev and a passthru device with the same MAC is registered
+ * as 'active' netdev.
+ */
+
+/* bypass state maintained when BACKUP feature is enabled */
+struct virtnet_bypass_info {
+	/* passthru netdev with same MAC */
+	struct net_device __rcu *active_netdev;
+
+	/* virtio_net netdev */
+	struct net_device __rcu *backup_netdev;
+
+	/* active netdev stats */
+	struct rtnl_link_stats64 active_stats;
+
+	/* backup netdev stats */
+	struct rtnl_link_stats64 backup_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 bypass_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
+static void virtnet_bypass_child_open(struct net_device *dev,
+				      struct net_device *child_netdev)
+{
+	int err = dev_open(child_netdev);
+
+	if (err)
+		netdev_warn(dev, "unable to open slave: %s: %d\n",
+			    child_netdev->name, err);
+}
+
+static int virtnet_bypass_open(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	netif_carrier_off(dev);
+	netif_tx_wake_all_queues(dev);
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (child_netdev)
+		virtnet_bypass_child_open(dev, child_netdev);
+
+	child_netdev = rtnl_dereference(vbi->backup_netdev);
+	if (child_netdev)
+		virtnet_bypass_child_open(dev, child_netdev);
+
+	return 0;
+}
+
+static int virtnet_bypass_close(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	netif_tx_disable(dev);
+
+	child_netdev = rtnl_dereference(vi->active_netdev);
+	if (child_netdev)
+		dev_close(child_netdev);
+
+	child_netdev = rtnl_dereference(vi->backup_netdev);
+	if (child_netdev)
+		dev_close(child_netdev);
+
+	return 0;
+}
+
+static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
+					    struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+static bool virtnet_bypass_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
+					     struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via active netdev followed by backup netdev */
+	xmit_dev = rcu_dereference_bh(vbi->active_netdev);
+	if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
+		if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
+			return virtnet_bypass_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+
+static u16 virtnet_bypass_select_queue(struct net_device *dev,
+				       struct sk_buff *skb, void *accel_priv,
+				       select_queue_fallback_t fallback)
+{
+	/* This helper function exists to help dev_pick_tx get the correct
+	 * destination queue.  Using a helper function skips a call to
+	 * skb_tx_hash and will put the skbs in the queue we expect on their
+	 * way down to the bonding driver.
+	 */
+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
+				      const struct rtnl_link_stats64 *_new,
+				      const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+static void virtnet_bypass_get_stats(struct net_device *dev,
+				     struct rtnl_link_stats64 *stats)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *child_netdev;
+
+	spin_lock(&vbi->stats_lock);
+	memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		new = dev_get_stats(child_netdev, &temp);
+		virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
+		memcpy(&vbi->active_stats, new, sizeof(*new));
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		new = dev_get_stats(child_netdev, &temp);
+		virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
+		memcpy(&vbi->backup_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
+	spin_unlock(&vbi->stats_lock);
+}
+
+static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+	int ret = 0;
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		ret = dev_set_mtu(child_netdev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		ret = dev_set_mtu(child_netdev, new_mtu);
+		if (ret)
+			netdev_err(child_netdev,
+				   "Unexpected failure to set mtu to %d\n",
+				   new_mtu);
+	}
+
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static void virtnet_bypass_set_rx_mode(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	rcu_read_lock();
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		dev_uc_sync_multiple(child_netdev, dev);
+		dev_mc_sync_multiple(child_netdev, dev);
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		dev_uc_sync_multiple(child_netdev, dev);
+		dev_mc_sync_multiple(child_netdev, dev);
+	}
+
+	rcu_read_unlock();
+}
+
+static const struct net_device_ops virtnet_bypass_netdev_ops = {
+	.ndo_open		= virtnet_bypass_open,
+	.ndo_stop		= virtnet_bypass_close,
+	.ndo_start_xmit		= virtnet_bypass_start_xmit,
+	.ndo_select_queue	= virtnet_bypass_select_queue,
+	.ndo_get_stats64	= virtnet_bypass_get_stats,
+	.ndo_change_mtu		= virtnet_bypass_change_mtu,
+	.ndo_set_rx_mode	= virtnet_bypass_set_rx_mode,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+static int
+virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
+					  struct ethtool_link_ksettings *cmd)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+		child_netdev = rtnl_dereference(vbi->backup_netdev);
+		if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(child_netdev, cmd);
+}
+
+#define BYPASS_DRV_NAME "virtnet_bypass"
+#define BYPASS_DRV_VERSION "0.1"
+
+static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
+					       struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
+	.get_drvinfo            = virtnet_bypass_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = virtnet_bypass_ethtool_get_link_ksettings,
+};
+
+static struct net_device *get_virtnet_bypass_bymac(struct net *net,
+						   const u8 *mac)
+{
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(net, dev) {
+		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+			continue;       /* not a virtnet_bypass device */
+
+		if (ether_addr_equal(mac, dev->perm_addr))
+			return dev;
+	}
+
+	return NULL;
+}
+
+static struct net_device *
+get_virtnet_bypass_byref(struct net_device *child_netdev)
+{
+	struct net *net = dev_net(child_netdev);
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(net, dev) {
+		struct virtnet_bypass_info *vbi;
+
+		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+			continue;       /* not a virtnet_bypass device */
+
+		vbi = netdev_priv(dev);
+
+		if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
+		    (rtnl_dereference(vbi->backup_netdev) == child_netdev))
+			return dev;	/* a match */
+	}
+
+	return NULL;
+}
+
+/* Called when child dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+	skb->dev = ndev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+static int virtnet_bypass_register_child(struct net_device *child_netdev)
+{
+	struct virtnet_bypass_info *vbi;
+	struct net_device *dev;
+	bool backup;
+	int ret;
+
+	if (child_netdev->addr_len != ETH_ALEN)
+		return NOTIFY_DONE;
+
+	/* We will use the MAC address to locate the virtnet_bypass netdev
+	 * to associate with the child netdev. If we don't find a matching
+	 * bypass netdev, move on.
+	 */
+	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
+				       child_netdev->perm_addr);
+	if (!dev)
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+	backup = (child_netdev->dev.parent == dev->dev.parent);
+	if (backup ? rtnl_dereference(vbi->backup_netdev) :
+			rtnl_dereference(vbi->active_netdev)) {
+		netdev_info(dev,
+			    "%s attempting to join bypass dev when %s already present\n",
+			    child_netdev->name, backup ? "backup" : "active");
+		return NOTIFY_DONE;
+	}
+
+	/* Avoid non pci devices as active netdev */
+	if (!backup && (!child_netdev->dev.parent ||
+			!dev_is_pci(child_netdev->dev.parent)))
+		return NOTIFY_DONE;
+
+	ret = netdev_rx_handler_register(child_netdev,
+					 virtnet_bypass_handle_frame, dev);
+	if (ret != 0) {
+		netdev_err(child_netdev,
+			   "can not register bypass receive handler (err = %d)\n",
+			   ret);
+		goto rx_handler_failed;
+	}
+
+	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
+	if (ret != 0) {
+		netdev_err(child_netdev,
+			   "can not set master device %s (err = %d)\n",
+			   dev->name, ret);
+		goto upper_link_failed;
+	}
+
+	child_netdev->flags |= IFF_SLAVE;
+
+	if (netif_running(dev)) {
+		ret = dev_open(child_netdev);
+		if (ret && (ret != -EBUSY)) {
+			netdev_err(dev, "Opening child %s failed ret:%d\n",
+				   child_netdev->name, ret);
+			goto err_interface_up;
+		}
+	}
+
+	/* Align MTU of child with master */
+	ret = dev_set_mtu(child_netdev, dev->mtu);
+	if (ret) {
+		netdev_err(dev,
+			   "unable to change mtu of %s to %u register failed\n",
+			   child_netdev->name, dev->mtu);
+		goto err_set_mtu;
+	}
+
+	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
+
+	netdev_info(dev, "registering %s\n", child_netdev->name);
+
+	dev_hold(child_netdev);
+	if (backup) {
+		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
+		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
+	} else {
+		rcu_assign_pointer(vbi->active_netdev, child_netdev);
+		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
+		dev->min_mtu = child_netdev->min_mtu;
+		dev->max_mtu = child_netdev->max_mtu;
+	}
+
+	return NOTIFY_OK;
+
+err_set_mtu:
+	dev_close(child_netdev);
+err_interface_up:
+	netdev_upper_dev_unlink(child_netdev, dev);
+	child_netdev->flags &= ~IFF_SLAVE;
+upper_link_failed:
+	netdev_rx_handler_unregister(child_netdev);
+rx_handler_failed:
+	return NOTIFY_DONE;
+}
+
+static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
+{
+	struct virtnet_bypass_info *vbi;
+	struct net_device *dev, *backup;
+
+	dev = get_virtnet_bypass_byref(child_netdev);
+	if (!dev)
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+
+	netdev_info(dev, "unregistering %s\n", child_netdev->name);
+
+	netdev_rx_handler_unregister(child_netdev);
+	netdev_upper_dev_unlink(child_netdev, dev);
+	child_netdev->flags &= ~IFF_SLAVE;
+
+	if (child_netdev->dev.parent == dev->dev.parent) {
+		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
+	} else {
+		RCU_INIT_POINTER(vbi->active_netdev, NULL);
+		backup = rtnl_dereference(vbi->backup_netdev);
+		if (backup) {
+			dev->min_mtu = backup->min_mtu;
+			dev->max_mtu = backup->max_mtu;
+		}
+	}
+
+	dev_put(child_netdev);
+
+	return NOTIFY_OK;
+}
+
+static int virtnet_bypass_update_link(struct net_device *child_netdev)
+{
+	struct net_device *dev, *active, *backup;
+	struct virtnet_bypass_info *vbi;
+
+	dev = get_virtnet_bypass_byref(child_netdev);
+	if (!dev || !netif_running(dev))
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+
+	active = rtnl_dereference(vbi->active_netdev);
+	backup = rtnl_dereference(vbi->backup_netdev);
+
+	if ((active && virtnet_bypass_xmit_ready(active)) ||
+	    (backup && virtnet_bypass_xmit_ready(backup))) {
+		netif_carrier_on(dev);
+		netif_tx_wake_all_queues(dev);
+	} else {
+		netif_carrier_off(dev);
+		netif_tx_stop_all_queues(dev);
+	}
+
+	return NOTIFY_OK;
+}
+
+static int virtnet_bypass_event(struct notifier_block *this,
+				unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip our own events */
+	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
+		return NOTIFY_DONE;
+
+	/* Avoid non-Ethernet type devices */
+	if (event_dev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	/* Avoid Vlan dev with same MAC registering as child dev */
+	if (is_vlan_dev(event_dev))
+		return NOTIFY_DONE;
+
+	/* Avoid Bonding master dev with same MAC registering as child dev */
+	if ((event_dev->priv_flags & IFF_BONDING) &&
+	    (event_dev->flags & IFF_MASTER))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return virtnet_bypass_register_child(event_dev);
+	case NETDEV_UNREGISTER:
+		return virtnet_bypass_unregister_child(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return virtnet_bypass_update_link(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block virtnet_bypass_notifier = {
+	.notifier_call = virtnet_bypass_event,
+};
+
+static int virtnet_bypass_create(struct virtnet_info *vi)
+{
+	struct net_device *backup_netdev = vi->dev;
+	struct device *dev = &vi->vdev->dev;
+	struct net_device *bypass_netdev;
+	int res;
+
+	/* Alloc at least 2 queues, for now we are going with 16 assuming
+	 * that most devices being bonded won't have too many queues.
+	 */
+	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
+					  16);
+	if (!bypass_netdev) {
+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
+		return -ENOMEM;
+	}
+
+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
+	SET_NETDEV_DEV(bypass_netdev, dev);
+
+	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
+	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
+
+	/* Initialize the device options */
+	bypass_netdev->flags |= IFF_MASTER;
+	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
+				     IFF_NO_QUEUE;
+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
+	bypass_netdev->features |= NETIF_F_LLTX;
+
+	/* Don't allow bypass devices to change network namespaces. */
+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
+
+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	bypass_netdev->features |= bypass_netdev->hw_features;
+
+	/* For now treat bypass netdev as VLAN challenged since we
+	 * cannot assume VLAN functionality with a VF
+	 */
+	bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
+
+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
+	       bypass_netdev->addr_len);
+
+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
+
+	res = register_netdev(bypass_netdev);
+	if (res < 0) {
+		dev_err(dev, "Unable to register bypass_netdev!\n");
+		free_netdev(bypass_netdev);
+		return res;
+	}
+
+	netif_carrier_off(bypass_netdev);
+
+	vi->bypass_netdev = bypass_netdev;
+
+	return 0;
+}
+
+static void virtnet_bypass_destroy(struct virtnet_info *vi)
+{
+	struct net_device *bypass_netdev = vi->bypass_netdev;
+	struct virtnet_bypass_info *vbi;
+	struct net_device *child_netdev;
+
+	/* no device found, nothing to free */
+	if (!bypass_netdev)
+		return;
+
+	vbi = netdev_priv(bypass_netdev);
+
+	netif_device_detach(bypass_netdev);
+
+	rtnl_lock();
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (child_netdev)
+		virtnet_bypass_unregister_child(child_netdev);
+
+	child_netdev = rtnl_dereference(vbi->backup_netdev);
+	if (child_netdev)
+		virtnet_bypass_unregister_child(child_netdev);
+
+	unregister_netdevice(bypass_netdev);
+
+	rtnl_unlock();
+
+	free_netdev(bypass_netdev);
+}
+
+/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err = -ENOMEM;
@@ -2797,10 +3466,15 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
+		if (virtnet_bypass_create(vi) != 0)
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_bypass;
 	}
 
 	virtio_device_ready(vdev);
@@ -2837,6 +3511,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_bypass:
+	virtnet_bypass_destroy(vi);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2871,6 +3547,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	virtnet_bypass_destroy(vi);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
@@ -2968,6 +3646,8 @@ static __init int virtio_net_driver_init(void)
         ret = register_virtio_driver(&virtio_net_driver);
 	if (ret)
 		goto err_virtio;
+
+	register_netdevice_notifier(&virtnet_bypass_notifier);
 	return 0;
 err_virtio:
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
@@ -2980,6 +3660,7 @@ module_init(virtio_net_driver_init);
 
 static __exit void virtio_net_driver_exit(void)
 {
+	unregister_netdevice_notifier(&virtnet_bypass_notifier);
 	unregister_virtio_driver(&virtio_net_driver);
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
 	cpuhp_remove_multi_state(virtionet_online);
-- 
2.14.3


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-01 20:08   ` [virtio-dev] " Sridhar Samudrala
  (?)
@ 2018-03-02  8:36   ` Jiri Pirko
  2018-03-02 15:26       ` [virtio-dev] " Alexander Duyck
  -1 siblings, 1 reply; 102+ messages in thread
From: Jiri Pirko @ 2018-03-02  8:36 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici

Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF
>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to
>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>created that acts as a master device and tracks the state of the 2 lower
>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>passthru device with the same MAC is registered as 'active' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>---
> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-

As I wrote to the discussion of the other version of this patchset,
I strongly believe you need to have a common part in net/core/ that will
be shared by both virtio_net and netvsc. There there should be explicit
limits for this in-driver bonding solution, like max 1 vf slave, etc.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02  8:36   ` Jiri Pirko
@ 2018-03-02 15:26       ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-02 15:26 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Sridhar Samudrala, Michael S. Tsirkin, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>>This patch enables virtio_net to switch over to a VF datapath when a VF
>>netdev is present with the same MAC address. It allows live migration
>>of a VM with a direct attached VF without the need to setup a bond/team
>>between a VF and virtio net device in the guest.
>>
>>The hypervisor needs to enable only one datapath at any time so that
>>packets don't get looped back to the VM over the other datapath. When a VF
>>is plugged, the virtio datapath link state can be marked as down. The
>>hypervisor needs to unplug the VF device from the guest on the source host
>>and reset the MAC filter of the VF to initiate failover of datapath to
>>virtio before starting the migration. After the migration is completed,
>>the destination hypervisor sets the MAC filter on the VF and plugs it back
>>to the guest to switch over to VF datapath.
>>
>>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>created that acts as a master device and tracks the state of the 2 lower
>>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>passthru device with the same MAC is registered as 'active' netdev.
>>
>>This patch is based on the discussion initiated by Jesse on this thread.
>>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>---
>> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>
> As I wrote to the discussion of the other version of this patchset,
> I strongly believe you need to have a common part in net/core/ that will
> be shared by both virtio_net and netvsc. There there should be explicit
> limits for this in-driver bonding solution, like max 1 vf slave, etc.

Yeah, this code essentially calls out the "shareable" code with a
comment at the start and end of the section what defines the
virtio_bypass functionality. It would just be a matter of mostly
cutting and pasting to put it into a separate driver module.

The design limits things to a 1:1 relationship since we just have the
child and backup pointers, but I don't think I am seeing exception
handling to prevent us from overwriting the child pointers so there
may be a leak there.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 15:26       ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-02 15:26 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Sridhar Samudrala, Michael S. Tsirkin, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>>This patch enables virtio_net to switch over to a VF datapath when a VF
>>netdev is present with the same MAC address. It allows live migration
>>of a VM with a direct attached VF without the need to setup a bond/team
>>between a VF and virtio net device in the guest.
>>
>>The hypervisor needs to enable only one datapath at any time so that
>>packets don't get looped back to the VM over the other datapath. When a VF
>>is plugged, the virtio datapath link state can be marked as down. The
>>hypervisor needs to unplug the VF device from the guest on the source host
>>and reset the MAC filter of the VF to initiate failover of datapath to
>>virtio before starting the migration. After the migration is completed,
>>the destination hypervisor sets the MAC filter on the VF and plugs it back
>>to the guest to switch over to VF datapath.
>>
>>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>created that acts as a master device and tracks the state of the 2 lower
>>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>passthru device with the same MAC is registered as 'active' netdev.
>>
>>This patch is based on the discussion initiated by Jesse on this thread.
>>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>---
>> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>
> As I wrote to the discussion of the other version of this patchset,
> I strongly believe you need to have a common part in net/core/ that will
> be shared by both virtio_net and netvsc. There there should be explicit
> limits for this in-driver bonding solution, like max 1 vf slave, etc.

Yeah, this code essentially calls out the "shareable" code with a
comment at the start and end of the section what defines the
virtio_bypass functionality. It would just be a matter of mostly
cutting and pasting to put it into a separate driver module.

The design limits things to a 1:1 relationship since we just have the
child and backup pointers, but I don't think I am seeing exception
handling to prevent us from overwriting the child pointers so there
may be a leak there.

Thanks.

- Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 15:26       ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-03-02 16:20       ` Jiri Pirko
  2018-03-02 16:37           ` [virtio-dev] " Samudrala, Sridhar
  2018-03-02 19:42           ` [virtio-dev] " Michael S. Tsirkin
  -1 siblings, 2 replies; 102+ messages in thread
From: Jiri Pirko @ 2018-03-02 16:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Sridhar Samudrala, Michael S. Tsirkin, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

Fri, Mar 02, 2018 at 04:26:25PM CET, alexander.duyck@gmail.com wrote:
>On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>>>This patch enables virtio_net to switch over to a VF datapath when a VF
>>>netdev is present with the same MAC address. It allows live migration
>>>of a VM with a direct attached VF without the need to setup a bond/team
>>>between a VF and virtio net device in the guest.
>>>
>>>The hypervisor needs to enable only one datapath at any time so that
>>>packets don't get looped back to the VM over the other datapath. When a VF
>>>is plugged, the virtio datapath link state can be marked as down. The
>>>hypervisor needs to unplug the VF device from the guest on the source host
>>>and reset the MAC filter of the VF to initiate failover of datapath to
>>>virtio before starting the migration. After the migration is completed,
>>>the destination hypervisor sets the MAC filter on the VF and plugs it back
>>>to the guest to switch over to VF datapath.
>>>
>>>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>>created that acts as a master device and tracks the state of the 2 lower
>>>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>>passthru device with the same MAC is registered as 'active' netdev.
>>>
>>>This patch is based on the discussion initiated by Jesse on this thread.
>>>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>>Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>>---
>>> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>>
>> As I wrote to the discussion of the other version of this patchset,
>> I strongly believe you need to have a common part in net/core/ that will
>> be shared by both virtio_net and netvsc. There there should be explicit
>> limits for this in-driver bonding solution, like max 1 vf slave, etc.
>
>Yeah, this code essentially calls out the "shareable" code with a
>comment at the start and end of the section what defines the
>virtio_bypass functionality. It would just be a matter of mostly
>cutting and pasting to put it into a separate driver module.

Please put it there and unite the use of it with netvsc.

>
>The design limits things to a 1:1 relationship since we just have the
>child and backup pointers, but I don't think I am seeing exception
>handling to prevent us from overwriting the child pointers so there
>may be a leak there.

Ok.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 16:20       ` Jiri Pirko
@ 2018-03-02 16:37           ` Samudrala, Sridhar
  2018-03-02 19:42           ` [virtio-dev] " Michael S. Tsirkin
  1 sibling, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-02 16:37 UTC (permalink / raw)
  To: Jiri Pirko, Alexander Duyck
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

[-- Attachment #1: Type: text/plain, Size: 3880 bytes --]



On 3/2/2018 8:20 AM, Jiri Pirko wrote:
> Fri, Mar 02, 2018 at 04:26:25PM CET, alexander.duyck@gmail.com wrote:
>> On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>>> netdev is present with the same MAC address. It allows live migration
>>>> of a VM with a direct attached VF without the need to setup a bond/team
>>>> between a VF and virtio net device in the guest.
>>>>
>>>> The hypervisor needs to enable only one datapath at any time so that
>>>> packets don't get looped back to the VM over the other datapath. When a VF
>>>> is plugged, the virtio datapath link state can be marked as down. The
>>>> hypervisor needs to unplug the VF device from the guest on the source host
>>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>>> virtio before starting the migration. After the migration is completed,
>>>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>>>> to the guest to switch over to VF datapath.
>>>>
>>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>>> created that acts as a master device and tracks the state of the 2 lower
>>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>>> passthru device with the same MAC is registered as 'active' netdev.
>>>>
>>>> This patch is based on the discussion initiated by Jesse on this thread.
>>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>>> ---
>>>> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>>> As I wrote to the discussion of the other version of this patchset,
>>> I strongly believe you need to have a common part in net/core/ that will
>>> be shared by both virtio_net and netvsc. There there should be explicit
>>> limits for this in-driver bonding solution, like max 1 vf slave, etc.
>> Yeah, this code essentially calls out the "shareable" code with a
>> comment at the start and end of the section what defines the
>> virtio_bypass functionality. It would just be a matter of mostly
>> cutting and pasting to put it into a separate driver module.
> Please put it there and unite the use of it with netvsc.

Based on Stephen's responses, it is not clear if netvsc would be OK to move to
the 3 netdev solution at this time. When another paravirtual driver would like
to use this solution, we can do refactoring at that time.

>
>> The design limits things to a 1:1 relationship since we just have the
>> child and backup pointers, but I don't think I am seeing exception
>> handling to prevent us from overwriting the child pointers so there
>> may be a leak there.
>
We only allow one active netdev registered as a child. There is a check to
make sure that if an active netdev is registered, another one with same MAC
is not allowed. I don't think there is a leak.

	vbi = netdev_priv(dev);
         backup = (child_netdev->dev.parent == dev->dev.parent);
         if (backup ? rtnl_dereference(vbi->backup_netdev) :
                         rtnl_dereference(vbi->active_netdev)) {
                 netdev_info(dev,
                             "%s attempting to join bypass dev when %s already present\n",
                             child_netdev->name, backup ? "backup" : "active");
                 return NOTIFY_DONE;
         }

         /* Avoid non pci devices as active netdev */
         if (!backup && (!child_netdev->dev.parent ||
                         !dev_is_pci(child_netdev->dev.parent)))
                 return NOTIFY_DONE;

Thanks
Sridhar


[-- Attachment #2: Type: text/html, Size: 40257 bytes --]

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 16:37           ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-02 16:37 UTC (permalink / raw)
  To: Jiri Pirko, Alexander Duyck
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

[-- Attachment #1: Type: text/plain, Size: 3880 bytes --]



On 3/2/2018 8:20 AM, Jiri Pirko wrote:
> Fri, Mar 02, 2018 at 04:26:25PM CET, alexander.duyck@gmail.com wrote:
>> On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>>> netdev is present with the same MAC address. It allows live migration
>>>> of a VM with a direct attached VF without the need to setup a bond/team
>>>> between a VF and virtio net device in the guest.
>>>>
>>>> The hypervisor needs to enable only one datapath at any time so that
>>>> packets don't get looped back to the VM over the other datapath. When a VF
>>>> is plugged, the virtio datapath link state can be marked as down. The
>>>> hypervisor needs to unplug the VF device from the guest on the source host
>>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>>> virtio before starting the migration. After the migration is completed,
>>>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>>>> to the guest to switch over to VF datapath.
>>>>
>>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>>> created that acts as a master device and tracks the state of the 2 lower
>>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>>> passthru device with the same MAC is registered as 'active' netdev.
>>>>
>>>> This patch is based on the discussion initiated by Jesse on this thread.
>>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>>> ---
>>>> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>>> As I wrote to the discussion of the other version of this patchset,
>>> I strongly believe you need to have a common part in net/core/ that will
>>> be shared by both virtio_net and netvsc. There there should be explicit
>>> limits for this in-driver bonding solution, like max 1 vf slave, etc.
>> Yeah, this code essentially calls out the "shareable" code with a
>> comment at the start and end of the section what defines the
>> virtio_bypass functionality. It would just be a matter of mostly
>> cutting and pasting to put it into a separate driver module.
> Please put it there and unite the use of it with netvsc.

Based on Stephen's responses, it is not clear if netvsc would be OK to move to
the 3 netdev solution at this time. When another paravirtual driver would like
to use this solution, we can do refactoring at that time.

>
>> The design limits things to a 1:1 relationship since we just have the
>> child and backup pointers, but I don't think I am seeing exception
>> handling to prevent us from overwriting the child pointers so there
>> may be a leak there.
>
We only allow one active netdev registered as a child. There is a check to
make sure that if an active netdev is registered, another one with same MAC
is not allowed. I don't think there is a leak.

	vbi = netdev_priv(dev);
         backup = (child_netdev->dev.parent == dev->dev.parent);
         if (backup ? rtnl_dereference(vbi->backup_netdev) :
                         rtnl_dereference(vbi->active_netdev)) {
                 netdev_info(dev,
                             "%s attempting to join bypass dev when %s already present\n",
                             child_netdev->name, backup ? "backup" : "active");
                 return NOTIFY_DONE;
         }

         /* Avoid non pci devices as active netdev */
         if (!backup && (!child_netdev->dev.parent ||
                         !dev_is_pci(child_netdev->dev.parent)))
                 return NOTIFY_DONE;

Thanks
Sridhar


[-- Attachment #2: Type: text/html, Size: 40257 bytes --]

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 16:37           ` [virtio-dev] " Samudrala, Sridhar
@ 2018-03-02 17:06             ` Alexander Duyck
  -1 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-02 17:06 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jiri Pirko, Michael S. Tsirkin, Stephen Hemminger, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Fri, Mar 2, 2018 at 8:37 AM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
> On 3/2/2018 8:20 AM, Jiri Pirko wrote:
>
> Fri, Mar 02, 2018 at 04:26:25PM CET, alexander.duyck@gmail.com wrote:
>
> On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
> drivers/net/virtio_net.c | 683
> ++++++++++++++++++++++++++++++++++++++++++++++-
>
> As I wrote to the discussion of the other version of this patchset,
> I strongly believe you need to have a common part in net/core/ that will
> be shared by both virtio_net and netvsc. There there should be explicit
> limits for this in-driver bonding solution, like max 1 vf slave, etc.
>
> Yeah, this code essentially calls out the "shareable" code with a
> comment at the start and end of the section what defines the
> virtio_bypass functionality. It would just be a matter of mostly
> cutting and pasting to put it into a separate driver module.
>
> Please put it there and unite the use of it with netvsc.
>
> Based on Stephen's responses, it is not clear if netvsc would be OK to move
> to
> the 3 netdev solution at this time. When another paravirtual driver would
> like
> to use this solution, we can do refactoring at that time.
>
> The design limits things to a 1:1 relationship since we just have the
> child and backup pointers, but I don't think I am seeing exception
> handling to prevent us from overwriting the child pointers so there
> may be a leak there.
>
>
> We only allow one active netdev registered as a child. There is a check to
> make sure that if an active netdev is registered, another one with same MAC
> is not allowed. I don't think there is a leak.
>
> vbi = netdev_priv(dev);
>         backup = (child_netdev->dev.parent == dev->dev.parent);
>         if (backup ? rtnl_dereference(vbi->backup_netdev) :
>                         rtnl_dereference(vbi->active_netdev)) {
>                 netdev_info(dev,
>                             "%s attempting to join bypass dev when %s
> already present\n",
>                             child_netdev->name, backup ? "backup" :
> "active");
>                 return NOTIFY_DONE;
>         }
>
>         /* Avoid non pci devices as active netdev */
>         if (!backup && (!child_netdev->dev.parent ||
>                         !dev_is_pci(child_netdev->dev.parent)))
>                 return NOTIFY_DONE;
>
> Thanks
> Sridhar

I overlooked that part.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 17:06             ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-02 17:06 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jiri Pirko, Michael S. Tsirkin, Stephen Hemminger, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Fri, Mar 2, 2018 at 8:37 AM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
> On 3/2/2018 8:20 AM, Jiri Pirko wrote:
>
> Fri, Mar 02, 2018 at 04:26:25PM CET, alexander.duyck@gmail.com wrote:
>
> On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
> drivers/net/virtio_net.c | 683
> ++++++++++++++++++++++++++++++++++++++++++++++-
>
> As I wrote to the discussion of the other version of this patchset,
> I strongly believe you need to have a common part in net/core/ that will
> be shared by both virtio_net and netvsc. There there should be explicit
> limits for this in-driver bonding solution, like max 1 vf slave, etc.
>
> Yeah, this code essentially calls out the "shareable" code with a
> comment at the start and end of the section what defines the
> virtio_bypass functionality. It would just be a matter of mostly
> cutting and pasting to put it into a separate driver module.
>
> Please put it there and unite the use of it with netvsc.
>
> Based on Stephen's responses, it is not clear if netvsc would be OK to move
> to
> the 3 netdev solution at this time. When another paravirtual driver would
> like
> to use this solution, we can do refactoring at that time.
>
> The design limits things to a 1:1 relationship since we just have the
> child and backup pointers, but I don't think I am seeing exception
> handling to prevent us from overwriting the child pointers so there
> may be a leak there.
>
>
> We only allow one active netdev registered as a child. There is a check to
> make sure that if an active netdev is registered, another one with same MAC
> is not allowed. I don't think there is a leak.
>
> vbi = netdev_priv(dev);
>         backup = (child_netdev->dev.parent == dev->dev.parent);
>         if (backup ? rtnl_dereference(vbi->backup_netdev) :
>                         rtnl_dereference(vbi->active_netdev)) {
>                 netdev_info(dev,
>                             "%s attempting to join bypass dev when %s
> already present\n",
>                             child_netdev->name, backup ? "backup" :
> "active");
>                 return NOTIFY_DONE;
>         }
>
>         /* Avoid non pci devices as active netdev */
>         if (!backup && (!child_netdev->dev.parent ||
>                         !dev_is_pci(child_netdev->dev.parent)))
>                 return NOTIFY_DONE;
>
> Thanks
> Sridhar

I overlooked that part.

Thanks.

- Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 15:26       ` [virtio-dev] " Alexander Duyck
@ 2018-03-02 19:41         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 19:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiri Pirko, Sridhar Samudrala, Stephen Hemminger, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> The design limits things to a 1:1 relationship since we just have the
> child and backup pointers, but I don't think I am seeing exception
> handling to prevent us from overwriting the child pointers so there
> may be a leak there.
> 
> Thanks.
> 
> - Alex

In fact maintaining a list in that case would be nicer, and
just use an arbitrary one.
E.g. one can see how a user wanting to swap device 1 for device 2
might first add device 2 with same MAC then drop device 1.

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 19:41         ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 19:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiri Pirko, Sridhar Samudrala, Stephen Hemminger, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> The design limits things to a 1:1 relationship since we just have the
> child and backup pointers, but I don't think I am seeing exception
> handling to prevent us from overwriting the child pointers so there
> may be a leak there.
> 
> Thanks.
> 
> - Alex

In fact maintaining a list in that case would be nicer, and
just use an arbitrary one.
E.g. one can see how a user wanting to swap device 1 for device 2
might first add device 2 with same MAC then drop device 1.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 16:20       ` Jiri Pirko
@ 2018-03-02 19:42           ` Michael S. Tsirkin
  2018-03-02 19:42           ` [virtio-dev] " Michael S. Tsirkin
  1 sibling, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 19:42 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexander Duyck, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
> >Yeah, this code essentially calls out the "shareable" code with a
> >comment at the start and end of the section what defines the
> >virtio_bypass functionality. It would just be a matter of mostly
> >cutting and pasting to put it into a separate driver module.
> 
> Please put it there and unite the use of it with netvsc.

Surely, adding this to other drivers (e.g. might this be handy for xen
too?) can be left for a separate patchset. Let's get one device merged
first.

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 19:42           ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 19:42 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexander Duyck, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
> >Yeah, this code essentially calls out the "shareable" code with a
> >comment at the start and end of the section what defines the
> >virtio_bypass functionality. It would just be a matter of mostly
> >cutting and pasting to put it into a separate driver module.
> 
> Please put it there and unite the use of it with netvsc.

Surely, adding this to other drivers (e.g. might this be handy for xen
too?) can be left for a separate patchset. Let's get one device merged
first.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 19:41         ` [virtio-dev] " Michael S. Tsirkin
@ 2018-03-02 19:52           ` Samudrala, Sridhar
  -1 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-02 19:52 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alexander Duyck
  Cc: Jiri Pirko, Stephen Hemminger, David Miller, Netdev, virtio-dev,
	Brandeburg, Jesse, Duyck, Alexander H, Jakub Kicinski



On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> The design limits things to a 1:1 relationship since we just have the
>> child and backup pointers, but I don't think I am seeing exception
>> handling to prevent us from overwriting the child pointers so there
>> may be a leak there.
>>
>> Thanks.
>>
>> - Alex
> In fact maintaining a list in that case would be nicer, and
> just use an arbitrary one.
> E.g. one can see how a user wanting to swap device 1 for device 2
> might first add device 2 with same MAC then drop device 1.

It should be possible to swap VF1 with VF2 by
- enabling virtio link
- unplugging VF1
- plugging VF2
- disabling virtio link

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 19:52           ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-02 19:52 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alexander Duyck
  Cc: Jiri Pirko, Stephen Hemminger, David Miller, Netdev, virtio-dev,
	Brandeburg, Jesse, Duyck, Alexander H, Jakub Kicinski



On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> The design limits things to a 1:1 relationship since we just have the
>> child and backup pointers, but I don't think I am seeing exception
>> handling to prevent us from overwriting the child pointers so there
>> may be a leak there.
>>
>> Thanks.
>>
>> - Alex
> In fact maintaining a list in that case would be nicer, and
> just use an arbitrary one.
> E.g. one can see how a user wanting to swap device 1 for device 2
> might first add device 2 with same MAC then drop device 1.

It should be possible to swap VF1 with VF2 by
- enabling virtio link
- unplugging VF1
- plugging VF2
- disabling virtio link



---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 19:52           ` [virtio-dev] " Samudrala, Sridhar
@ 2018-03-02 20:10             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 20:10 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Alexander Duyck, Jiri Pirko, Stephen Hemminger, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> 
> 
> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> > > The design limits things to a 1:1 relationship since we just have the
> > > child and backup pointers, but I don't think I am seeing exception
> > > handling to prevent us from overwriting the child pointers so there
> > > may be a leak there.
> > > 
> > > Thanks.
> > > 
> > > - Alex
> > In fact maintaining a list in that case would be nicer, and
> > just use an arbitrary one.
> > E.g. one can see how a user wanting to swap device 1 for device 2
> > might first add device 2 with same MAC then drop device 1.
> 
> It should be possible to swap VF1 with VF2 by
> 1.- enabling virtio link
> 2.- unplugging VF1
> 3.- plugging VF2
> 4.- disabling virtio link
> 

True, but it isn't hard to avoid breakage if user
swapped steps 2 and 3. No need to make it more
fragile that it has to be.

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 20:10             ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 20:10 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Alexander Duyck, Jiri Pirko, Stephen Hemminger, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> 
> 
> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> > > The design limits things to a 1:1 relationship since we just have the
> > > child and backup pointers, but I don't think I am seeing exception
> > > handling to prevent us from overwriting the child pointers so there
> > > may be a leak there.
> > > 
> > > Thanks.
> > > 
> > > - Alex
> > In fact maintaining a list in that case would be nicer, and
> > just use an arbitrary one.
> > E.g. one can see how a user wanting to swap device 1 for device 2
> > might first add device 2 with same MAC then drop device 1.
> 
> It should be possible to swap VF1 with VF2 by
> 1.- enabling virtio link
> 2.- unplugging VF1
> 3.- plugging VF2
> 4.- disabling virtio link
> 

True, but it isn't hard to avoid breakage if user
swapped steps 2 and 3. No need to make it more
fragile that it has to be.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 20:10             ` [virtio-dev] " Michael S. Tsirkin
@ 2018-03-02 20:44               ` Siwei Liu
  -1 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 20:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Alexander Duyck, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>>
>>
>> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> > > The design limits things to a 1:1 relationship since we just have the
>> > > child and backup pointers, but I don't think I am seeing exception
>> > > handling to prevent us from overwriting the child pointers so there
>> > > may be a leak there.
>> > >
>> > > Thanks.
>> > >
>> > > - Alex
>> > In fact maintaining a list in that case would be nicer, and
>> > just use an arbitrary one.
>> > E.g. one can see how a user wanting to swap device 1 for device 2
>> > might first add device 2 with same MAC then drop device 1.
>>
>> It should be possible to swap VF1 with VF2 by
>> 1.- enabling virtio link
>> 2.- unplugging VF1
>> 3.- plugging VF2
>> 4.- disabling virtio link
>>
>
> True, but it isn't hard to avoid breakage if user
> swapped steps 2 and 3. No need to make it more
> fragile that it has to be.

The migration case, VF2 is associated with another PF on another
machine (destination), I wonder how it is possible.

Even with local plugging of VF2 on the same PF, the MAC address
requirement (VF1's == VF2's) would fail the MAC address assignment on
VF2.

-Siwei

>
> --
> MST
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 20:44               ` Siwei Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 20:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Alexander Duyck, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>>
>>
>> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> > > The design limits things to a 1:1 relationship since we just have the
>> > > child and backup pointers, but I don't think I am seeing exception
>> > > handling to prevent us from overwriting the child pointers so there
>> > > may be a leak there.
>> > >
>> > > Thanks.
>> > >
>> > > - Alex
>> > In fact maintaining a list in that case would be nicer, and
>> > just use an arbitrary one.
>> > E.g. one can see how a user wanting to swap device 1 for device 2
>> > might first add device 2 with same MAC then drop device 1.
>>
>> It should be possible to swap VF1 with VF2 by
>> 1.- enabling virtio link
>> 2.- unplugging VF1
>> 3.- plugging VF2
>> 4.- disabling virtio link
>>
>
> True, but it isn't hard to avoid breakage if user
> swapped steps 2 and 3. No need to make it more
> fragile that it has to be.

The migration case, VF2 is associated with another PF on another
machine (destination), I wonder how it is possible.

Even with local plugging of VF2 on the same PF, the MAC address
requirement (VF1's == VF2's) would fail the MAC address assignment on
VF2.

-Siwei

>
> --
> MST
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 19:42           ` [virtio-dev] " Michael S. Tsirkin
@ 2018-03-02 20:49             ` Siwei Liu
  -1 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 20:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jiri Pirko, Alexander Duyck, Sridhar Samudrala,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 2, 2018 at 11:42 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>> >Yeah, this code essentially calls out the "shareable" code with a
>> >comment at the start and end of the section what defines the
>> >virtio_bypass functionality. It would just be a matter of mostly
>> >cutting and pasting to put it into a separate driver module.
>>
>> Please put it there and unite the use of it with netvsc.
>
> Surely, adding this to other drivers (e.g. might this be handy for xen
> too?) can be left for a separate patchset. Let's get one device merged
> first.
Agreed.

-Siwei

>
> --
> MST
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 20:49             ` Siwei Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 20:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jiri Pirko, Alexander Duyck, Sridhar Samudrala,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 2, 2018 at 11:42 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>> >Yeah, this code essentially calls out the "shareable" code with a
>> >comment at the start and end of the section what defines the
>> >virtio_bypass functionality. It would just be a matter of mostly
>> >cutting and pasting to put it into a separate driver module.
>>
>> Please put it there and unite the use of it with netvsc.
>
> Surely, adding this to other drivers (e.g. might this be handy for xen
> too?) can be left for a separate patchset. Let's get one device merged
> first.
Agreed.

-Siwei

>
> --
> MST
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 20:44               ` [virtio-dev] " Siwei Liu
@ 2018-03-02 20:56                 ` Samudrala, Sridhar
  -1 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-02 20:56 UTC (permalink / raw)
  To: Siwei Liu, Michael S. Tsirkin
  Cc: Alexander Duyck, Jiri Pirko, Stephen Hemminger, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski



On 3/2/2018 12:44 PM, Siwei Liu wrote:
> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>>>
>>> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>>>> On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>>>>> The design limits things to a 1:1 relationship since we just have the
>>>>> child and backup pointers, but I don't think I am seeing exception
>>>>> handling to prevent us from overwriting the child pointers so there
>>>>> may be a leak there.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> - Alex
>>>> In fact maintaining a list in that case would be nicer, and
>>>> just use an arbitrary one.
>>>> E.g. one can see how a user wanting to swap device 1 for device 2
>>>> might first add device 2 with same MAC then drop device 1.
>>> It should be possible to swap VF1 with VF2 by
>>> 1.- enabling virtio link
>>> 2.- unplugging VF1
>>> 3.- plugging VF2
>>> 4.- disabling virtio link
>>>
>> True, but it isn't hard to avoid breakage if user
>> swapped steps 2 and 3. No need to make it more
>> fragile that it has to be.
> The migration case, VF2 is associated with another PF on another
> machine (destination), I wonder how it is possible.
>
> Even with local plugging of VF2 on the same PF, the MAC address
> requirement (VF1's == VF2's) would fail the MAC address assignment on
> VF2.
>
>
I didn't include updating the MAC filter step in the above sequence.
So definitely plugging 2 VFs with the same MAC address will be an issue.

Here is the more complete sequence of steps that are required to
enable live migration.

Source Hypervisor
- Bring up the virtio link
- Hot Unplug VF from the VM
- Delete FDB entry for the MAC on the Bridge associated with virtio/tap
- Remove the MAC filter associated with the VF
- Migrate VM to destination

Destination Hypervisor (after migration is completed)
- Set MAC filter for the VF
- Hot Plug VF to the VM
- Bring down the virtio link

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 20:56                 ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-02 20:56 UTC (permalink / raw)
  To: Siwei Liu, Michael S. Tsirkin
  Cc: Alexander Duyck, Jiri Pirko, Stephen Hemminger, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski



On 3/2/2018 12:44 PM, Siwei Liu wrote:
> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>>>
>>> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>>>> On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>>>>> The design limits things to a 1:1 relationship since we just have the
>>>>> child and backup pointers, but I don't think I am seeing exception
>>>>> handling to prevent us from overwriting the child pointers so there
>>>>> may be a leak there.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> - Alex
>>>> In fact maintaining a list in that case would be nicer, and
>>>> just use an arbitrary one.
>>>> E.g. one can see how a user wanting to swap device 1 for device 2
>>>> might first add device 2 with same MAC then drop device 1.
>>> It should be possible to swap VF1 with VF2 by
>>> 1.- enabling virtio link
>>> 2.- unplugging VF1
>>> 3.- plugging VF2
>>> 4.- disabling virtio link
>>>
>> True, but it isn't hard to avoid breakage if user
>> swapped steps 2 and 3. No need to make it more
>> fragile that it has to be.
> The migration case, VF2 is associated with another PF on another
> machine (destination), I wonder how it is possible.
>
> Even with local plugging of VF2 on the same PF, the MAC address
> requirement (VF1's == VF2's) would fail the MAC address assignment on
> VF2.
>
>
I didn't include updating the MAC filter step in the above sequence.
So definitely plugging 2 VFs with the same MAC address will be an issue.

Here is the more complete sequence of steps that are required to
enable live migration.

Source Hypervisor
- Bring up the virtio link
- Hot Unplug VF from the VM
- Delete FDB entry for the MAC on the Bridge associated with virtio/tap
- Remove the MAC filter associated with the VF
- Migrate VM to destination

Destination Hypervisor (after migration is completed)
- Set MAC filter for the VF
- Hot Plug VF to the VM
- Bring down the virtio link

Thanks
Sridhar




---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-01 20:08   ` [virtio-dev] " Sridhar Samudrala
@ 2018-03-02 21:11     ` Siwei Liu
  -1 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 21:11 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
<sridhar.samudrala@intel.com> wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
>  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 682 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bcd13fe906ca..f2860d86c952 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
>  #include <linux/cpu.h>
>  #include <linux/average.h>
>  #include <linux/filter.h>
> +#include <linux/netdevice.h>
> +#include <linux/pci.h>
>  #include <net/route.h>
>  #include <net/xdp.h>
>
> @@ -206,6 +208,9 @@ struct virtnet_info {
>         u32 speed;
>
>         unsigned long guest_offloads;
> +
> +       /* upper netdev created when BACKUP feature enabled */
> +       struct net_device *bypass_netdev;
>  };
>
>  struct padded_vnet_hdr {
> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>         }
>  }
>
> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> +                                     size_t len)
> +{
> +       struct virtnet_info *vi = netdev_priv(dev);
> +       int ret;
> +
> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> +               return -EOPNOTSUPP;
> +
> +       ret = snprintf(buf, len, "_bkup");
> +       if (ret >= len)
> +               return -EOPNOTSUPP;
> +
> +       return 0;
> +}
> +

What if the systemd/udevd is not new enough to enforce the
n<phys_port_name> naming? Would virtio_bypass get a different name
than the original virtio_net? Should we detect this earlier and fall
back to legacy mode without creating the bypass netdev and ensalving
the VF?

>  static const struct net_device_ops virtnet_netdev = {
>         .ndo_open            = virtnet_open,
>         .ndo_stop            = virtnet_close,
> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>         .ndo_xdp_xmit           = virtnet_xdp_xmit,
>         .ndo_xdp_flush          = virtnet_xdp_flush,
>         .ndo_features_check     = passthru_features_check,
> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>  };
>
>  static void virtnet_config_changed_work(struct work_struct *work)
> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>         return 0;
>  }
>
> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
> + * is created that acts as a master device and tracks the state of the
> + * 2 lower netdevs. The original virtio_net netdev is registered as
> + * 'backup' netdev and a passthru device with the same MAC is registered
> + * as 'active' netdev.
> + */
> +
> +/* bypass state maintained when BACKUP feature is enabled */
> +struct virtnet_bypass_info {
> +       /* passthru netdev with same MAC */
> +       struct net_device __rcu *active_netdev;
> +
> +       /* virtio_net netdev */
> +       struct net_device __rcu *backup_netdev;
> +
> +       /* active netdev stats */
> +       struct rtnl_link_stats64 active_stats;
> +
> +       /* backup netdev stats */
> +       struct rtnl_link_stats64 backup_stats;
> +
> +       /* aggregated stats */
> +       struct rtnl_link_stats64 bypass_stats;
> +
> +       /* spinlock while updating stats */
> +       spinlock_t stats_lock;
> +};
> +
> +static void virtnet_bypass_child_open(struct net_device *dev,
> +                                     struct net_device *child_netdev)
> +{
> +       int err = dev_open(child_netdev);
> +
> +       if (err)
> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
> +                           child_netdev->name, err);
> +}
> +
> +static int virtnet_bypass_open(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       netif_carrier_off(dev);
> +       netif_tx_wake_all_queues(dev);
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_child_open(dev, child_netdev);
> +
> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_child_open(dev, child_netdev);
> +
> +       return 0;
> +}
> +
> +static int virtnet_bypass_close(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       netif_tx_disable(dev);
> +
> +       child_netdev = rtnl_dereference(vi->active_netdev);
> +       if (child_netdev)
> +               dev_close(child_netdev);
> +
> +       child_netdev = rtnl_dereference(vi->backup_netdev);
> +       if (child_netdev)
> +               dev_close(child_netdev);
> +
> +       return 0;
> +}
> +
> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
> +                                           struct net_device *dev)
> +{
> +       atomic_long_inc(&dev->tx_dropped);
> +       dev_kfree_skb_any(skb);
> +       return NETDEV_TX_OK;
> +}
> +
> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
> +{
> +       return netif_running(dev) && netif_carrier_ok(dev);
> +}
> +
> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
> +                                            struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *xmit_dev;
> +
> +       /* Try xmit via active netdev followed by backup netdev */
> +       xmit_dev = rcu_dereference_bh(vbi->active_netdev);
> +       if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
> +               xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
> +               if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
> +                       return virtnet_bypass_drop_xmit(skb, dev);
> +       }
> +
> +       skb->dev = xmit_dev;
> +       skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> +       return dev_queue_xmit(skb);
> +}
> +
> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
> +                                      struct sk_buff *skb, void *accel_priv,
> +                                      select_queue_fallback_t fallback)
> +{
> +       /* This helper function exists to help dev_pick_tx get the correct
> +        * destination queue.  Using a helper function skips a call to
> +        * skb_tx_hash and will put the skbs in the queue we expect on their
> +        * way down to the bonding driver.
> +        */
> +       u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
> +
> +       /* Save the original txq to restore before passing to the driver */
> +       qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
> +
> +       if (unlikely(txq >= dev->real_num_tx_queues)) {
> +               do {
> +                       txq -= dev->real_num_tx_queues;
> +               } while (txq >= dev->real_num_tx_queues);
> +       }
> +
> +       return txq;
> +}
> +
> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
> + * that some drivers can provide 32bit values only.
> + */
> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
> +                                     const struct rtnl_link_stats64 *_new,
> +                                     const struct rtnl_link_stats64 *_old)
> +{
> +       const u64 *new = (const u64 *)_new;
> +       const u64 *old = (const u64 *)_old;
> +       u64 *res = (u64 *)_res;
> +       int i;
> +
> +       for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
> +               u64 nv = new[i];
> +               u64 ov = old[i];
> +               s64 delta = nv - ov;
> +
> +               /* detects if this particular field is 32bit only */
> +               if (((nv | ov) >> 32) == 0)
> +                       delta = (s64)(s32)((u32)nv - (u32)ov);
> +
> +               /* filter anomalies, some drivers reset their stats
> +                * at down/up events.
> +                */
> +               if (delta > 0)
> +                       res[i] += delta;
> +       }
> +}
> +
> +static void virtnet_bypass_get_stats(struct net_device *dev,
> +                                    struct rtnl_link_stats64 *stats)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       const struct rtnl_link_stats64 *new;
> +       struct rtnl_link_stats64 temp;
> +       struct net_device *child_netdev;
> +
> +       spin_lock(&vbi->stats_lock);
> +       memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
> +
> +       rcu_read_lock();
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               new = dev_get_stats(child_netdev, &temp);
> +               virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
> +               memcpy(&vbi->active_stats, new, sizeof(*new));
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               new = dev_get_stats(child_netdev, &temp);
> +               virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
> +               memcpy(&vbi->backup_stats, new, sizeof(*new));
> +       }
> +
> +       rcu_read_unlock();
> +
> +       memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
> +       spin_unlock(&vbi->stats_lock);
> +}
> +
> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +       int ret = 0;
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               ret = dev_set_mtu(child_netdev, new_mtu);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               ret = dev_set_mtu(child_netdev, new_mtu);
> +               if (ret)
> +                       netdev_err(child_netdev,
> +                                  "Unexpected failure to set mtu to %d\n",
> +                                  new_mtu);

Shouldn't we unwind the MTU config on active_netdev if failing to set
it on backup_netdev?

> +       }
> +
> +       dev->mtu = new_mtu;
> +       return 0;
> +}
> +
> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       rcu_read_lock();
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               dev_uc_sync_multiple(child_netdev, dev);
> +               dev_mc_sync_multiple(child_netdev, dev);
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               dev_uc_sync_multiple(child_netdev, dev);
> +               dev_mc_sync_multiple(child_netdev, dev);
> +       }
> +

If VF comes up later than when set_rx_mode is called where do you sync
up the unicast and multicast address?

The rest looks good.

Thanks,
-Siwei


> +       rcu_read_unlock();
> +}
> +
> +static const struct net_device_ops virtnet_bypass_netdev_ops = {
> +       .ndo_open               = virtnet_bypass_open,
> +       .ndo_stop               = virtnet_bypass_close,
> +       .ndo_start_xmit         = virtnet_bypass_start_xmit,
> +       .ndo_select_queue       = virtnet_bypass_select_queue,
> +       .ndo_get_stats64        = virtnet_bypass_get_stats,
> +       .ndo_change_mtu         = virtnet_bypass_change_mtu,
> +       .ndo_set_rx_mode        = virtnet_bypass_set_rx_mode,
> +       .ndo_validate_addr      = eth_validate_addr,
> +       .ndo_features_check     = passthru_features_check,
> +};
> +
> +static int
> +virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
> +                                         struct ethtool_link_ksettings *cmd)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
> +               child_netdev = rtnl_dereference(vbi->backup_netdev);
> +               if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
> +                       cmd->base.duplex = DUPLEX_UNKNOWN;
> +                       cmd->base.port = PORT_OTHER;
> +                       cmd->base.speed = SPEED_UNKNOWN;
> +
> +                       return 0;
> +               }
> +       }
> +
> +       return __ethtool_get_link_ksettings(child_netdev, cmd);
> +}
> +
> +#define BYPASS_DRV_NAME "virtnet_bypass"
> +#define BYPASS_DRV_VERSION "0.1"
> +
> +static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
> +                                              struct ethtool_drvinfo *drvinfo)
> +{
> +       strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
> +       strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
> +}
> +
> +static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
> +       .get_drvinfo            = virtnet_bypass_ethtool_get_drvinfo,
> +       .get_link               = ethtool_op_get_link,
> +       .get_link_ksettings     = virtnet_bypass_ethtool_get_link_ksettings,
> +};
> +
> +static struct net_device *get_virtnet_bypass_bymac(struct net *net,
> +                                                  const u8 *mac)
> +{
> +       struct net_device *dev;
> +
> +       ASSERT_RTNL();
> +
> +       for_each_netdev(net, dev) {
> +               if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +                       continue;       /* not a virtnet_bypass device */
> +
> +               if (ether_addr_equal(mac, dev->perm_addr))
> +                       return dev;
> +       }
> +
> +       return NULL;
> +}
> +
> +static struct net_device *
> +get_virtnet_bypass_byref(struct net_device *child_netdev)
> +{
> +       struct net *net = dev_net(child_netdev);
> +       struct net_device *dev;
> +
> +       ASSERT_RTNL();
> +
> +       for_each_netdev(net, dev) {
> +               struct virtnet_bypass_info *vbi;
> +
> +               if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +                       continue;       /* not a virtnet_bypass device */
> +
> +               vbi = netdev_priv(dev);
> +
> +               if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
> +                   (rtnl_dereference(vbi->backup_netdev) == child_netdev))
> +                       return dev;     /* a match */
> +       }
> +
> +       return NULL;
> +}
> +
> +/* Called when child dev is injecting data into network stack.
> + * Change the associated network device from lower dev to virtio.
> + * note: already called with rcu_read_lock
> + */
> +static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
> +{
> +       struct sk_buff *skb = *pskb;
> +       struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
> +
> +       skb->dev = ndev;
> +
> +       return RX_HANDLER_ANOTHER;
> +}
> +
> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev;
> +       bool backup;
> +       int ret;
> +
> +       if (child_netdev->addr_len != ETH_ALEN)
> +               return NOTIFY_DONE;
> +
> +       /* We will use the MAC address to locate the virtnet_bypass netdev
> +        * to associate with the child netdev. If we don't find a matching
> +        * bypass netdev, move on.
> +        */
> +       dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
> +                                      child_netdev->perm_addr);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +       backup = (child_netdev->dev.parent == dev->dev.parent);
> +       if (backup ? rtnl_dereference(vbi->backup_netdev) :
> +                       rtnl_dereference(vbi->active_netdev)) {
> +               netdev_info(dev,
> +                           "%s attempting to join bypass dev when %s already present\n",
> +                           child_netdev->name, backup ? "backup" : "active");
> +               return NOTIFY_DONE;
> +       }
> +
> +       /* Avoid non pci devices as active netdev */
> +       if (!backup && (!child_netdev->dev.parent ||
> +                       !dev_is_pci(child_netdev->dev.parent)))
> +               return NOTIFY_DONE;
> +
> +       ret = netdev_rx_handler_register(child_netdev,
> +                                        virtnet_bypass_handle_frame, dev);
> +       if (ret != 0) {
> +               netdev_err(child_netdev,
> +                          "can not register bypass receive handler (err = %d)\n",
> +                          ret);
> +               goto rx_handler_failed;
> +       }
> +
> +       ret = netdev_upper_dev_link(child_netdev, dev, NULL);
> +       if (ret != 0) {
> +               netdev_err(child_netdev,
> +                          "can not set master device %s (err = %d)\n",
> +                          dev->name, ret);
> +               goto upper_link_failed;
> +       }
> +
> +       child_netdev->flags |= IFF_SLAVE;
> +
> +       if (netif_running(dev)) {
> +               ret = dev_open(child_netdev);
> +               if (ret && (ret != -EBUSY)) {
> +                       netdev_err(dev, "Opening child %s failed ret:%d\n",
> +                                  child_netdev->name, ret);
> +                       goto err_interface_up;
> +               }
> +       }
> +
> +       /* Align MTU of child with master */
> +       ret = dev_set_mtu(child_netdev, dev->mtu);
> +       if (ret) {
> +               netdev_err(dev,
> +                          "unable to change mtu of %s to %u register failed\n",
> +                          child_netdev->name, dev->mtu);
> +               goto err_set_mtu;
> +       }
> +
> +       call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
> +
> +       netdev_info(dev, "registering %s\n", child_netdev->name);
> +
> +       dev_hold(child_netdev);
> +       if (backup) {
> +               rcu_assign_pointer(vbi->backup_netdev, child_netdev);
> +               dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
> +       } else {
> +               rcu_assign_pointer(vbi->active_netdev, child_netdev);
> +               dev_get_stats(vbi->active_netdev, &vbi->active_stats);
> +               dev->min_mtu = child_netdev->min_mtu;
> +               dev->max_mtu = child_netdev->max_mtu;
> +       }
> +
> +       return NOTIFY_OK;
> +
> +err_set_mtu:
> +       dev_close(child_netdev);
> +err_interface_up:
> +       netdev_upper_dev_unlink(child_netdev, dev);
> +       child_netdev->flags &= ~IFF_SLAVE;
> +upper_link_failed:
> +       netdev_rx_handler_unregister(child_netdev);
> +rx_handler_failed:
> +       return NOTIFY_DONE;
> +}
> +
> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev, *backup;
> +
> +       dev = get_virtnet_bypass_byref(child_netdev);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +
> +       netdev_info(dev, "unregistering %s\n", child_netdev->name);
> +
> +       netdev_rx_handler_unregister(child_netdev);
> +       netdev_upper_dev_unlink(child_netdev, dev);
> +       child_netdev->flags &= ~IFF_SLAVE;
> +
> +       if (child_netdev->dev.parent == dev->dev.parent) {
> +               RCU_INIT_POINTER(vbi->backup_netdev, NULL);
> +       } else {
> +               RCU_INIT_POINTER(vbi->active_netdev, NULL);
> +               backup = rtnl_dereference(vbi->backup_netdev);
> +               if (backup) {
> +                       dev->min_mtu = backup->min_mtu;
> +                       dev->max_mtu = backup->max_mtu;
> +               }
> +       }
> +
> +       dev_put(child_netdev);
> +
> +       return NOTIFY_OK;
> +}
> +
> +static int virtnet_bypass_update_link(struct net_device *child_netdev)
> +{
> +       struct net_device *dev, *active, *backup;
> +       struct virtnet_bypass_info *vbi;
> +
> +       dev = get_virtnet_bypass_byref(child_netdev);
> +       if (!dev || !netif_running(dev))
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +
> +       active = rtnl_dereference(vbi->active_netdev);
> +       backup = rtnl_dereference(vbi->backup_netdev);
> +
> +       if ((active && virtnet_bypass_xmit_ready(active)) ||
> +           (backup && virtnet_bypass_xmit_ready(backup))) {
> +               netif_carrier_on(dev);
> +               netif_tx_wake_all_queues(dev);
> +       } else {
> +               netif_carrier_off(dev);
> +               netif_tx_stop_all_queues(dev);
> +       }
> +
> +       return NOTIFY_OK;
> +}
> +
> +static int virtnet_bypass_event(struct notifier_block *this,
> +                               unsigned long event, void *ptr)
> +{
> +       struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +       /* Skip our own events */
> +       if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
> +               return NOTIFY_DONE;
> +
> +       /* Avoid non-Ethernet type devices */
> +       if (event_dev->type != ARPHRD_ETHER)
> +               return NOTIFY_DONE;
> +
> +       /* Avoid Vlan dev with same MAC registering as child dev */
> +       if (is_vlan_dev(event_dev))
> +               return NOTIFY_DONE;
> +
> +       /* Avoid Bonding master dev with same MAC registering as child dev */
> +       if ((event_dev->priv_flags & IFF_BONDING) &&
> +           (event_dev->flags & IFF_MASTER))
> +               return NOTIFY_DONE;
> +
> +       switch (event) {
> +       case NETDEV_REGISTER:
> +               return virtnet_bypass_register_child(event_dev);
> +       case NETDEV_UNREGISTER:
> +               return virtnet_bypass_unregister_child(event_dev);
> +       case NETDEV_UP:
> +       case NETDEV_DOWN:
> +       case NETDEV_CHANGE:
> +               return virtnet_bypass_update_link(event_dev);
> +       default:
> +               return NOTIFY_DONE;
> +       }
> +}
> +
> +static struct notifier_block virtnet_bypass_notifier = {
> +       .notifier_call = virtnet_bypass_event,
> +};
> +
> +static int virtnet_bypass_create(struct virtnet_info *vi)
> +{
> +       struct net_device *backup_netdev = vi->dev;
> +       struct device *dev = &vi->vdev->dev;
> +       struct net_device *bypass_netdev;
> +       int res;
> +
> +       /* Alloc at least 2 queues, for now we are going with 16 assuming
> +        * that most devices being bonded won't have too many queues.
> +        */
> +       bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
> +                                         16);
> +       if (!bypass_netdev) {
> +               dev_err(dev, "Unable to allocate bypass_netdev!\n");
> +               return -ENOMEM;
> +       }
> +
> +       dev_net_set(bypass_netdev, dev_net(backup_netdev));
> +       SET_NETDEV_DEV(bypass_netdev, dev);
> +
> +       bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
> +       bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
> +
> +       /* Initialize the device options */
> +       bypass_netdev->flags |= IFF_MASTER;
> +       bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
> +                                    IFF_NO_QUEUE;
> +       bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
> +                                      IFF_TX_SKB_SHARING);
> +
> +       /* don't acquire bypass netdev's netif_tx_lock when transmitting */
> +       bypass_netdev->features |= NETIF_F_LLTX;
> +
> +       /* Don't allow bypass devices to change network namespaces. */
> +       bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
> +
> +       bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
> +                                    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
> +                                    NETIF_F_HIGHDMA | NETIF_F_LRO;
> +
> +       bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
> +       bypass_netdev->features |= bypass_netdev->hw_features;
> +
> +       /* For now treat bypass netdev as VLAN challenged since we
> +        * cannot assume VLAN functionality with a VF
> +        */
> +       bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
> +
> +       memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
> +              bypass_netdev->addr_len);
> +
> +       bypass_netdev->min_mtu = backup_netdev->min_mtu;
> +       bypass_netdev->max_mtu = backup_netdev->max_mtu;
> +
> +       res = register_netdev(bypass_netdev);
> +       if (res < 0) {
> +               dev_err(dev, "Unable to register bypass_netdev!\n");
> +               free_netdev(bypass_netdev);
> +               return res;
> +       }
> +
> +       netif_carrier_off(bypass_netdev);
> +
> +       vi->bypass_netdev = bypass_netdev;
> +
> +       return 0;
> +}
> +
> +static void virtnet_bypass_destroy(struct virtnet_info *vi)
> +{
> +       struct net_device *bypass_netdev = vi->bypass_netdev;
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *child_netdev;
> +
> +       /* no device found, nothing to free */
> +       if (!bypass_netdev)
> +               return;
> +
> +       vbi = netdev_priv(bypass_netdev);
> +
> +       netif_device_detach(bypass_netdev);
> +
> +       rtnl_lock();
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_unregister_child(child_netdev);
> +
> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_unregister_child(child_netdev);
> +
> +       unregister_netdevice(bypass_netdev);
> +
> +       rtnl_unlock();
> +
> +       free_netdev(bypass_netdev);
> +}
> +
> +/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>         int i, err = -ENOMEM;
> @@ -2797,10 +3466,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>
>         virtnet_init_settings(dev);
>
> +       if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
> +               if (virtnet_bypass_create(vi) != 0)
> +                       goto free_vqs;
> +       }
> +
>         err = register_netdev(dev);
>         if (err) {
>                 pr_debug("virtio_net: registering device failed\n");
> -               goto free_vqs;
> +               goto free_bypass;
>         }
>
>         virtio_device_ready(vdev);
> @@ -2837,6 +3511,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>         vi->vdev->config->reset(vdev);
>
>         unregister_netdev(dev);
> +free_bypass:
> +       virtnet_bypass_destroy(vi);
>  free_vqs:
>         cancel_delayed_work_sync(&vi->refill);
>         free_receive_page_frags(vi);
> @@ -2871,6 +3547,8 @@ static void virtnet_remove(struct virtio_device *vdev)
>
>         unregister_netdev(vi->dev);
>
> +       virtnet_bypass_destroy(vi);
> +
>         remove_vq_common(vi);
>
>         free_netdev(vi->dev);
> @@ -2968,6 +3646,8 @@ static __init int virtio_net_driver_init(void)
>          ret = register_virtio_driver(&virtio_net_driver);
>         if (ret)
>                 goto err_virtio;
> +
> +       register_netdevice_notifier(&virtnet_bypass_notifier);
>         return 0;
>  err_virtio:
>         cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> @@ -2980,6 +3660,7 @@ module_init(virtio_net_driver_init);
>
>  static __exit void virtio_net_driver_exit(void)
>  {
> +       unregister_netdevice_notifier(&virtnet_bypass_notifier);
>         unregister_virtio_driver(&virtio_net_driver);
>         cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
>         cpuhp_remove_multi_state(virtionet_online);
> --
> 2.14.3
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 21:11     ` Siwei Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 21:11 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
<sridhar.samudrala@intel.com> wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
>  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 682 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bcd13fe906ca..f2860d86c952 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
>  #include <linux/cpu.h>
>  #include <linux/average.h>
>  #include <linux/filter.h>
> +#include <linux/netdevice.h>
> +#include <linux/pci.h>
>  #include <net/route.h>
>  #include <net/xdp.h>
>
> @@ -206,6 +208,9 @@ struct virtnet_info {
>         u32 speed;
>
>         unsigned long guest_offloads;
> +
> +       /* upper netdev created when BACKUP feature enabled */
> +       struct net_device *bypass_netdev;
>  };
>
>  struct padded_vnet_hdr {
> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>         }
>  }
>
> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> +                                     size_t len)
> +{
> +       struct virtnet_info *vi = netdev_priv(dev);
> +       int ret;
> +
> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> +               return -EOPNOTSUPP;
> +
> +       ret = snprintf(buf, len, "_bkup");
> +       if (ret >= len)
> +               return -EOPNOTSUPP;
> +
> +       return 0;
> +}
> +

What if the systemd/udevd is not new enough to enforce the
n<phys_port_name> naming? Would virtio_bypass get a different name
than the original virtio_net? Should we detect this earlier and fall
back to legacy mode without creating the bypass netdev and ensalving
the VF?

>  static const struct net_device_ops virtnet_netdev = {
>         .ndo_open            = virtnet_open,
>         .ndo_stop            = virtnet_close,
> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>         .ndo_xdp_xmit           = virtnet_xdp_xmit,
>         .ndo_xdp_flush          = virtnet_xdp_flush,
>         .ndo_features_check     = passthru_features_check,
> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>  };
>
>  static void virtnet_config_changed_work(struct work_struct *work)
> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>         return 0;
>  }
>
> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
> + * is created that acts as a master device and tracks the state of the
> + * 2 lower netdevs. The original virtio_net netdev is registered as
> + * 'backup' netdev and a passthru device with the same MAC is registered
> + * as 'active' netdev.
> + */
> +
> +/* bypass state maintained when BACKUP feature is enabled */
> +struct virtnet_bypass_info {
> +       /* passthru netdev with same MAC */
> +       struct net_device __rcu *active_netdev;
> +
> +       /* virtio_net netdev */
> +       struct net_device __rcu *backup_netdev;
> +
> +       /* active netdev stats */
> +       struct rtnl_link_stats64 active_stats;
> +
> +       /* backup netdev stats */
> +       struct rtnl_link_stats64 backup_stats;
> +
> +       /* aggregated stats */
> +       struct rtnl_link_stats64 bypass_stats;
> +
> +       /* spinlock while updating stats */
> +       spinlock_t stats_lock;
> +};
> +
> +static void virtnet_bypass_child_open(struct net_device *dev,
> +                                     struct net_device *child_netdev)
> +{
> +       int err = dev_open(child_netdev);
> +
> +       if (err)
> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
> +                           child_netdev->name, err);
> +}
> +
> +static int virtnet_bypass_open(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       netif_carrier_off(dev);
> +       netif_tx_wake_all_queues(dev);
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_child_open(dev, child_netdev);
> +
> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_child_open(dev, child_netdev);
> +
> +       return 0;
> +}
> +
> +static int virtnet_bypass_close(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       netif_tx_disable(dev);
> +
> +       child_netdev = rtnl_dereference(vi->active_netdev);
> +       if (child_netdev)
> +               dev_close(child_netdev);
> +
> +       child_netdev = rtnl_dereference(vi->backup_netdev);
> +       if (child_netdev)
> +               dev_close(child_netdev);
> +
> +       return 0;
> +}
> +
> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
> +                                           struct net_device *dev)
> +{
> +       atomic_long_inc(&dev->tx_dropped);
> +       dev_kfree_skb_any(skb);
> +       return NETDEV_TX_OK;
> +}
> +
> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
> +{
> +       return netif_running(dev) && netif_carrier_ok(dev);
> +}
> +
> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
> +                                            struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *xmit_dev;
> +
> +       /* Try xmit via active netdev followed by backup netdev */
> +       xmit_dev = rcu_dereference_bh(vbi->active_netdev);
> +       if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
> +               xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
> +               if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
> +                       return virtnet_bypass_drop_xmit(skb, dev);
> +       }
> +
> +       skb->dev = xmit_dev;
> +       skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> +       return dev_queue_xmit(skb);
> +}
> +
> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
> +                                      struct sk_buff *skb, void *accel_priv,
> +                                      select_queue_fallback_t fallback)
> +{
> +       /* This helper function exists to help dev_pick_tx get the correct
> +        * destination queue.  Using a helper function skips a call to
> +        * skb_tx_hash and will put the skbs in the queue we expect on their
> +        * way down to the bonding driver.
> +        */
> +       u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
> +
> +       /* Save the original txq to restore before passing to the driver */
> +       qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
> +
> +       if (unlikely(txq >= dev->real_num_tx_queues)) {
> +               do {
> +                       txq -= dev->real_num_tx_queues;
> +               } while (txq >= dev->real_num_tx_queues);
> +       }
> +
> +       return txq;
> +}
> +
> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
> + * that some drivers can provide 32bit values only.
> + */
> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
> +                                     const struct rtnl_link_stats64 *_new,
> +                                     const struct rtnl_link_stats64 *_old)
> +{
> +       const u64 *new = (const u64 *)_new;
> +       const u64 *old = (const u64 *)_old;
> +       u64 *res = (u64 *)_res;
> +       int i;
> +
> +       for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
> +               u64 nv = new[i];
> +               u64 ov = old[i];
> +               s64 delta = nv - ov;
> +
> +               /* detects if this particular field is 32bit only */
> +               if (((nv | ov) >> 32) == 0)
> +                       delta = (s64)(s32)((u32)nv - (u32)ov);
> +
> +               /* filter anomalies, some drivers reset their stats
> +                * at down/up events.
> +                */
> +               if (delta > 0)
> +                       res[i] += delta;
> +       }
> +}
> +
> +static void virtnet_bypass_get_stats(struct net_device *dev,
> +                                    struct rtnl_link_stats64 *stats)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       const struct rtnl_link_stats64 *new;
> +       struct rtnl_link_stats64 temp;
> +       struct net_device *child_netdev;
> +
> +       spin_lock(&vbi->stats_lock);
> +       memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
> +
> +       rcu_read_lock();
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               new = dev_get_stats(child_netdev, &temp);
> +               virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
> +               memcpy(&vbi->active_stats, new, sizeof(*new));
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               new = dev_get_stats(child_netdev, &temp);
> +               virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
> +               memcpy(&vbi->backup_stats, new, sizeof(*new));
> +       }
> +
> +       rcu_read_unlock();
> +
> +       memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
> +       spin_unlock(&vbi->stats_lock);
> +}
> +
> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +       int ret = 0;
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               ret = dev_set_mtu(child_netdev, new_mtu);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               ret = dev_set_mtu(child_netdev, new_mtu);
> +               if (ret)
> +                       netdev_err(child_netdev,
> +                                  "Unexpected failure to set mtu to %d\n",
> +                                  new_mtu);

Shouldn't we unwind the MTU config on active_netdev if failing to set
it on backup_netdev?

> +       }
> +
> +       dev->mtu = new_mtu;
> +       return 0;
> +}
> +
> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       rcu_read_lock();
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               dev_uc_sync_multiple(child_netdev, dev);
> +               dev_mc_sync_multiple(child_netdev, dev);
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               dev_uc_sync_multiple(child_netdev, dev);
> +               dev_mc_sync_multiple(child_netdev, dev);
> +       }
> +

If VF comes up later than when set_rx_mode is called where do you sync
up the unicast and multicast address?

The rest looks good.

Thanks,
-Siwei


> +       rcu_read_unlock();
> +}
> +
> +static const struct net_device_ops virtnet_bypass_netdev_ops = {
> +       .ndo_open               = virtnet_bypass_open,
> +       .ndo_stop               = virtnet_bypass_close,
> +       .ndo_start_xmit         = virtnet_bypass_start_xmit,
> +       .ndo_select_queue       = virtnet_bypass_select_queue,
> +       .ndo_get_stats64        = virtnet_bypass_get_stats,
> +       .ndo_change_mtu         = virtnet_bypass_change_mtu,
> +       .ndo_set_rx_mode        = virtnet_bypass_set_rx_mode,
> +       .ndo_validate_addr      = eth_validate_addr,
> +       .ndo_features_check     = passthru_features_check,
> +};
> +
> +static int
> +virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
> +                                         struct ethtool_link_ksettings *cmd)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
> +               child_netdev = rtnl_dereference(vbi->backup_netdev);
> +               if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
> +                       cmd->base.duplex = DUPLEX_UNKNOWN;
> +                       cmd->base.port = PORT_OTHER;
> +                       cmd->base.speed = SPEED_UNKNOWN;
> +
> +                       return 0;
> +               }
> +       }
> +
> +       return __ethtool_get_link_ksettings(child_netdev, cmd);
> +}
> +
> +#define BYPASS_DRV_NAME "virtnet_bypass"
> +#define BYPASS_DRV_VERSION "0.1"
> +
> +static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
> +                                              struct ethtool_drvinfo *drvinfo)
> +{
> +       strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
> +       strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
> +}
> +
> +static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
> +       .get_drvinfo            = virtnet_bypass_ethtool_get_drvinfo,
> +       .get_link               = ethtool_op_get_link,
> +       .get_link_ksettings     = virtnet_bypass_ethtool_get_link_ksettings,
> +};
> +
> +static struct net_device *get_virtnet_bypass_bymac(struct net *net,
> +                                                  const u8 *mac)
> +{
> +       struct net_device *dev;
> +
> +       ASSERT_RTNL();
> +
> +       for_each_netdev(net, dev) {
> +               if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +                       continue;       /* not a virtnet_bypass device */
> +
> +               if (ether_addr_equal(mac, dev->perm_addr))
> +                       return dev;
> +       }
> +
> +       return NULL;
> +}
> +
> +static struct net_device *
> +get_virtnet_bypass_byref(struct net_device *child_netdev)
> +{
> +       struct net *net = dev_net(child_netdev);
> +       struct net_device *dev;
> +
> +       ASSERT_RTNL();
> +
> +       for_each_netdev(net, dev) {
> +               struct virtnet_bypass_info *vbi;
> +
> +               if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +                       continue;       /* not a virtnet_bypass device */
> +
> +               vbi = netdev_priv(dev);
> +
> +               if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
> +                   (rtnl_dereference(vbi->backup_netdev) == child_netdev))
> +                       return dev;     /* a match */
> +       }
> +
> +       return NULL;
> +}
> +
> +/* Called when child dev is injecting data into network stack.
> + * Change the associated network device from lower dev to virtio.
> + * note: already called with rcu_read_lock
> + */
> +static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
> +{
> +       struct sk_buff *skb = *pskb;
> +       struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
> +
> +       skb->dev = ndev;
> +
> +       return RX_HANDLER_ANOTHER;
> +}
> +
> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev;
> +       bool backup;
> +       int ret;
> +
> +       if (child_netdev->addr_len != ETH_ALEN)
> +               return NOTIFY_DONE;
> +
> +       /* We will use the MAC address to locate the virtnet_bypass netdev
> +        * to associate with the child netdev. If we don't find a matching
> +        * bypass netdev, move on.
> +        */
> +       dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
> +                                      child_netdev->perm_addr);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +       backup = (child_netdev->dev.parent == dev->dev.parent);
> +       if (backup ? rtnl_dereference(vbi->backup_netdev) :
> +                       rtnl_dereference(vbi->active_netdev)) {
> +               netdev_info(dev,
> +                           "%s attempting to join bypass dev when %s already present\n",
> +                           child_netdev->name, backup ? "backup" : "active");
> +               return NOTIFY_DONE;
> +       }
> +
> +       /* Avoid non pci devices as active netdev */
> +       if (!backup && (!child_netdev->dev.parent ||
> +                       !dev_is_pci(child_netdev->dev.parent)))
> +               return NOTIFY_DONE;
> +
> +       ret = netdev_rx_handler_register(child_netdev,
> +                                        virtnet_bypass_handle_frame, dev);
> +       if (ret != 0) {
> +               netdev_err(child_netdev,
> +                          "can not register bypass receive handler (err = %d)\n",
> +                          ret);
> +               goto rx_handler_failed;
> +       }
> +
> +       ret = netdev_upper_dev_link(child_netdev, dev, NULL);
> +       if (ret != 0) {
> +               netdev_err(child_netdev,
> +                          "can not set master device %s (err = %d)\n",
> +                          dev->name, ret);
> +               goto upper_link_failed;
> +       }
> +
> +       child_netdev->flags |= IFF_SLAVE;
> +
> +       if (netif_running(dev)) {
> +               ret = dev_open(child_netdev);
> +               if (ret && (ret != -EBUSY)) {
> +                       netdev_err(dev, "Opening child %s failed ret:%d\n",
> +                                  child_netdev->name, ret);
> +                       goto err_interface_up;
> +               }
> +       }
> +
> +       /* Align MTU of child with master */
> +       ret = dev_set_mtu(child_netdev, dev->mtu);
> +       if (ret) {
> +               netdev_err(dev,
> +                          "unable to change mtu of %s to %u register failed\n",
> +                          child_netdev->name, dev->mtu);
> +               goto err_set_mtu;
> +       }
> +
> +       call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
> +
> +       netdev_info(dev, "registering %s\n", child_netdev->name);
> +
> +       dev_hold(child_netdev);
> +       if (backup) {
> +               rcu_assign_pointer(vbi->backup_netdev, child_netdev);
> +               dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
> +       } else {
> +               rcu_assign_pointer(vbi->active_netdev, child_netdev);
> +               dev_get_stats(vbi->active_netdev, &vbi->active_stats);
> +               dev->min_mtu = child_netdev->min_mtu;
> +               dev->max_mtu = child_netdev->max_mtu;
> +       }
> +
> +       return NOTIFY_OK;
> +
> +err_set_mtu:
> +       dev_close(child_netdev);
> +err_interface_up:
> +       netdev_upper_dev_unlink(child_netdev, dev);
> +       child_netdev->flags &= ~IFF_SLAVE;
> +upper_link_failed:
> +       netdev_rx_handler_unregister(child_netdev);
> +rx_handler_failed:
> +       return NOTIFY_DONE;
> +}
> +
> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev, *backup;
> +
> +       dev = get_virtnet_bypass_byref(child_netdev);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +
> +       netdev_info(dev, "unregistering %s\n", child_netdev->name);
> +
> +       netdev_rx_handler_unregister(child_netdev);
> +       netdev_upper_dev_unlink(child_netdev, dev);
> +       child_netdev->flags &= ~IFF_SLAVE;
> +
> +       if (child_netdev->dev.parent == dev->dev.parent) {
> +               RCU_INIT_POINTER(vbi->backup_netdev, NULL);
> +       } else {
> +               RCU_INIT_POINTER(vbi->active_netdev, NULL);
> +               backup = rtnl_dereference(vbi->backup_netdev);
> +               if (backup) {
> +                       dev->min_mtu = backup->min_mtu;
> +                       dev->max_mtu = backup->max_mtu;
> +               }
> +       }
> +
> +       dev_put(child_netdev);
> +
> +       return NOTIFY_OK;
> +}
> +
> +static int virtnet_bypass_update_link(struct net_device *child_netdev)
> +{
> +       struct net_device *dev, *active, *backup;
> +       struct virtnet_bypass_info *vbi;
> +
> +       dev = get_virtnet_bypass_byref(child_netdev);
> +       if (!dev || !netif_running(dev))
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +
> +       active = rtnl_dereference(vbi->active_netdev);
> +       backup = rtnl_dereference(vbi->backup_netdev);
> +
> +       if ((active && virtnet_bypass_xmit_ready(active)) ||
> +           (backup && virtnet_bypass_xmit_ready(backup))) {
> +               netif_carrier_on(dev);
> +               netif_tx_wake_all_queues(dev);
> +       } else {
> +               netif_carrier_off(dev);
> +               netif_tx_stop_all_queues(dev);
> +       }
> +
> +       return NOTIFY_OK;
> +}
> +
> +static int virtnet_bypass_event(struct notifier_block *this,
> +                               unsigned long event, void *ptr)
> +{
> +       struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +       /* Skip our own events */
> +       if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
> +               return NOTIFY_DONE;
> +
> +       /* Avoid non-Ethernet type devices */
> +       if (event_dev->type != ARPHRD_ETHER)
> +               return NOTIFY_DONE;
> +
> +       /* Avoid Vlan dev with same MAC registering as child dev */
> +       if (is_vlan_dev(event_dev))
> +               return NOTIFY_DONE;
> +
> +       /* Avoid Bonding master dev with same MAC registering as child dev */
> +       if ((event_dev->priv_flags & IFF_BONDING) &&
> +           (event_dev->flags & IFF_MASTER))
> +               return NOTIFY_DONE;
> +
> +       switch (event) {
> +       case NETDEV_REGISTER:
> +               return virtnet_bypass_register_child(event_dev);
> +       case NETDEV_UNREGISTER:
> +               return virtnet_bypass_unregister_child(event_dev);
> +       case NETDEV_UP:
> +       case NETDEV_DOWN:
> +       case NETDEV_CHANGE:
> +               return virtnet_bypass_update_link(event_dev);
> +       default:
> +               return NOTIFY_DONE;
> +       }
> +}
> +
> +static struct notifier_block virtnet_bypass_notifier = {
> +       .notifier_call = virtnet_bypass_event,
> +};
> +
> +static int virtnet_bypass_create(struct virtnet_info *vi)
> +{
> +       struct net_device *backup_netdev = vi->dev;
> +       struct device *dev = &vi->vdev->dev;
> +       struct net_device *bypass_netdev;
> +       int res;
> +
> +       /* Alloc at least 2 queues, for now we are going with 16 assuming
> +        * that most devices being bonded won't have too many queues.
> +        */
> +       bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
> +                                         16);
> +       if (!bypass_netdev) {
> +               dev_err(dev, "Unable to allocate bypass_netdev!\n");
> +               return -ENOMEM;
> +       }
> +
> +       dev_net_set(bypass_netdev, dev_net(backup_netdev));
> +       SET_NETDEV_DEV(bypass_netdev, dev);
> +
> +       bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
> +       bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
> +
> +       /* Initialize the device options */
> +       bypass_netdev->flags |= IFF_MASTER;
> +       bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
> +                                    IFF_NO_QUEUE;
> +       bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
> +                                      IFF_TX_SKB_SHARING);
> +
> +       /* don't acquire bypass netdev's netif_tx_lock when transmitting */
> +       bypass_netdev->features |= NETIF_F_LLTX;
> +
> +       /* Don't allow bypass devices to change network namespaces. */
> +       bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
> +
> +       bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
> +                                    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
> +                                    NETIF_F_HIGHDMA | NETIF_F_LRO;
> +
> +       bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
> +       bypass_netdev->features |= bypass_netdev->hw_features;
> +
> +       /* For now treat bypass netdev as VLAN challenged since we
> +        * cannot assume VLAN functionality with a VF
> +        */
> +       bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
> +
> +       memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
> +              bypass_netdev->addr_len);
> +
> +       bypass_netdev->min_mtu = backup_netdev->min_mtu;
> +       bypass_netdev->max_mtu = backup_netdev->max_mtu;
> +
> +       res = register_netdev(bypass_netdev);
> +       if (res < 0) {
> +               dev_err(dev, "Unable to register bypass_netdev!\n");
> +               free_netdev(bypass_netdev);
> +               return res;
> +       }
> +
> +       netif_carrier_off(bypass_netdev);
> +
> +       vi->bypass_netdev = bypass_netdev;
> +
> +       return 0;
> +}
> +
> +static void virtnet_bypass_destroy(struct virtnet_info *vi)
> +{
> +       struct net_device *bypass_netdev = vi->bypass_netdev;
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *child_netdev;
> +
> +       /* no device found, nothing to free */
> +       if (!bypass_netdev)
> +               return;
> +
> +       vbi = netdev_priv(bypass_netdev);
> +
> +       netif_device_detach(bypass_netdev);
> +
> +       rtnl_lock();
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_unregister_child(child_netdev);
> +
> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_unregister_child(child_netdev);
> +
> +       unregister_netdevice(bypass_netdev);
> +
> +       rtnl_unlock();
> +
> +       free_netdev(bypass_netdev);
> +}
> +
> +/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>         int i, err = -ENOMEM;
> @@ -2797,10 +3466,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>
>         virtnet_init_settings(dev);
>
> +       if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
> +               if (virtnet_bypass_create(vi) != 0)
> +                       goto free_vqs;
> +       }
> +
>         err = register_netdev(dev);
>         if (err) {
>                 pr_debug("virtio_net: registering device failed\n");
> -               goto free_vqs;
> +               goto free_bypass;
>         }
>
>         virtio_device_ready(vdev);
> @@ -2837,6 +3511,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>         vi->vdev->config->reset(vdev);
>
>         unregister_netdev(dev);
> +free_bypass:
> +       virtnet_bypass_destroy(vi);
>  free_vqs:
>         cancel_delayed_work_sync(&vi->refill);
>         free_receive_page_frags(vi);
> @@ -2871,6 +3547,8 @@ static void virtnet_remove(struct virtio_device *vdev)
>
>         unregister_netdev(vi->dev);
>
> +       virtnet_bypass_destroy(vi);
> +
>         remove_vq_common(vi);
>
>         free_netdev(vi->dev);
> @@ -2968,6 +3646,8 @@ static __init int virtio_net_driver_init(void)
>          ret = register_virtio_driver(&virtio_net_driver);
>         if (ret)
>                 goto err_virtio;
> +
> +       register_netdevice_notifier(&virtnet_bypass_notifier);
>         return 0;
>  err_virtio:
>         cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> @@ -2980,6 +3660,7 @@ module_init(virtio_net_driver_init);
>
>  static __exit void virtio_net_driver_exit(void)
>  {
> +       unregister_netdevice_notifier(&virtnet_bypass_notifier);
>         unregister_virtio_driver(&virtio_net_driver);
>         cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
>         cpuhp_remove_multi_state(virtionet_online);
> --
> 2.14.3
>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 20:44               ` [virtio-dev] " Siwei Liu
@ 2018-03-02 21:31                 ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 21:31 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Samudrala, Sridhar, Alexander Duyck, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote:
> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> >>
> >>
> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> >> > > The design limits things to a 1:1 relationship since we just have the
> >> > > child and backup pointers, but I don't think I am seeing exception
> >> > > handling to prevent us from overwriting the child pointers so there
> >> > > may be a leak there.
> >> > >
> >> > > Thanks.
> >> > >
> >> > > - Alex
> >> > In fact maintaining a list in that case would be nicer, and
> >> > just use an arbitrary one.
> >> > E.g. one can see how a user wanting to swap device 1 for device 2
> >> > might first add device 2 with same MAC then drop device 1.
> >>
> >> It should be possible to swap VF1 with VF2 by
> >> 1.- enabling virtio link
> >> 2.- unplugging VF1
> >> 3.- plugging VF2
> >> 4.- disabling virtio link
> >>
> >
> > True, but it isn't hard to avoid breakage if user
> > swapped steps 2 and 3. No need to make it more
> > fragile that it has to be.
> 
> The migration case, VF2 is associated with another PF on another
> machine (destination), I wonder how it is possible.

E.g. you want to remove the PF so you unplug the VF
then add another VF of the same PF.

> Even with local plugging of VF2 on the same PF, the MAC address
> requirement (VF1's == VF2's) would fail the MAC address assignment on
> VF2.
> 
> -Siwei

Why would it fail? These are separate cards.

> >
> > --
> > MST
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 21:31                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 21:31 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Samudrala, Sridhar, Alexander Duyck, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote:
> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> >>
> >>
> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> >> > > The design limits things to a 1:1 relationship since we just have the
> >> > > child and backup pointers, but I don't think I am seeing exception
> >> > > handling to prevent us from overwriting the child pointers so there
> >> > > may be a leak there.
> >> > >
> >> > > Thanks.
> >> > >
> >> > > - Alex
> >> > In fact maintaining a list in that case would be nicer, and
> >> > just use an arbitrary one.
> >> > E.g. one can see how a user wanting to swap device 1 for device 2
> >> > might first add device 2 with same MAC then drop device 1.
> >>
> >> It should be possible to swap VF1 with VF2 by
> >> 1.- enabling virtio link
> >> 2.- unplugging VF1
> >> 3.- plugging VF2
> >> 4.- disabling virtio link
> >>
> >
> > True, but it isn't hard to avoid breakage if user
> > swapped steps 2 and 3. No need to make it more
> > fragile that it has to be.
> 
> The migration case, VF2 is associated with another PF on another
> machine (destination), I wonder how it is possible.

E.g. you want to remove the PF so you unplug the VF
then add another VF of the same PF.

> Even with local plugging of VF2 on the same PF, the MAC address
> requirement (VF1's == VF2's) would fail the MAC address assignment on
> VF2.
> 
> -Siwei

Why would it fail? These are separate cards.

> >
> > --
> > MST
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 20:56                 ` [virtio-dev] " Samudrala, Sridhar
@ 2018-03-02 21:33                   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 21:33 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Siwei Liu, Alexander Duyck, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Fri, Mar 02, 2018 at 12:56:21PM -0800, Samudrala, Sridhar wrote:
> 
> 
> On 3/2/2018 12:44 PM, Siwei Liu wrote:
> > On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> > > > 
> > > > On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> > > > > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> > > > > > The design limits things to a 1:1 relationship since we just have the
> > > > > > child and backup pointers, but I don't think I am seeing exception
> > > > > > handling to prevent us from overwriting the child pointers so there
> > > > > > may be a leak there.
> > > > > > 
> > > > > > Thanks.
> > > > > > 
> > > > > > - Alex
> > > > > In fact maintaining a list in that case would be nicer, and
> > > > > just use an arbitrary one.
> > > > > E.g. one can see how a user wanting to swap device 1 for device 2
> > > > > might first add device 2 with same MAC then drop device 1.
> > > > It should be possible to swap VF1 with VF2 by
> > > > 1.- enabling virtio link
> > > > 2.- unplugging VF1
> > > > 3.- plugging VF2
> > > > 4.- disabling virtio link
> > > > 
> > > True, but it isn't hard to avoid breakage if user
> > > swapped steps 2 and 3. No need to make it more
> > > fragile that it has to be.
> > The migration case, VF2 is associated with another PF on another
> > machine (destination), I wonder how it is possible.
> > 
> > Even with local plugging of VF2 on the same PF, the MAC address
> > requirement (VF1's == VF2's) would fail the MAC address assignment on
> > VF2.
> > 
> > 
> I didn't include updating the MAC filter step in the above sequence.
> So definitely plugging 2 VFs with the same MAC address will be an issue.

If these are two separate PFs then I don't see why -
each has its own MAC filter.

> Here is the more complete sequence of steps that are required to
> enable live migration.

Replacing VF1 by VF2 is not about migration. It's to remove PF
from host e.g. for service.

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 21:33                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 21:33 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Siwei Liu, Alexander Duyck, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Fri, Mar 02, 2018 at 12:56:21PM -0800, Samudrala, Sridhar wrote:
> 
> 
> On 3/2/2018 12:44 PM, Siwei Liu wrote:
> > On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> > > > 
> > > > On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> > > > > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> > > > > > The design limits things to a 1:1 relationship since we just have the
> > > > > > child and backup pointers, but I don't think I am seeing exception
> > > > > > handling to prevent us from overwriting the child pointers so there
> > > > > > may be a leak there.
> > > > > > 
> > > > > > Thanks.
> > > > > > 
> > > > > > - Alex
> > > > > In fact maintaining a list in that case would be nicer, and
> > > > > just use an arbitrary one.
> > > > > E.g. one can see how a user wanting to swap device 1 for device 2
> > > > > might first add device 2 with same MAC then drop device 1.
> > > > It should be possible to swap VF1 with VF2 by
> > > > 1.- enabling virtio link
> > > > 2.- unplugging VF1
> > > > 3.- plugging VF2
> > > > 4.- disabling virtio link
> > > > 
> > > True, but it isn't hard to avoid breakage if user
> > > swapped steps 2 and 3. No need to make it more
> > > fragile that it has to be.
> > The migration case, VF2 is associated with another PF on another
> > machine (destination), I wonder how it is possible.
> > 
> > Even with local plugging of VF2 on the same PF, the MAC address
> > requirement (VF1's == VF2's) would fail the MAC address assignment on
> > VF2.
> > 
> > 
> I didn't include updating the MAC filter step in the above sequence.
> So definitely plugging 2 VFs with the same MAC address will be an issue.

If these are two separate PFs then I don't see why -
each has its own MAC filter.

> Here is the more complete sequence of steps that are required to
> enable live migration.

Replacing VF1 by VF2 is not about migration. It's to remove PF
from host e.g. for service.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 21:11     ` [virtio-dev] " Siwei Liu
@ 2018-03-02 21:36       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 21:36 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Sridhar Samudrala, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
> > This patch enables virtio_net to switch over to a VF datapath when a VF
> > netdev is present with the same MAC address. It allows live migration
> > of a VM with a direct attached VF without the need to setup a bond/team
> > between a VF and virtio net device in the guest.
> >
> > The hypervisor needs to enable only one datapath at any time so that
> > packets don't get looped back to the VM over the other datapath. When a VF
> > is plugged, the virtio datapath link state can be marked as down. The
> > hypervisor needs to unplug the VF device from the guest on the source host
> > and reset the MAC filter of the VF to initiate failover of datapath to
> > virtio before starting the migration. After the migration is completed,
> > the destination hypervisor sets the MAC filter on the VF and plugs it back
> > to the guest to switch over to VF datapath.
> >
> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> > created that acts as a master device and tracks the state of the 2 lower
> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> > passthru device with the same MAC is registered as 'active' netdev.
> >
> > This patch is based on the discussion initiated by Jesse on this thread.
> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> >
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> > ---
> >  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 682 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index bcd13fe906ca..f2860d86c952 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -30,6 +30,8 @@
> >  #include <linux/cpu.h>
> >  #include <linux/average.h>
> >  #include <linux/filter.h>
> > +#include <linux/netdevice.h>
> > +#include <linux/pci.h>
> >  #include <net/route.h>
> >  #include <net/xdp.h>
> >
> > @@ -206,6 +208,9 @@ struct virtnet_info {
> >         u32 speed;
> >
> >         unsigned long guest_offloads;
> > +
> > +       /* upper netdev created when BACKUP feature enabled */
> > +       struct net_device *bypass_netdev;
> >  };
> >
> >  struct padded_vnet_hdr {
> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >         }
> >  }
> >
> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> > +                                     size_t len)
> > +{
> > +       struct virtnet_info *vi = netdev_priv(dev);
> > +       int ret;
> > +
> > +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> > +               return -EOPNOTSUPP;
> > +
> > +       ret = snprintf(buf, len, "_bkup");
> > +       if (ret >= len)
> > +               return -EOPNOTSUPP;
> > +
> > +       return 0;
> > +}
> > +
> 
> What if the systemd/udevd is not new enough to enforce the
> n<phys_port_name> naming? Would virtio_bypass get a different name
> than the original virtio_net?

You mean people using ethX names? Any hardware config change breaks
these, I don't think that can be helped.

> Should we detect this earlier and fall
> back to legacy mode without creating the bypass netdev and ensalving
> the VF?

I don't think we can do this with existing kernel/userspace APIs.

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 21:36       ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-02 21:36 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Sridhar Samudrala, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
> > This patch enables virtio_net to switch over to a VF datapath when a VF
> > netdev is present with the same MAC address. It allows live migration
> > of a VM with a direct attached VF without the need to setup a bond/team
> > between a VF and virtio net device in the guest.
> >
> > The hypervisor needs to enable only one datapath at any time so that
> > packets don't get looped back to the VM over the other datapath. When a VF
> > is plugged, the virtio datapath link state can be marked as down. The
> > hypervisor needs to unplug the VF device from the guest on the source host
> > and reset the MAC filter of the VF to initiate failover of datapath to
> > virtio before starting the migration. After the migration is completed,
> > the destination hypervisor sets the MAC filter on the VF and plugs it back
> > to the guest to switch over to VF datapath.
> >
> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> > created that acts as a master device and tracks the state of the 2 lower
> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> > passthru device with the same MAC is registered as 'active' netdev.
> >
> > This patch is based on the discussion initiated by Jesse on this thread.
> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> >
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> > ---
> >  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 682 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index bcd13fe906ca..f2860d86c952 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -30,6 +30,8 @@
> >  #include <linux/cpu.h>
> >  #include <linux/average.h>
> >  #include <linux/filter.h>
> > +#include <linux/netdevice.h>
> > +#include <linux/pci.h>
> >  #include <net/route.h>
> >  #include <net/xdp.h>
> >
> > @@ -206,6 +208,9 @@ struct virtnet_info {
> >         u32 speed;
> >
> >         unsigned long guest_offloads;
> > +
> > +       /* upper netdev created when BACKUP feature enabled */
> > +       struct net_device *bypass_netdev;
> >  };
> >
> >  struct padded_vnet_hdr {
> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >         }
> >  }
> >
> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> > +                                     size_t len)
> > +{
> > +       struct virtnet_info *vi = netdev_priv(dev);
> > +       int ret;
> > +
> > +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> > +               return -EOPNOTSUPP;
> > +
> > +       ret = snprintf(buf, len, "_bkup");
> > +       if (ret >= len)
> > +               return -EOPNOTSUPP;
> > +
> > +       return 0;
> > +}
> > +
> 
> What if the systemd/udevd is not new enough to enforce the
> n<phys_port_name> naming? Would virtio_bypass get a different name
> than the original virtio_net?

You mean people using ethX names? Any hardware config change breaks
these, I don't think that can be helped.

> Should we detect this earlier and fall
> back to legacy mode without creating the bypass netdev and ensalving
> the VF?

I don't think we can do this with existing kernel/userspace APIs.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 21:31                 ` [virtio-dev] " Michael S. Tsirkin
@ 2018-03-02 22:26                   ` Siwei Liu
  -1 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 22:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Alexander Duyck, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 2, 2018 at 1:31 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote:
>> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>> >>
>> >>
>> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> >> > > The design limits things to a 1:1 relationship since we just have the
>> >> > > child and backup pointers, but I don't think I am seeing exception
>> >> > > handling to prevent us from overwriting the child pointers so there
>> >> > > may be a leak there.
>> >> > >
>> >> > > Thanks.
>> >> > >
>> >> > > - Alex
>> >> > In fact maintaining a list in that case would be nicer, and
>> >> > just use an arbitrary one.
>> >> > E.g. one can see how a user wanting to swap device 1 for device 2
>> >> > might first add device 2 with same MAC then drop device 1.
>> >>
>> >> It should be possible to swap VF1 with VF2 by
>> >> 1.- enabling virtio link
>> >> 2.- unplugging VF1
>> >> 3.- plugging VF2
>> >> 4.- disabling virtio link
>> >>
>> >
>> > True, but it isn't hard to avoid breakage if user
>> > swapped steps 2 and 3. No need to make it more
>> > fragile that it has to be.
>>
>> The migration case, VF2 is associated with another PF on another
>> machine (destination), I wonder how it is possible.
>
> E.g. you want to remove the PF so you unplug the VF
> then add another VF of the same PF.
>
>> Even with local plugging of VF2 on the same PF, the MAC address
>> requirement (VF1's == VF2's) would fail the MAC address assignment on
>> VF2.
>>
>> -Siwei
>
> Why would it fail? These are separate cards.

OK. I realized that you may talk about assigning a VF on a diffferent
PF (VF1 on PF1 while VF2 on PF2). And we might assign a pass-through
device rather than a VF. Yes, it's indeed possible that may happen but
I take it as a further step down (another patch maybe) as it would
involve changes to notify the network with gratuituious ARP and/or
unsolicited ND advertisement of the MAC address association with the
new port.

-Siwei

>
>> >
>> > --
>> > MST
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>> >

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 22:26                   ` Siwei Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 22:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Alexander Duyck, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 2, 2018 at 1:31 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote:
>> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>> >>
>> >>
>> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> >> > > The design limits things to a 1:1 relationship since we just have the
>> >> > > child and backup pointers, but I don't think I am seeing exception
>> >> > > handling to prevent us from overwriting the child pointers so there
>> >> > > may be a leak there.
>> >> > >
>> >> > > Thanks.
>> >> > >
>> >> > > - Alex
>> >> > In fact maintaining a list in that case would be nicer, and
>> >> > just use an arbitrary one.
>> >> > E.g. one can see how a user wanting to swap device 1 for device 2
>> >> > might first add device 2 with same MAC then drop device 1.
>> >>
>> >> It should be possible to swap VF1 with VF2 by
>> >> 1.- enabling virtio link
>> >> 2.- unplugging VF1
>> >> 3.- plugging VF2
>> >> 4.- disabling virtio link
>> >>
>> >
>> > True, but it isn't hard to avoid breakage if user
>> > swapped steps 2 and 3. No need to make it more
>> > fragile that it has to be.
>>
>> The migration case, VF2 is associated with another PF on another
>> machine (destination), I wonder how it is possible.
>
> E.g. you want to remove the PF so you unplug the VF
> then add another VF of the same PF.
>
>> Even with local plugging of VF2 on the same PF, the MAC address
>> requirement (VF1's == VF2's) would fail the MAC address assignment on
>> VF2.
>>
>> -Siwei
>
> Why would it fail? These are separate cards.

OK. I realized that you may talk about assigning a VF on a diffferent
PF (VF1 on PF1 while VF2 on PF2). And we might assign a pass-through
device rather than a VF. Yes, it's indeed possible that may happen but
I take it as a further step down (another patch maybe) as it would
involve changes to notify the network with gratuituious ARP and/or
unsolicited ND advertisement of the MAC address association with the
new port.

-Siwei

>
>> >
>> > --
>> > MST
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>> >

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 21:11     ` [virtio-dev] " Siwei Liu
@ 2018-03-02 23:12       ` Samudrala, Sridhar
  -1 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-02 23:12 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On 3/2/2018 1:11 PM, Siwei Liu wrote:
> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> ---
>>   drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index bcd13fe906ca..f2860d86c952 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -30,6 +30,8 @@
>>   #include <linux/cpu.h>
>>   #include <linux/average.h>
>>   #include <linux/filter.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/pci.h>
>>   #include <net/route.h>
>>   #include <net/xdp.h>
>>
>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>          u32 speed;
>>
>>          unsigned long guest_offloads;
>> +
>> +       /* upper netdev created when BACKUP feature enabled */
>> +       struct net_device *bypass_netdev;
>>   };
>>
>>   struct padded_vnet_hdr {
>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>>          }
>>   }
>>
>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> +                                     size_t len)
>> +{
>> +       struct virtnet_info *vi = netdev_priv(dev);
>> +       int ret;
>> +
>> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> +               return -EOPNOTSUPP;
>> +
>> +       ret = snprintf(buf, len, "_bkup");
>> +       if (ret >= len)
>> +               return -EOPNOTSUPP;
>> +
>> +       return 0;
>> +}
>> +
> What if the systemd/udevd is not new enough to enforce the
> n<phys_port_name> naming? Would virtio_bypass get a different name
> than the original virtio_net? Should we detect this earlier and fall
> back to legacy mode without creating the bypass netdev and ensalving
> the VF?

If udev doesn't support renaming of the devices,  then the upper bypass device
should get the original name and the lower virtio netdev will get the next name.
Hopefully the distros updating the kernel will also move to the new systemd/udev.


>
>>   static const struct net_device_ops virtnet_netdev = {
>>          .ndo_open            = virtnet_open,
>>          .ndo_stop            = virtnet_close,
>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>>          .ndo_xdp_xmit           = virtnet_xdp_xmit,
>>          .ndo_xdp_flush          = virtnet_xdp_flush,
>>          .ndo_features_check     = passthru_features_check,
>> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>   };
>>
>>   static void virtnet_config_changed_work(struct work_struct *work)
>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>>          return 0;
>>   }
>>
>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>> + * is created that acts as a master device and tracks the state of the
>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>> + * 'backup' netdev and a passthru device with the same MAC is registered
>> + * as 'active' netdev.
>> + */
>> +
>> +/* bypass state maintained when BACKUP feature is enabled */
>> +struct virtnet_bypass_info {
>> +       /* passthru netdev with same MAC */
>> +       struct net_device __rcu *active_netdev;
>> +
>> +       /* virtio_net netdev */
>> +       struct net_device __rcu *backup_netdev;
>> +
>> +       /* active netdev stats */
>> +       struct rtnl_link_stats64 active_stats;
>> +
>> +       /* backup netdev stats */
>> +       struct rtnl_link_stats64 backup_stats;
>> +
>> +       /* aggregated stats */
>> +       struct rtnl_link_stats64 bypass_stats;
>> +
>> +       /* spinlock while updating stats */
>> +       spinlock_t stats_lock;
>> +};
>> +
>> +static void virtnet_bypass_child_open(struct net_device *dev,
>> +                                     struct net_device *child_netdev)
>> +{
>> +       int err = dev_open(child_netdev);
>> +
>> +       if (err)
>> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
>> +                           child_netdev->name, err);
>> +}
>> +
>> +static int virtnet_bypass_open(struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +
>> +       netif_carrier_off(dev);
>> +       netif_tx_wake_all_queues(dev);
>> +
>> +       child_netdev = rtnl_dereference(vbi->active_netdev);
>> +       if (child_netdev)
>> +               virtnet_bypass_child_open(dev, child_netdev);
>> +
>> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
>> +       if (child_netdev)
>> +               virtnet_bypass_child_open(dev, child_netdev);
>> +
>> +       return 0;
>> +}
>> +
>> +static int virtnet_bypass_close(struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +
>> +       netif_tx_disable(dev);
>> +
>> +       child_netdev = rtnl_dereference(vi->active_netdev);
>> +       if (child_netdev)
>> +               dev_close(child_netdev);
>> +
>> +       child_netdev = rtnl_dereference(vi->backup_netdev);
>> +       if (child_netdev)
>> +               dev_close(child_netdev);
>> +
>> +       return 0;
>> +}
>> +
>> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
>> +                                           struct net_device *dev)
>> +{
>> +       atomic_long_inc(&dev->tx_dropped);
>> +       dev_kfree_skb_any(skb);
>> +       return NETDEV_TX_OK;
>> +}
>> +
>> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
>> +{
>> +       return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
>> +                                            struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *xmit_dev;
>> +
>> +       /* Try xmit via active netdev followed by backup netdev */
>> +       xmit_dev = rcu_dereference_bh(vbi->active_netdev);
>> +       if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
>> +               xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
>> +               if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
>> +                       return virtnet_bypass_drop_xmit(skb, dev);
>> +       }
>> +
>> +       skb->dev = xmit_dev;
>> +       skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> +       return dev_queue_xmit(skb);
>> +}
>> +
>> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
>> +                                      struct sk_buff *skb, void *accel_priv,
>> +                                      select_queue_fallback_t fallback)
>> +{
>> +       /* This helper function exists to help dev_pick_tx get the correct
>> +        * destination queue.  Using a helper function skips a call to
>> +        * skb_tx_hash and will put the skbs in the queue we expect on their
>> +        * way down to the bonding driver.
>> +        */
>> +       u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> +       /* Save the original txq to restore before passing to the driver */
>> +       qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +       if (unlikely(txq >= dev->real_num_tx_queues)) {
>> +               do {
>> +                       txq -= dev->real_num_tx_queues;
>> +               } while (txq >= dev->real_num_tx_queues);
>> +       }
>> +
>> +       return txq;
>> +}
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32bit values only.
>> + */
>> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
>> +                                     const struct rtnl_link_stats64 *_new,
>> +                                     const struct rtnl_link_stats64 *_old)
>> +{
>> +       const u64 *new = (const u64 *)_new;
>> +       const u64 *old = (const u64 *)_old;
>> +       u64 *res = (u64 *)_res;
>> +       int i;
>> +
>> +       for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> +               u64 nv = new[i];
>> +               u64 ov = old[i];
>> +               s64 delta = nv - ov;
>> +
>> +               /* detects if this particular field is 32bit only */
>> +               if (((nv | ov) >> 32) == 0)
>> +                       delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> +               /* filter anomalies, some drivers reset their stats
>> +                * at down/up events.
>> +                */
>> +               if (delta > 0)
>> +                       res[i] += delta;
>> +       }
>> +}
>> +
>> +static void virtnet_bypass_get_stats(struct net_device *dev,
>> +                                    struct rtnl_link_stats64 *stats)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       const struct rtnl_link_stats64 *new;
>> +       struct rtnl_link_stats64 temp;
>> +       struct net_device *child_netdev;
>> +
>> +       spin_lock(&vbi->stats_lock);
>> +       memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
>> +
>> +       rcu_read_lock();
>> +
>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>> +       if (child_netdev) {
>> +               new = dev_get_stats(child_netdev, &temp);
>> +               virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
>> +               memcpy(&vbi->active_stats, new, sizeof(*new));
>> +       }
>> +
>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>> +       if (child_netdev) {
>> +               new = dev_get_stats(child_netdev, &temp);
>> +               virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
>> +               memcpy(&vbi->backup_stats, new, sizeof(*new));
>> +       }
>> +
>> +       rcu_read_unlock();
>> +
>> +       memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
>> +       spin_unlock(&vbi->stats_lock);
>> +}
>> +
>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +       int ret = 0;
>> +
>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>> +       if (child_netdev) {
>> +               ret = dev_set_mtu(child_netdev, new_mtu);
>> +               if (ret)
>> +                       return ret;
>> +       }
>> +
>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>> +       if (child_netdev) {
>> +               ret = dev_set_mtu(child_netdev, new_mtu);
>> +               if (ret)
>> +                       netdev_err(child_netdev,
>> +                                  "Unexpected failure to set mtu to %d\n",
>> +                                  new_mtu);
> Shouldn't we unwind the MTU config on active_netdev if failing to set
> it on backup_netdev?

backup_netdev is virtio_net and  it is not expected to fail as it supports
MTU all the way upto 64K.


>
>> +       }
>> +
>> +       dev->mtu = new_mtu;
>> +       return 0;
>> +}
>> +
>> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +
>> +       rcu_read_lock();
>> +
>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>> +       if (child_netdev) {
>> +               dev_uc_sync_multiple(child_netdev, dev);
>> +               dev_mc_sync_multiple(child_netdev, dev);
>> +       }
>> +
>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>> +       if (child_netdev) {
>> +               dev_uc_sync_multiple(child_netdev, dev);
>> +               dev_mc_sync_multiple(child_netdev, dev);
>> +       }
>> +
> If VF comes up later than when set_rx_mode is called where do you sync
> up the unicast and multicast address?

I could include sync up of unicast and multicast list along with
MTU when the child is registered.


>
> The rest looks good.
>
> Thanks,
> -Siwei
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 23:12       ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-02 23:12 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On 3/2/2018 1:11 PM, Siwei Liu wrote:
> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> ---
>>   drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index bcd13fe906ca..f2860d86c952 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -30,6 +30,8 @@
>>   #include <linux/cpu.h>
>>   #include <linux/average.h>
>>   #include <linux/filter.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/pci.h>
>>   #include <net/route.h>
>>   #include <net/xdp.h>
>>
>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>          u32 speed;
>>
>>          unsigned long guest_offloads;
>> +
>> +       /* upper netdev created when BACKUP feature enabled */
>> +       struct net_device *bypass_netdev;
>>   };
>>
>>   struct padded_vnet_hdr {
>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>>          }
>>   }
>>
>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> +                                     size_t len)
>> +{
>> +       struct virtnet_info *vi = netdev_priv(dev);
>> +       int ret;
>> +
>> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> +               return -EOPNOTSUPP;
>> +
>> +       ret = snprintf(buf, len, "_bkup");
>> +       if (ret >= len)
>> +               return -EOPNOTSUPP;
>> +
>> +       return 0;
>> +}
>> +
> What if the systemd/udevd is not new enough to enforce the
> n<phys_port_name> naming? Would virtio_bypass get a different name
> than the original virtio_net? Should we detect this earlier and fall
> back to legacy mode without creating the bypass netdev and ensalving
> the VF?

If udev doesn't support renaming of the devices,  then the upper bypass device
should get the original name and the lower virtio netdev will get the next name.
Hopefully the distros updating the kernel will also move to the new systemd/udev.


>
>>   static const struct net_device_ops virtnet_netdev = {
>>          .ndo_open            = virtnet_open,
>>          .ndo_stop            = virtnet_close,
>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>>          .ndo_xdp_xmit           = virtnet_xdp_xmit,
>>          .ndo_xdp_flush          = virtnet_xdp_flush,
>>          .ndo_features_check     = passthru_features_check,
>> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>   };
>>
>>   static void virtnet_config_changed_work(struct work_struct *work)
>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>>          return 0;
>>   }
>>
>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>> + * is created that acts as a master device and tracks the state of the
>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>> + * 'backup' netdev and a passthru device with the same MAC is registered
>> + * as 'active' netdev.
>> + */
>> +
>> +/* bypass state maintained when BACKUP feature is enabled */
>> +struct virtnet_bypass_info {
>> +       /* passthru netdev with same MAC */
>> +       struct net_device __rcu *active_netdev;
>> +
>> +       /* virtio_net netdev */
>> +       struct net_device __rcu *backup_netdev;
>> +
>> +       /* active netdev stats */
>> +       struct rtnl_link_stats64 active_stats;
>> +
>> +       /* backup netdev stats */
>> +       struct rtnl_link_stats64 backup_stats;
>> +
>> +       /* aggregated stats */
>> +       struct rtnl_link_stats64 bypass_stats;
>> +
>> +       /* spinlock while updating stats */
>> +       spinlock_t stats_lock;
>> +};
>> +
>> +static void virtnet_bypass_child_open(struct net_device *dev,
>> +                                     struct net_device *child_netdev)
>> +{
>> +       int err = dev_open(child_netdev);
>> +
>> +       if (err)
>> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
>> +                           child_netdev->name, err);
>> +}
>> +
>> +static int virtnet_bypass_open(struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +
>> +       netif_carrier_off(dev);
>> +       netif_tx_wake_all_queues(dev);
>> +
>> +       child_netdev = rtnl_dereference(vbi->active_netdev);
>> +       if (child_netdev)
>> +               virtnet_bypass_child_open(dev, child_netdev);
>> +
>> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
>> +       if (child_netdev)
>> +               virtnet_bypass_child_open(dev, child_netdev);
>> +
>> +       return 0;
>> +}
>> +
>> +static int virtnet_bypass_close(struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +
>> +       netif_tx_disable(dev);
>> +
>> +       child_netdev = rtnl_dereference(vi->active_netdev);
>> +       if (child_netdev)
>> +               dev_close(child_netdev);
>> +
>> +       child_netdev = rtnl_dereference(vi->backup_netdev);
>> +       if (child_netdev)
>> +               dev_close(child_netdev);
>> +
>> +       return 0;
>> +}
>> +
>> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
>> +                                           struct net_device *dev)
>> +{
>> +       atomic_long_inc(&dev->tx_dropped);
>> +       dev_kfree_skb_any(skb);
>> +       return NETDEV_TX_OK;
>> +}
>> +
>> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
>> +{
>> +       return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
>> +                                            struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *xmit_dev;
>> +
>> +       /* Try xmit via active netdev followed by backup netdev */
>> +       xmit_dev = rcu_dereference_bh(vbi->active_netdev);
>> +       if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
>> +               xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
>> +               if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
>> +                       return virtnet_bypass_drop_xmit(skb, dev);
>> +       }
>> +
>> +       skb->dev = xmit_dev;
>> +       skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> +       return dev_queue_xmit(skb);
>> +}
>> +
>> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
>> +                                      struct sk_buff *skb, void *accel_priv,
>> +                                      select_queue_fallback_t fallback)
>> +{
>> +       /* This helper function exists to help dev_pick_tx get the correct
>> +        * destination queue.  Using a helper function skips a call to
>> +        * skb_tx_hash and will put the skbs in the queue we expect on their
>> +        * way down to the bonding driver.
>> +        */
>> +       u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> +       /* Save the original txq to restore before passing to the driver */
>> +       qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +       if (unlikely(txq >= dev->real_num_tx_queues)) {
>> +               do {
>> +                       txq -= dev->real_num_tx_queues;
>> +               } while (txq >= dev->real_num_tx_queues);
>> +       }
>> +
>> +       return txq;
>> +}
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32bit values only.
>> + */
>> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
>> +                                     const struct rtnl_link_stats64 *_new,
>> +                                     const struct rtnl_link_stats64 *_old)
>> +{
>> +       const u64 *new = (const u64 *)_new;
>> +       const u64 *old = (const u64 *)_old;
>> +       u64 *res = (u64 *)_res;
>> +       int i;
>> +
>> +       for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> +               u64 nv = new[i];
>> +               u64 ov = old[i];
>> +               s64 delta = nv - ov;
>> +
>> +               /* detects if this particular field is 32bit only */
>> +               if (((nv | ov) >> 32) == 0)
>> +                       delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> +               /* filter anomalies, some drivers reset their stats
>> +                * at down/up events.
>> +                */
>> +               if (delta > 0)
>> +                       res[i] += delta;
>> +       }
>> +}
>> +
>> +static void virtnet_bypass_get_stats(struct net_device *dev,
>> +                                    struct rtnl_link_stats64 *stats)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       const struct rtnl_link_stats64 *new;
>> +       struct rtnl_link_stats64 temp;
>> +       struct net_device *child_netdev;
>> +
>> +       spin_lock(&vbi->stats_lock);
>> +       memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
>> +
>> +       rcu_read_lock();
>> +
>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>> +       if (child_netdev) {
>> +               new = dev_get_stats(child_netdev, &temp);
>> +               virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
>> +               memcpy(&vbi->active_stats, new, sizeof(*new));
>> +       }
>> +
>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>> +       if (child_netdev) {
>> +               new = dev_get_stats(child_netdev, &temp);
>> +               virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
>> +               memcpy(&vbi->backup_stats, new, sizeof(*new));
>> +       }
>> +
>> +       rcu_read_unlock();
>> +
>> +       memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
>> +       spin_unlock(&vbi->stats_lock);
>> +}
>> +
>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +       int ret = 0;
>> +
>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>> +       if (child_netdev) {
>> +               ret = dev_set_mtu(child_netdev, new_mtu);
>> +               if (ret)
>> +                       return ret;
>> +       }
>> +
>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>> +       if (child_netdev) {
>> +               ret = dev_set_mtu(child_netdev, new_mtu);
>> +               if (ret)
>> +                       netdev_err(child_netdev,
>> +                                  "Unexpected failure to set mtu to %d\n",
>> +                                  new_mtu);
> Shouldn't we unwind the MTU config on active_netdev if failing to set
> it on backup_netdev?

backup_netdev is virtio_net and  it is not expected to fail as it supports
MTU all the way upto 64K.


>
>> +       }
>> +
>> +       dev->mtu = new_mtu;
>> +       return 0;
>> +}
>> +
>> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +
>> +       rcu_read_lock();
>> +
>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>> +       if (child_netdev) {
>> +               dev_uc_sync_multiple(child_netdev, dev);
>> +               dev_mc_sync_multiple(child_netdev, dev);
>> +       }
>> +
>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>> +       if (child_netdev) {
>> +               dev_uc_sync_multiple(child_netdev, dev);
>> +               dev_mc_sync_multiple(child_netdev, dev);
>> +       }
>> +
> If VF comes up later than when set_rx_mode is called where do you sync
> up the unicast and multicast address?

I could include sync up of unicast and multicast list along with
MTU when the child is registered.


>
> The rest looks good.
>
> Thanks,
> -Siwei
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 21:36       ` [virtio-dev] " Michael S. Tsirkin
@ 2018-03-02 23:56         ` Siwei Liu
  -1 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 23:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudrala@intel.com> wrote:
>> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > netdev is present with the same MAC address. It allows live migration
>> > of a VM with a direct attached VF without the need to setup a bond/team
>> > between a VF and virtio net device in the guest.
>> >
>> > The hypervisor needs to enable only one datapath at any time so that
>> > packets don't get looped back to the VM over the other datapath. When a VF
>> > is plugged, the virtio datapath link state can be marked as down. The
>> > hypervisor needs to unplug the VF device from the guest on the source host
>> > and reset the MAC filter of the VF to initiate failover of datapath to
>> > virtio before starting the migration. After the migration is completed,
>> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> > to the guest to switch over to VF datapath.
>> >
>> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> > created that acts as a master device and tracks the state of the 2 lower
>> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> > passthru device with the same MAC is registered as 'active' netdev.
>> >
>> > This patch is based on the discussion initiated by Jesse on this thread.
>> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> > ---
>> >  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>> >  1 file changed, 682 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> > index bcd13fe906ca..f2860d86c952 100644
>> > --- a/drivers/net/virtio_net.c
>> > +++ b/drivers/net/virtio_net.c
>> > @@ -30,6 +30,8 @@
>> >  #include <linux/cpu.h>
>> >  #include <linux/average.h>
>> >  #include <linux/filter.h>
>> > +#include <linux/netdevice.h>
>> > +#include <linux/pci.h>
>> >  #include <net/route.h>
>> >  #include <net/xdp.h>
>> >
>> > @@ -206,6 +208,9 @@ struct virtnet_info {
>> >         u32 speed;
>> >
>> >         unsigned long guest_offloads;
>> > +
>> > +       /* upper netdev created when BACKUP feature enabled */
>> > +       struct net_device *bypass_netdev;
>> >  };
>> >
>> >  struct padded_vnet_hdr {
>> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>> >         }
>> >  }
>> >
>> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> > +                                     size_t len)
>> > +{
>> > +       struct virtnet_info *vi = netdev_priv(dev);
>> > +       int ret;
>> > +
>> > +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> > +               return -EOPNOTSUPP;
>> > +
>> > +       ret = snprintf(buf, len, "_bkup");
>> > +       if (ret >= len)
>> > +               return -EOPNOTSUPP;
>> > +
>> > +       return 0;
>> > +}
>> > +
>>
>> What if the systemd/udevd is not new enough to enforce the
>> n<phys_port_name> naming? Would virtio_bypass get a different name
>> than the original virtio_net?
>
> You mean people using ethX names? Any hardware config change breaks
> these, I don't think that can be helped.

I don't like the way to rely on .ndo_get_phys_port_name - it's fragile
and it does not completely solve the problem it tries to address.
Imagine what can end up with if getting an old udevd, or users already
have exsiting explicit udev rules around phys_port_name. It does not
give you the an ack in saying "yes, I know you're the bypass and
you're the backup, please continue and I will give you both correct
names", or an unacknowlegment saying "no, I don't know what these
extra interfaces are, please go back and leave the VF device alone".
We need new udev API for both feature negotiation and naming, or may
even completely hide the lower interfaces.

>
>> Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and ensalving
>> the VF?
>
> I don't think we can do this with existing kernel/userspace APIs.

That's why I ever said to make udev aware of this new type of combined
device instead of doing hacks here and there around.

Regards,
-Siwei

>
> --
> MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-02 23:56         ` Siwei Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-02 23:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudrala@intel.com> wrote:
>> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > netdev is present with the same MAC address. It allows live migration
>> > of a VM with a direct attached VF without the need to setup a bond/team
>> > between a VF and virtio net device in the guest.
>> >
>> > The hypervisor needs to enable only one datapath at any time so that
>> > packets don't get looped back to the VM over the other datapath. When a VF
>> > is plugged, the virtio datapath link state can be marked as down. The
>> > hypervisor needs to unplug the VF device from the guest on the source host
>> > and reset the MAC filter of the VF to initiate failover of datapath to
>> > virtio before starting the migration. After the migration is completed,
>> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> > to the guest to switch over to VF datapath.
>> >
>> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> > created that acts as a master device and tracks the state of the 2 lower
>> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> > passthru device with the same MAC is registered as 'active' netdev.
>> >
>> > This patch is based on the discussion initiated by Jesse on this thread.
>> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> > ---
>> >  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>> >  1 file changed, 682 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> > index bcd13fe906ca..f2860d86c952 100644
>> > --- a/drivers/net/virtio_net.c
>> > +++ b/drivers/net/virtio_net.c
>> > @@ -30,6 +30,8 @@
>> >  #include <linux/cpu.h>
>> >  #include <linux/average.h>
>> >  #include <linux/filter.h>
>> > +#include <linux/netdevice.h>
>> > +#include <linux/pci.h>
>> >  #include <net/route.h>
>> >  #include <net/xdp.h>
>> >
>> > @@ -206,6 +208,9 @@ struct virtnet_info {
>> >         u32 speed;
>> >
>> >         unsigned long guest_offloads;
>> > +
>> > +       /* upper netdev created when BACKUP feature enabled */
>> > +       struct net_device *bypass_netdev;
>> >  };
>> >
>> >  struct padded_vnet_hdr {
>> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>> >         }
>> >  }
>> >
>> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> > +                                     size_t len)
>> > +{
>> > +       struct virtnet_info *vi = netdev_priv(dev);
>> > +       int ret;
>> > +
>> > +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> > +               return -EOPNOTSUPP;
>> > +
>> > +       ret = snprintf(buf, len, "_bkup");
>> > +       if (ret >= len)
>> > +               return -EOPNOTSUPP;
>> > +
>> > +       return 0;
>> > +}
>> > +
>>
>> What if the systemd/udevd is not new enough to enforce the
>> n<phys_port_name> naming? Would virtio_bypass get a different name
>> than the original virtio_net?
>
> You mean people using ethX names? Any hardware config change breaks
> these, I don't think that can be helped.

I don't like the way to rely on .ndo_get_phys_port_name - it's fragile
and it does not completely solve the problem it tries to address.
Imagine what can end up with if getting an old udevd, or users already
have exsiting explicit udev rules around phys_port_name. It does not
give you the an ack in saying "yes, I know you're the bypass and
you're the backup, please continue and I will give you both correct
names", or an unacknowlegment saying "no, I don't know what these
extra interfaces are, please go back and leave the VF device alone".
We need new udev API for both feature negotiation and naming, or may
even completely hide the lower interfaces.

>
>> Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and ensalving
>> the VF?
>
> I don't think we can do this with existing kernel/userspace APIs.

That's why I ever said to make udev aware of this new type of combined
device instead of doing hacks here and there around.

Regards,
-Siwei

>
> --
> MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 23:12       ` [virtio-dev] " Samudrala, Sridhar
@ 2018-03-03  0:09         ` Siwei Liu
  -1 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-03  0:09 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Fri, Mar 2, 2018 at 3:12 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 3/2/2018 1:11 PM, Siwei Liu wrote:
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudrala@intel.com> wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF
>>> is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host
>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>> virtio before starting the migration. After the migration is completed,
>>> the destination hypervisor sets the MAC filter on the VF and plugs it
>>> back
>>> to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and
>>> a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>> ---
>>>   drivers/net/virtio_net.c | 683
>>> ++++++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>>   #include <linux/cpu.h>
>>>   #include <linux/average.h>
>>>   #include <linux/filter.h>
>>> +#include <linux/netdevice.h>
>>> +#include <linux/pci.h>
>>>   #include <net/route.h>
>>>   #include <net/xdp.h>
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>>          u32 speed;
>>>
>>>          unsigned long guest_offloads;
>>> +
>>> +       /* upper netdev created when BACKUP feature enabled */
>>> +       struct net_device *bypass_netdev;
>>>   };
>>>
>>>   struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev,
>>> struct netdev_bpf *xdp)
>>>          }
>>>   }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> +                                     size_t len)
>>> +{
>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>> +       int ret;
>>> +
>>> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       ret = snprintf(buf, len, "_bkup");
>>> +       if (ret >= len)
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       return 0;
>>> +}
>>> +
>>
>> What if the systemd/udevd is not new enough to enforce the
>> n<phys_port_name> naming? Would virtio_bypass get a different name
>> than the original virtio_net? Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and ensalving
>> the VF?
>
>
> If udev doesn't support renaming of the devices,  then the upper bypass
> device
> should get the original name and the lower virtio netdev will get the next
> name.

If you got two virtio-net's (say e.g. eth0 and eth1) before the
update, the virtio-bypass interface on the first virtio gets eth0 and
the backup netdev would get eth1? Then the IP address originally on
eth1 gets configurd on the backup interface?

> Hopefully the distros updating the kernel will also move to the new
> systemd/udev.

This is not reliable. I'd opt for a new udev API for this, and fall
back to what it was if don't see fit.

>
>
>
>>
>>>   static const struct net_device_ops virtnet_netdev = {
>>>          .ndo_open            = virtnet_open,
>>>          .ndo_stop            = virtnet_close,
>>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev =
>>> {
>>>          .ndo_xdp_xmit           = virtnet_xdp_xmit,
>>>          .ndo_xdp_flush          = virtnet_xdp_flush,
>>>          .ndo_features_check     = passthru_features_check,
>>> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>>   };
>>>
>>>   static void virtnet_config_changed_work(struct work_struct *work)
>>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device
>>> *vdev)
>>>          return 0;
>>>   }
>>>
>>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>>> + * is created that acts as a master device and tracks the state of the
>>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>>> + * 'backup' netdev and a passthru device with the same MAC is registered
>>> + * as 'active' netdev.
>>> + */
>>> +
>>> +/* bypass state maintained when BACKUP feature is enabled */
>>> +struct virtnet_bypass_info {
>>> +       /* passthru netdev with same MAC */
>>> +       struct net_device __rcu *active_netdev;
>>> +
>>> +       /* virtio_net netdev */
>>> +       struct net_device __rcu *backup_netdev;
>>> +
>>> +       /* active netdev stats */
>>> +       struct rtnl_link_stats64 active_stats;
>>> +
>>> +       /* backup netdev stats */
>>> +       struct rtnl_link_stats64 backup_stats;
>>> +
>>> +       /* aggregated stats */
>>> +       struct rtnl_link_stats64 bypass_stats;
>>> +
>>> +       /* spinlock while updating stats */
>>> +       spinlock_t stats_lock;
>>> +};
>>> +
>>> +static void virtnet_bypass_child_open(struct net_device *dev,
>>> +                                     struct net_device *child_netdev)
>>> +{
>>> +       int err = dev_open(child_netdev);
>>> +
>>> +       if (err)
>>> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
>>> +                           child_netdev->name, err);
>>> +}
>>> +
>>> +static int virtnet_bypass_open(struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +
>>> +       netif_carrier_off(dev);
>>> +       netif_tx_wake_all_queues(dev);
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->active_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static int virtnet_bypass_close(struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +
>>> +       netif_tx_disable(dev);
>>> +
>>> +       child_netdev = rtnl_dereference(vi->active_netdev);
>>> +       if (child_netdev)
>>> +               dev_close(child_netdev);
>>> +
>>> +       child_netdev = rtnl_dereference(vi->backup_netdev);
>>> +       if (child_netdev)
>>> +               dev_close(child_netdev);
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
>>> +                                           struct net_device *dev)
>>> +{
>>> +       atomic_long_inc(&dev->tx_dropped);
>>> +       dev_kfree_skb_any(skb);
>>> +       return NETDEV_TX_OK;
>>> +}
>>> +
>>> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
>>> +{
>>> +       return netif_running(dev) && netif_carrier_ok(dev);
>>> +}
>>> +
>>> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
>>> +                                            struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *xmit_dev;
>>> +
>>> +       /* Try xmit via active netdev followed by backup netdev */
>>> +       xmit_dev = rcu_dereference_bh(vbi->active_netdev);
>>> +       if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
>>> +               xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
>>> +               if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
>>> +                       return virtnet_bypass_drop_xmit(skb, dev);
>>> +       }
>>> +
>>> +       skb->dev = xmit_dev;
>>> +       skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>>> +
>>> +       return dev_queue_xmit(skb);
>>> +}
>>> +
>>> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
>>> +                                      struct sk_buff *skb, void
>>> *accel_priv,
>>> +                                      select_queue_fallback_t fallback)
>>> +{
>>> +       /* This helper function exists to help dev_pick_tx get the
>>> correct
>>> +        * destination queue.  Using a helper function skips a call to
>>> +        * skb_tx_hash and will put the skbs in the queue we expect on
>>> their
>>> +        * way down to the bonding driver.
>>> +        */
>>> +       u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>>> +
>>> +       /* Save the original txq to restore before passing to the driver
>>> */
>>> +       qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>>> +
>>> +       if (unlikely(txq >= dev->real_num_tx_queues)) {
>>> +               do {
>>> +                       txq -= dev->real_num_tx_queues;
>>> +               } while (txq >= dev->real_num_tx_queues);
>>> +       }
>>> +
>>> +       return txq;
>>> +}
>>> +
>>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>>> + * that some drivers can provide 32bit values only.
>>> + */
>>> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
>>> +                                     const struct rtnl_link_stats64
>>> *_new,
>>> +                                     const struct rtnl_link_stats64
>>> *_old)
>>> +{
>>> +       const u64 *new = (const u64 *)_new;
>>> +       const u64 *old = (const u64 *)_old;
>>> +       u64 *res = (u64 *)_res;
>>> +       int i;
>>> +
>>> +       for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>>> +               u64 nv = new[i];
>>> +               u64 ov = old[i];
>>> +               s64 delta = nv - ov;
>>> +
>>> +               /* detects if this particular field is 32bit only */
>>> +               if (((nv | ov) >> 32) == 0)
>>> +                       delta = (s64)(s32)((u32)nv - (u32)ov);
>>> +
>>> +               /* filter anomalies, some drivers reset their stats
>>> +                * at down/up events.
>>> +                */
>>> +               if (delta > 0)
>>> +                       res[i] += delta;
>>> +       }
>>> +}
>>> +
>>> +static void virtnet_bypass_get_stats(struct net_device *dev,
>>> +                                    struct rtnl_link_stats64 *stats)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       const struct rtnl_link_stats64 *new;
>>> +       struct rtnl_link_stats64 temp;
>>> +       struct net_device *child_netdev;
>>> +
>>> +       spin_lock(&vbi->stats_lock);
>>> +       memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
>>> +
>>> +       rcu_read_lock();
>>> +
>>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>>> +       if (child_netdev) {
>>> +               new = dev_get_stats(child_netdev, &temp);
>>> +               virtnet_bypass_fold_stats(stats, new,
>>> &vbi->active_stats);
>>> +               memcpy(&vbi->active_stats, new, sizeof(*new));
>>> +       }
>>> +
>>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>>> +       if (child_netdev) {
>>> +               new = dev_get_stats(child_netdev, &temp);
>>> +               virtnet_bypass_fold_stats(stats, new,
>>> &vbi->backup_stats);
>>> +               memcpy(&vbi->backup_stats, new, sizeof(*new));
>>> +       }
>>> +
>>> +       rcu_read_unlock();
>>> +
>>> +       memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
>>> +       spin_unlock(&vbi->stats_lock);
>>> +}
>>> +
>>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int
>>> new_mtu)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +       int ret = 0;
>>> +
>>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>>> +       if (child_netdev) {
>>> +               ret = dev_set_mtu(child_netdev, new_mtu);
>>> +               if (ret)
>>> +                       return ret;
>>> +       }
>>> +
>>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>>> +       if (child_netdev) {
>>> +               ret = dev_set_mtu(child_netdev, new_mtu);
>>> +               if (ret)
>>> +                       netdev_err(child_netdev,
>>> +                                  "Unexpected failure to set mtu to
>>> %d\n",
>>> +                                  new_mtu);
>>
>> Shouldn't we unwind the MTU config on active_netdev if failing to set
>> it on backup_netdev?
>
>
> backup_netdev is virtio_net and  it is not expected to fail as it supports
> MTU all the way upto 64K.

Hmmm, then why do you need to check up the return of setting mtu on
backup_netdev here? And it's not completely impossible setting MTU
fails due to some other reason at the lower level. We should not take
assumption that way.

>
>
>
>>
>>> +       }
>>> +
>>> +       dev->mtu = new_mtu;
>>> +       return 0;
>>> +}
>>> +
>>> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +
>>> +       rcu_read_lock();
>>> +
>>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>>> +       if (child_netdev) {
>>> +               dev_uc_sync_multiple(child_netdev, dev);
>>> +               dev_mc_sync_multiple(child_netdev, dev);
>>> +       }
>>> +
>>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>>> +       if (child_netdev) {
>>> +               dev_uc_sync_multiple(child_netdev, dev);
>>> +               dev_mc_sync_multiple(child_netdev, dev);
>>> +       }
>>> +
>>
>> If VF comes up later than when set_rx_mode is called where do you sync
>> up the unicast and multicast address?
>
>
> I could include sync up of unicast and multicast list along with
> MTU when the child is registered.
>
>
>
>>
>> The rest looks good.
>>
>> Thanks,
>> -Siwei
>>
>>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-03  0:09         ` Siwei Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-03  0:09 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Fri, Mar 2, 2018 at 3:12 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 3/2/2018 1:11 PM, Siwei Liu wrote:
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudrala@intel.com> wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF
>>> is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host
>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>> virtio before starting the migration. After the migration is completed,
>>> the destination hypervisor sets the MAC filter on the VF and plugs it
>>> back
>>> to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and
>>> a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>> ---
>>>   drivers/net/virtio_net.c | 683
>>> ++++++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>>   #include <linux/cpu.h>
>>>   #include <linux/average.h>
>>>   #include <linux/filter.h>
>>> +#include <linux/netdevice.h>
>>> +#include <linux/pci.h>
>>>   #include <net/route.h>
>>>   #include <net/xdp.h>
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>>          u32 speed;
>>>
>>>          unsigned long guest_offloads;
>>> +
>>> +       /* upper netdev created when BACKUP feature enabled */
>>> +       struct net_device *bypass_netdev;
>>>   };
>>>
>>>   struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev,
>>> struct netdev_bpf *xdp)
>>>          }
>>>   }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> +                                     size_t len)
>>> +{
>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>> +       int ret;
>>> +
>>> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       ret = snprintf(buf, len, "_bkup");
>>> +       if (ret >= len)
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       return 0;
>>> +}
>>> +
>>
>> What if the systemd/udevd is not new enough to enforce the
>> n<phys_port_name> naming? Would virtio_bypass get a different name
>> than the original virtio_net? Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and ensalving
>> the VF?
>
>
> If udev doesn't support renaming of the devices,  then the upper bypass
> device
> should get the original name and the lower virtio netdev will get the next
> name.

If you got two virtio-net's (say e.g. eth0 and eth1) before the
update, the virtio-bypass interface on the first virtio gets eth0 and
the backup netdev would get eth1? Then the IP address originally on
eth1 gets configurd on the backup interface?

> Hopefully the distros updating the kernel will also move to the new
> systemd/udev.

This is not reliable. I'd opt for a new udev API for this, and fall
back to what it was if don't see fit.

>
>
>
>>
>>>   static const struct net_device_ops virtnet_netdev = {
>>>          .ndo_open            = virtnet_open,
>>>          .ndo_stop            = virtnet_close,
>>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev =
>>> {
>>>          .ndo_xdp_xmit           = virtnet_xdp_xmit,
>>>          .ndo_xdp_flush          = virtnet_xdp_flush,
>>>          .ndo_features_check     = passthru_features_check,
>>> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>>   };
>>>
>>>   static void virtnet_config_changed_work(struct work_struct *work)
>>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device
>>> *vdev)
>>>          return 0;
>>>   }
>>>
>>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>>> + * is created that acts as a master device and tracks the state of the
>>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>>> + * 'backup' netdev and a passthru device with the same MAC is registered
>>> + * as 'active' netdev.
>>> + */
>>> +
>>> +/* bypass state maintained when BACKUP feature is enabled */
>>> +struct virtnet_bypass_info {
>>> +       /* passthru netdev with same MAC */
>>> +       struct net_device __rcu *active_netdev;
>>> +
>>> +       /* virtio_net netdev */
>>> +       struct net_device __rcu *backup_netdev;
>>> +
>>> +       /* active netdev stats */
>>> +       struct rtnl_link_stats64 active_stats;
>>> +
>>> +       /* backup netdev stats */
>>> +       struct rtnl_link_stats64 backup_stats;
>>> +
>>> +       /* aggregated stats */
>>> +       struct rtnl_link_stats64 bypass_stats;
>>> +
>>> +       /* spinlock while updating stats */
>>> +       spinlock_t stats_lock;
>>> +};
>>> +
>>> +static void virtnet_bypass_child_open(struct net_device *dev,
>>> +                                     struct net_device *child_netdev)
>>> +{
>>> +       int err = dev_open(child_netdev);
>>> +
>>> +       if (err)
>>> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
>>> +                           child_netdev->name, err);
>>> +}
>>> +
>>> +static int virtnet_bypass_open(struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +
>>> +       netif_carrier_off(dev);
>>> +       netif_tx_wake_all_queues(dev);
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->active_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static int virtnet_bypass_close(struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +
>>> +       netif_tx_disable(dev);
>>> +
>>> +       child_netdev = rtnl_dereference(vi->active_netdev);
>>> +       if (child_netdev)
>>> +               dev_close(child_netdev);
>>> +
>>> +       child_netdev = rtnl_dereference(vi->backup_netdev);
>>> +       if (child_netdev)
>>> +               dev_close(child_netdev);
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
>>> +                                           struct net_device *dev)
>>> +{
>>> +       atomic_long_inc(&dev->tx_dropped);
>>> +       dev_kfree_skb_any(skb);
>>> +       return NETDEV_TX_OK;
>>> +}
>>> +
>>> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
>>> +{
>>> +       return netif_running(dev) && netif_carrier_ok(dev);
>>> +}
>>> +
>>> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
>>> +                                            struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *xmit_dev;
>>> +
>>> +       /* Try xmit via active netdev followed by backup netdev */
>>> +       xmit_dev = rcu_dereference_bh(vbi->active_netdev);
>>> +       if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
>>> +               xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
>>> +               if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
>>> +                       return virtnet_bypass_drop_xmit(skb, dev);
>>> +       }
>>> +
>>> +       skb->dev = xmit_dev;
>>> +       skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>>> +
>>> +       return dev_queue_xmit(skb);
>>> +}
>>> +
>>> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
>>> +                                      struct sk_buff *skb, void
>>> *accel_priv,
>>> +                                      select_queue_fallback_t fallback)
>>> +{
>>> +       /* This helper function exists to help dev_pick_tx get the
>>> correct
>>> +        * destination queue.  Using a helper function skips a call to
>>> +        * skb_tx_hash and will put the skbs in the queue we expect on
>>> their
>>> +        * way down to the bonding driver.
>>> +        */
>>> +       u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>>> +
>>> +       /* Save the original txq to restore before passing to the driver
>>> */
>>> +       qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>>> +
>>> +       if (unlikely(txq >= dev->real_num_tx_queues)) {
>>> +               do {
>>> +                       txq -= dev->real_num_tx_queues;
>>> +               } while (txq >= dev->real_num_tx_queues);
>>> +       }
>>> +
>>> +       return txq;
>>> +}
>>> +
>>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>>> + * that some drivers can provide 32bit values only.
>>> + */
>>> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
>>> +                                     const struct rtnl_link_stats64
>>> *_new,
>>> +                                     const struct rtnl_link_stats64
>>> *_old)
>>> +{
>>> +       const u64 *new = (const u64 *)_new;
>>> +       const u64 *old = (const u64 *)_old;
>>> +       u64 *res = (u64 *)_res;
>>> +       int i;
>>> +
>>> +       for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>>> +               u64 nv = new[i];
>>> +               u64 ov = old[i];
>>> +               s64 delta = nv - ov;
>>> +
>>> +               /* detects if this particular field is 32bit only */
>>> +               if (((nv | ov) >> 32) == 0)
>>> +                       delta = (s64)(s32)((u32)nv - (u32)ov);
>>> +
>>> +               /* filter anomalies, some drivers reset their stats
>>> +                * at down/up events.
>>> +                */
>>> +               if (delta > 0)
>>> +                       res[i] += delta;
>>> +       }
>>> +}
>>> +
>>> +static void virtnet_bypass_get_stats(struct net_device *dev,
>>> +                                    struct rtnl_link_stats64 *stats)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       const struct rtnl_link_stats64 *new;
>>> +       struct rtnl_link_stats64 temp;
>>> +       struct net_device *child_netdev;
>>> +
>>> +       spin_lock(&vbi->stats_lock);
>>> +       memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
>>> +
>>> +       rcu_read_lock();
>>> +
>>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>>> +       if (child_netdev) {
>>> +               new = dev_get_stats(child_netdev, &temp);
>>> +               virtnet_bypass_fold_stats(stats, new,
>>> &vbi->active_stats);
>>> +               memcpy(&vbi->active_stats, new, sizeof(*new));
>>> +       }
>>> +
>>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>>> +       if (child_netdev) {
>>> +               new = dev_get_stats(child_netdev, &temp);
>>> +               virtnet_bypass_fold_stats(stats, new,
>>> &vbi->backup_stats);
>>> +               memcpy(&vbi->backup_stats, new, sizeof(*new));
>>> +       }
>>> +
>>> +       rcu_read_unlock();
>>> +
>>> +       memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
>>> +       spin_unlock(&vbi->stats_lock);
>>> +}
>>> +
>>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int
>>> new_mtu)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +       int ret = 0;
>>> +
>>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>>> +       if (child_netdev) {
>>> +               ret = dev_set_mtu(child_netdev, new_mtu);
>>> +               if (ret)
>>> +                       return ret;
>>> +       }
>>> +
>>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>>> +       if (child_netdev) {
>>> +               ret = dev_set_mtu(child_netdev, new_mtu);
>>> +               if (ret)
>>> +                       netdev_err(child_netdev,
>>> +                                  "Unexpected failure to set mtu to
>>> %d\n",
>>> +                                  new_mtu);
>>
>> Shouldn't we unwind the MTU config on active_netdev if failing to set
>> it on backup_netdev?
>
>
> backup_netdev is virtio_net and  it is not expected to fail as it supports
> MTU all the way upto 64K.

Hmmm, then why do you need to check up the return of setting mtu on
backup_netdev here? And it's not completely impossible setting MTU
fails due to some other reason at the lower level. We should not take
assumption that way.

>
>
>
>>
>>> +       }
>>> +
>>> +       dev->mtu = new_mtu;
>>> +       return 0;
>>> +}
>>> +
>>> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +
>>> +       rcu_read_lock();
>>> +
>>> +       child_netdev = rcu_dereference(vbi->active_netdev);
>>> +       if (child_netdev) {
>>> +               dev_uc_sync_multiple(child_netdev, dev);
>>> +               dev_mc_sync_multiple(child_netdev, dev);
>>> +       }
>>> +
>>> +       child_netdev = rcu_dereference(vbi->backup_netdev);
>>> +       if (child_netdev) {
>>> +               dev_uc_sync_multiple(child_netdev, dev);
>>> +               dev_mc_sync_multiple(child_netdev, dev);
>>> +       }
>>> +
>>
>> If VF comes up later than when set_rx_mode is called where do you sync
>> up the unicast and multicast address?
>
>
> I could include sync up of unicast and multicast list along with
> MTU when the child is registered.
>
>
>
>>
>> The rest looks good.
>>
>> Thanks,
>> -Siwei
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 19:42           ` [virtio-dev] " Michael S. Tsirkin
  (?)
  (?)
@ 2018-03-03 11:31           ` Jiri Pirko
  2018-03-03 18:04               ` [virtio-dev] " Alexander Duyck
  -1 siblings, 1 reply; 102+ messages in thread
From: Jiri Pirko @ 2018-03-03 11:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>> >Yeah, this code essentially calls out the "shareable" code with a
>> >comment at the start and end of the section what defines the
>> >virtio_bypass functionality. It would just be a matter of mostly
>> >cutting and pasting to put it into a separate driver module.
>> 
>> Please put it there and unite the use of it with netvsc.
>
>Surely, adding this to other drivers (e.g. might this be handy for xen
>too?) can be left for a separate patchset. Let's get one device merged
>first.

Why? Let's do the generic infra alongside with the driver. I see no good
reason to rush into merging driver and only later, if ever, to convert
it to generic solution. On contrary. That would lead into multiple
approaches and different behavious in multiple drivers. That is plain
wrong.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-03 11:31           ` Jiri Pirko
@ 2018-03-03 18:04               ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-03 18:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>> >Yeah, this code essentially calls out the "shareable" code with a
>>> >comment at the start and end of the section what defines the
>>> >virtio_bypass functionality. It would just be a matter of mostly
>>> >cutting and pasting to put it into a separate driver module.
>>>
>>> Please put it there and unite the use of it with netvsc.
>>
>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>too?) can be left for a separate patchset. Let's get one device merged
>>first.
>
> Why? Let's do the generic infra alongside with the driver. I see no good
> reason to rush into merging driver and only later, if ever, to convert
> it to generic solution. On contrary. That would lead into multiple
> approaches and different behavious in multiple drivers. That is plain
> wrong.

If nothing else it doesn't hurt to do this in one driver in a generic
way, and once it has been proven to address all the needs of that one
driver we can then start moving other drivers to it. The current
solution is quite generic, that was my contribution to this patch set
as I didn't like how invasive it was being to virtio and thought it
would be best to keep this as minimally invasive as possible.

My preference would be to give this a release or two in virtio to
mature before we start pushing it onto other drivers. It shouldn't
take much to cut/paste this into a new driver file once we decide it
is time to start extending it out to other drivers.

- Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-03 18:04               ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-03 18:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>> >Yeah, this code essentially calls out the "shareable" code with a
>>> >comment at the start and end of the section what defines the
>>> >virtio_bypass functionality. It would just be a matter of mostly
>>> >cutting and pasting to put it into a separate driver module.
>>>
>>> Please put it there and unite the use of it with netvsc.
>>
>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>too?) can be left for a separate patchset. Let's get one device merged
>>first.
>
> Why? Let's do the generic infra alongside with the driver. I see no good
> reason to rush into merging driver and only later, if ever, to convert
> it to generic solution. On contrary. That would lead into multiple
> approaches and different behavious in multiple drivers. That is plain
> wrong.

If nothing else it doesn't hurt to do this in one driver in a generic
way, and once it has been proven to address all the needs of that one
driver we can then start moving other drivers to it. The current
solution is quite generic, that was my contribution to this patch set
as I didn't like how invasive it was being to virtio and thought it
would be best to keep this as minimally invasive as possible.

My preference would be to give this a release or two in virtio to
mature before we start pushing it onto other drivers. It shouldn't
take much to cut/paste this into a new driver file once we decide it
is time to start extending it out to other drivers.

- Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-03 18:04               ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-03-03 21:25               ` Jiri Pirko
  2018-03-04  0:26                   ` [virtio-dev] " Alexander Duyck
  -1 siblings, 1 reply; 102+ messages in thread
From: Jiri Pirko @ 2018-03-03 21:25 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>> >Yeah, this code essentially calls out the "shareable" code with a
>>>> >comment at the start and end of the section what defines the
>>>> >virtio_bypass functionality. It would just be a matter of mostly
>>>> >cutting and pasting to put it into a separate driver module.
>>>>
>>>> Please put it there and unite the use of it with netvsc.
>>>
>>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>>too?) can be left for a separate patchset. Let's get one device merged
>>>first.
>>
>> Why? Let's do the generic infra alongside with the driver. I see no good
>> reason to rush into merging driver and only later, if ever, to convert
>> it to generic solution. On contrary. That would lead into multiple
>> approaches and different behavious in multiple drivers. That is plain
>> wrong.
>
>If nothing else it doesn't hurt to do this in one driver in a generic
>way, and once it has been proven to address all the needs of that one
>driver we can then start moving other drivers to it. The current
>solution is quite generic, that was my contribution to this patch set
>as I didn't like how invasive it was being to virtio and thought it
>would be best to keep this as minimally invasive as possible.
>
>My preference would be to give this a release or two in virtio to
>mature before we start pushing it onto other drivers. It shouldn't
>take much to cut/paste this into a new driver file once we decide it
>is time to start extending it out to other drivers.

I'm not talking about cut/paste and in fact that is what I'm worried
about. I'm talking about common code in net/core/ or somewhere that
would take care of this in-driver bonding. Each driver, like virtio_net,
netvsc would just register some ops to it and the core would do all
logic. I believe it is essential take this approach from the start.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-03 21:25               ` Jiri Pirko
@ 2018-03-04  0:26                   ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-04  0:26 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>> >Yeah, this code essentially calls out the "shareable" code with a
>>>>> >comment at the start and end of the section what defines the
>>>>> >virtio_bypass functionality. It would just be a matter of mostly
>>>>> >cutting and pasting to put it into a separate driver module.
>>>>>
>>>>> Please put it there and unite the use of it with netvsc.
>>>>
>>>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>too?) can be left for a separate patchset. Let's get one device merged
>>>>first.
>>>
>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>> reason to rush into merging driver and only later, if ever, to convert
>>> it to generic solution. On contrary. That would lead into multiple
>>> approaches and different behavious in multiple drivers. That is plain
>>> wrong.
>>
>>If nothing else it doesn't hurt to do this in one driver in a generic
>>way, and once it has been proven to address all the needs of that one
>>driver we can then start moving other drivers to it. The current
>>solution is quite generic, that was my contribution to this patch set
>>as I didn't like how invasive it was being to virtio and thought it
>>would be best to keep this as minimally invasive as possible.
>>
>>My preference would be to give this a release or two in virtio to
>>mature before we start pushing it onto other drivers. It shouldn't
>>take much to cut/paste this into a new driver file once we decide it
>>is time to start extending it out to other drivers.
>
> I'm not talking about cut/paste and in fact that is what I'm worried
> about. I'm talking about common code in net/core/ or somewhere that
> would take care of this in-driver bonding. Each driver, like virtio_net,
> netvsc would just register some ops to it and the core would do all
> logic. I believe it is essential take this approach from the start.

Sorry, I didn't mean cut/paste into another driver, I meant to make it
a driver of its own. My thought was to eventually create a shared/core
driver module that is then used by the other drivers.

My concern right now is that Stephen has indicated he doesn't want
this approach taken with netvsc, and most of the community doesn't
want the netvsc approach applied to virtio. Until that impasse can be
resolved there isn't much value in trying to split this up so it is
available to other drivers. In addition I would imagine it would make
it a pain for others to back-port into distros since it would break
legacy netvsc driver behavior. Patches are always welcome. Once this
is in you are free to try fighting to get this made into a generic
module and applied to both drivers, but we have already spent close to
3 months on this and it seems like there has been significantly more
time spent arguing over the number of interfaces and/or drivers than
spent writing/reviewing actual code.

- Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-04  0:26                   ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-04  0:26 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>> >Yeah, this code essentially calls out the "shareable" code with a
>>>>> >comment at the start and end of the section what defines the
>>>>> >virtio_bypass functionality. It would just be a matter of mostly
>>>>> >cutting and pasting to put it into a separate driver module.
>>>>>
>>>>> Please put it there and unite the use of it with netvsc.
>>>>
>>>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>too?) can be left for a separate patchset. Let's get one device merged
>>>>first.
>>>
>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>> reason to rush into merging driver and only later, if ever, to convert
>>> it to generic solution. On contrary. That would lead into multiple
>>> approaches and different behavious in multiple drivers. That is plain
>>> wrong.
>>
>>If nothing else it doesn't hurt to do this in one driver in a generic
>>way, and once it has been proven to address all the needs of that one
>>driver we can then start moving other drivers to it. The current
>>solution is quite generic, that was my contribution to this patch set
>>as I didn't like how invasive it was being to virtio and thought it
>>would be best to keep this as minimally invasive as possible.
>>
>>My preference would be to give this a release or two in virtio to
>>mature before we start pushing it onto other drivers. It shouldn't
>>take much to cut/paste this into a new driver file once we decide it
>>is time to start extending it out to other drivers.
>
> I'm not talking about cut/paste and in fact that is what I'm worried
> about. I'm talking about common code in net/core/ or somewhere that
> would take care of this in-driver bonding. Each driver, like virtio_net,
> netvsc would just register some ops to it and the core would do all
> logic. I believe it is essential take this approach from the start.

Sorry, I didn't mean cut/paste into another driver, I meant to make it
a driver of its own. My thought was to eventually create a shared/core
driver module that is then used by the other drivers.

My concern right now is that Stephen has indicated he doesn't want
this approach taken with netvsc, and most of the community doesn't
want the netvsc approach applied to virtio. Until that impasse can be
resolved there isn't much value in trying to split this up so it is
available to other drivers. In addition I would imagine it would make
it a pain for others to back-port into distros since it would break
legacy netvsc driver behavior. Patches are always welcome. Once this
is in you are free to try fighting to get this made into a generic
module and applied to both drivers, but we have already spent close to
3 months on this and it seems like there has been significantly more
time spent arguing over the number of interfaces and/or drivers than
spent writing/reviewing actual code.

- Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 22:26                   ` [virtio-dev] " Siwei Liu
@ 2018-03-04  4:00                     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-04  4:00 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Samudrala, Sridhar, Alexander Duyck, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 02, 2018 at 02:26:48PM -0800, Siwei Liu wrote:
> On Fri, Mar 2, 2018 at 1:31 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote:
> >> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> >> >>
> >> >>
> >> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> >> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> >> >> > > The design limits things to a 1:1 relationship since we just have the
> >> >> > > child and backup pointers, but I don't think I am seeing exception
> >> >> > > handling to prevent us from overwriting the child pointers so there
> >> >> > > may be a leak there.
> >> >> > >
> >> >> > > Thanks.
> >> >> > >
> >> >> > > - Alex
> >> >> > In fact maintaining a list in that case would be nicer, and
> >> >> > just use an arbitrary one.
> >> >> > E.g. one can see how a user wanting to swap device 1 for device 2
> >> >> > might first add device 2 with same MAC then drop device 1.
> >> >>
> >> >> It should be possible to swap VF1 with VF2 by
> >> >> 1.- enabling virtio link
> >> >> 2.- unplugging VF1
> >> >> 3.- plugging VF2
> >> >> 4.- disabling virtio link
> >> >>
> >> >
> >> > True, but it isn't hard to avoid breakage if user
> >> > swapped steps 2 and 3. No need to make it more
> >> > fragile that it has to be.
> >>
> >> The migration case, VF2 is associated with another PF on another
> >> machine (destination), I wonder how it is possible.
> >
> > E.g. you want to remove the PF so you unplug the VF
> > then add another VF of the same PF.
> >
> >> Even with local plugging of VF2 on the same PF, the MAC address
> >> requirement (VF1's == VF2's) would fail the MAC address assignment on
> >> VF2.
> >>
> >> -Siwei
> >
> > Why would it fail? These are separate cards.
> 
> OK. I realized that you may talk about assigning a VF on a diffferent
> PF (VF1 on PF1 while VF2 on PF2). And we might assign a pass-through
> device rather than a VF. Yes, it's indeed possible that may happen but
> I take it as a further step down (another patch maybe) as it would
> involve changes to notify the network with gratuituious ARP and/or
> unsolicited ND advertisement of the MAC address association with the
> new port.
> 
> -Siwei

Interesting point. I guess that's a limitation in the curent patch then:
virtio and PT device must be connected to the same physical NIC.
Worth documenting.

> >
> >> >
> >> > --
> >> > MST
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> >> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >> >

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-04  4:00                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-04  4:00 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Samudrala, Sridhar, Alexander Duyck, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski

On Fri, Mar 02, 2018 at 02:26:48PM -0800, Siwei Liu wrote:
> On Fri, Mar 2, 2018 at 1:31 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote:
> >> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> >> >>
> >> >>
> >> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> >> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> >> >> > > The design limits things to a 1:1 relationship since we just have the
> >> >> > > child and backup pointers, but I don't think I am seeing exception
> >> >> > > handling to prevent us from overwriting the child pointers so there
> >> >> > > may be a leak there.
> >> >> > >
> >> >> > > Thanks.
> >> >> > >
> >> >> > > - Alex
> >> >> > In fact maintaining a list in that case would be nicer, and
> >> >> > just use an arbitrary one.
> >> >> > E.g. one can see how a user wanting to swap device 1 for device 2
> >> >> > might first add device 2 with same MAC then drop device 1.
> >> >>
> >> >> It should be possible to swap VF1 with VF2 by
> >> >> 1.- enabling virtio link
> >> >> 2.- unplugging VF1
> >> >> 3.- plugging VF2
> >> >> 4.- disabling virtio link
> >> >>
> >> >
> >> > True, but it isn't hard to avoid breakage if user
> >> > swapped steps 2 and 3. No need to make it more
> >> > fragile that it has to be.
> >>
> >> The migration case, VF2 is associated with another PF on another
> >> machine (destination), I wonder how it is possible.
> >
> > E.g. you want to remove the PF so you unplug the VF
> > then add another VF of the same PF.
> >
> >> Even with local plugging of VF2 on the same PF, the MAC address
> >> requirement (VF1's == VF2's) would fail the MAC address assignment on
> >> VF2.
> >>
> >> -Siwei
> >
> > Why would it fail? These are separate cards.
> 
> OK. I realized that you may talk about assigning a VF on a diffferent
> PF (VF1 on PF1 while VF2 on PF2). And we might assign a pass-through
> device rather than a VF. Yes, it's indeed possible that may happen but
> I take it as a further step down (another patch maybe) as it would
> involve changes to notify the network with gratuituious ARP and/or
> unsolicited ND advertisement of the MAC address association with the
> new port.
> 
> -Siwei

Interesting point. I guess that's a limitation in the curent patch then:
virtio and PT device must be connected to the same physical NIC.
Worth documenting.

> >
> >> >
> >> > --
> >> > MST
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> >> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >> >

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-02 23:56         ` [virtio-dev] " Siwei Liu
@ 2018-03-04  4:04           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-04  4:04 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Sridhar Samudrala, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Fri, Mar 02, 2018 at 03:56:31PM -0800, Siwei Liu wrote:
> On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
> >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
> >> <sridhar.samudrala@intel.com> wrote:
> >> > This patch enables virtio_net to switch over to a VF datapath when a VF
> >> > netdev is present with the same MAC address. It allows live migration
> >> > of a VM with a direct attached VF without the need to setup a bond/team
> >> > between a VF and virtio net device in the guest.
> >> >
> >> > The hypervisor needs to enable only one datapath at any time so that
> >> > packets don't get looped back to the VM over the other datapath. When a VF
> >> > is plugged, the virtio datapath link state can be marked as down. The
> >> > hypervisor needs to unplug the VF device from the guest on the source host
> >> > and reset the MAC filter of the VF to initiate failover of datapath to
> >> > virtio before starting the migration. After the migration is completed,
> >> > the destination hypervisor sets the MAC filter on the VF and plugs it back
> >> > to the guest to switch over to VF datapath.
> >> >
> >> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> >> > created that acts as a master device and tracks the state of the 2 lower
> >> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> >> > passthru device with the same MAC is registered as 'active' netdev.
> >> >
> >> > This patch is based on the discussion initiated by Jesse on this thread.
> >> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> >> >
> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> >> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> >> > ---
> >> >  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
> >> >  1 file changed, 682 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> > index bcd13fe906ca..f2860d86c952 100644
> >> > --- a/drivers/net/virtio_net.c
> >> > +++ b/drivers/net/virtio_net.c
> >> > @@ -30,6 +30,8 @@
> >> >  #include <linux/cpu.h>
> >> >  #include <linux/average.h>
> >> >  #include <linux/filter.h>
> >> > +#include <linux/netdevice.h>
> >> > +#include <linux/pci.h>
> >> >  #include <net/route.h>
> >> >  #include <net/xdp.h>
> >> >
> >> > @@ -206,6 +208,9 @@ struct virtnet_info {
> >> >         u32 speed;
> >> >
> >> >         unsigned long guest_offloads;
> >> > +
> >> > +       /* upper netdev created when BACKUP feature enabled */
> >> > +       struct net_device *bypass_netdev;
> >> >  };
> >> >
> >> >  struct padded_vnet_hdr {
> >> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >> >         }
> >> >  }
> >> >
> >> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> >> > +                                     size_t len)
> >> > +{
> >> > +       struct virtnet_info *vi = netdev_priv(dev);
> >> > +       int ret;
> >> > +
> >> > +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> >> > +               return -EOPNOTSUPP;
> >> > +
> >> > +       ret = snprintf(buf, len, "_bkup");
> >> > +       if (ret >= len)
> >> > +               return -EOPNOTSUPP;
> >> > +
> >> > +       return 0;
> >> > +}
> >> > +
> >>
> >> What if the systemd/udevd is not new enough to enforce the
> >> n<phys_port_name> naming? Would virtio_bypass get a different name
> >> than the original virtio_net?
> >
> > You mean people using ethX names? Any hardware config change breaks
> > these, I don't think that can be helped.
> 
> I don't like the way to rely on .ndo_get_phys_port_name - it's fragile
> and it does not completely solve the problem it tries to address.
> Imagine what can end up with if getting an old udevd, or users already
> have exsiting explicit udev rules around phys_port_name. It does not
> give you the an ack in saying "yes, I know you're the bypass and
> you're the backup, please continue and I will give you both correct
> names", or an unacknowlegment saying "no, I don't know what these
> extra interfaces are, please go back and leave the VF device alone".
> We need new udev API for both feature negotiation and naming, or may
> even completely hide the lower interfaces.

Go ahead and try to make this happen, but I won't hold my
breath.

> >
> >> Should we detect this earlier and fall
> >> back to legacy mode without creating the bypass netdev and ensalving
> >> the VF?
> >
> > I don't think we can do this with existing kernel/userspace APIs.
> 
> That's why I ever said to make udev aware of this new type of combined
> device instead of doing hacks here and there around.
> 
> Regards,
> -Siwei

We can add new interfaces on top but the main purpose here is to
make old userspace do new tricks.

> >
> > --
> > MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-04  4:04           ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-04  4:04 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Sridhar Samudrala, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Fri, Mar 02, 2018 at 03:56:31PM -0800, Siwei Liu wrote:
> On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
> >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
> >> <sridhar.samudrala@intel.com> wrote:
> >> > This patch enables virtio_net to switch over to a VF datapath when a VF
> >> > netdev is present with the same MAC address. It allows live migration
> >> > of a VM with a direct attached VF without the need to setup a bond/team
> >> > between a VF and virtio net device in the guest.
> >> >
> >> > The hypervisor needs to enable only one datapath at any time so that
> >> > packets don't get looped back to the VM over the other datapath. When a VF
> >> > is plugged, the virtio datapath link state can be marked as down. The
> >> > hypervisor needs to unplug the VF device from the guest on the source host
> >> > and reset the MAC filter of the VF to initiate failover of datapath to
> >> > virtio before starting the migration. After the migration is completed,
> >> > the destination hypervisor sets the MAC filter on the VF and plugs it back
> >> > to the guest to switch over to VF datapath.
> >> >
> >> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> >> > created that acts as a master device and tracks the state of the 2 lower
> >> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> >> > passthru device with the same MAC is registered as 'active' netdev.
> >> >
> >> > This patch is based on the discussion initiated by Jesse on this thread.
> >> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> >> >
> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> >> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> >> > ---
> >> >  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
> >> >  1 file changed, 682 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> > index bcd13fe906ca..f2860d86c952 100644
> >> > --- a/drivers/net/virtio_net.c
> >> > +++ b/drivers/net/virtio_net.c
> >> > @@ -30,6 +30,8 @@
> >> >  #include <linux/cpu.h>
> >> >  #include <linux/average.h>
> >> >  #include <linux/filter.h>
> >> > +#include <linux/netdevice.h>
> >> > +#include <linux/pci.h>
> >> >  #include <net/route.h>
> >> >  #include <net/xdp.h>
> >> >
> >> > @@ -206,6 +208,9 @@ struct virtnet_info {
> >> >         u32 speed;
> >> >
> >> >         unsigned long guest_offloads;
> >> > +
> >> > +       /* upper netdev created when BACKUP feature enabled */
> >> > +       struct net_device *bypass_netdev;
> >> >  };
> >> >
> >> >  struct padded_vnet_hdr {
> >> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >> >         }
> >> >  }
> >> >
> >> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> >> > +                                     size_t len)
> >> > +{
> >> > +       struct virtnet_info *vi = netdev_priv(dev);
> >> > +       int ret;
> >> > +
> >> > +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> >> > +               return -EOPNOTSUPP;
> >> > +
> >> > +       ret = snprintf(buf, len, "_bkup");
> >> > +       if (ret >= len)
> >> > +               return -EOPNOTSUPP;
> >> > +
> >> > +       return 0;
> >> > +}
> >> > +
> >>
> >> What if the systemd/udevd is not new enough to enforce the
> >> n<phys_port_name> naming? Would virtio_bypass get a different name
> >> than the original virtio_net?
> >
> > You mean people using ethX names? Any hardware config change breaks
> > these, I don't think that can be helped.
> 
> I don't like the way to rely on .ndo_get_phys_port_name - it's fragile
> and it does not completely solve the problem it tries to address.
> Imagine what can end up with if getting an old udevd, or users already
> have exsiting explicit udev rules around phys_port_name. It does not
> give you the an ack in saying "yes, I know you're the bypass and
> you're the backup, please continue and I will give you both correct
> names", or an unacknowlegment saying "no, I don't know what these
> extra interfaces are, please go back and leave the VF device alone".
> We need new udev API for both feature negotiation and naming, or may
> even completely hide the lower interfaces.

Go ahead and try to make this happen, but I won't hold my
breath.

> >
> >> Should we detect this earlier and fall
> >> back to legacy mode without creating the bypass netdev and ensalving
> >> the VF?
> >
> > I don't think we can do this with existing kernel/userspace APIs.
> 
> That's why I ever said to make udev aware of this new type of combined
> device instead of doing hacks here and there around.
> 
> Regards,
> -Siwei

We can add new interfaces on top but the main purpose here is to
make old userspace do new tricks.

> >
> > --
> > MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-04  0:26                   ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-03-04  7:13                   ` Jiri Pirko
  2018-03-04 18:24                       ` [virtio-dev] " Alexander Duyck
  -1 siblings, 1 reply; 102+ messages in thread
From: Jiri Pirko @ 2018-03-04  7:13 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote:
>On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>>> >Yeah, this code essentially calls out the "shareable" code with a
>>>>>> >comment at the start and end of the section what defines the
>>>>>> >virtio_bypass functionality. It would just be a matter of mostly
>>>>>> >cutting and pasting to put it into a separate driver module.
>>>>>>
>>>>>> Please put it there and unite the use of it with netvsc.
>>>>>
>>>>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>>too?) can be left for a separate patchset. Let's get one device merged
>>>>>first.
>>>>
>>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>>> reason to rush into merging driver and only later, if ever, to convert
>>>> it to generic solution. On contrary. That would lead into multiple
>>>> approaches and different behavious in multiple drivers. That is plain
>>>> wrong.
>>>
>>>If nothing else it doesn't hurt to do this in one driver in a generic
>>>way, and once it has been proven to address all the needs of that one
>>>driver we can then start moving other drivers to it. The current
>>>solution is quite generic, that was my contribution to this patch set
>>>as I didn't like how invasive it was being to virtio and thought it
>>>would be best to keep this as minimally invasive as possible.
>>>
>>>My preference would be to give this a release or two in virtio to
>>>mature before we start pushing it onto other drivers. It shouldn't
>>>take much to cut/paste this into a new driver file once we decide it
>>>is time to start extending it out to other drivers.
>>
>> I'm not talking about cut/paste and in fact that is what I'm worried
>> about. I'm talking about common code in net/core/ or somewhere that
>> would take care of this in-driver bonding. Each driver, like virtio_net,
>> netvsc would just register some ops to it and the core would do all
>> logic. I believe it is essential take this approach from the start.
>
>Sorry, I didn't mean cut/paste into another driver, I meant to make it
>a driver of its own. My thought was to eventually create a shared/core
>driver module that is then used by the other drivers.
>
>My concern right now is that Stephen has indicated he doesn't want
>this approach taken with netvsc, and most of the community doesn't

IIUC, he only does not like the extra netdev. Is there anything else?


>want the netvsc approach applied to virtio. Until that impasse can be
>resolved there isn't much value in trying to split this up so it is
>available to other drivers. In addition I would imagine it would make
>it a pain for others to back-port into distros since it would break
>legacy netvsc driver behavior. Patches are always welcome. Once this
>is in you are free to try fighting to get this made into a generic
>module and applied to both drivers, but we have already spent close to
>3 months on this and it seems like there has been significantly more

Alex, time is never a good argument for poor design and shortcuts.


>time spent arguing over the number of interfaces and/or drivers than
>spent writing/reviewing actual code.
>
>- Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-04  7:13                   ` Jiri Pirko
@ 2018-03-04 18:24                       ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-04 18:24 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote:
>>On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>>>> >Yeah, this code essentially calls out the "shareable" code with a
>>>>>>> >comment at the start and end of the section what defines the
>>>>>>> >virtio_bypass functionality. It would just be a matter of mostly
>>>>>>> >cutting and pasting to put it into a separate driver module.
>>>>>>>
>>>>>>> Please put it there and unite the use of it with netvsc.
>>>>>>
>>>>>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>>>too?) can be left for a separate patchset. Let's get one device merged
>>>>>>first.
>>>>>
>>>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>>>> reason to rush into merging driver and only later, if ever, to convert
>>>>> it to generic solution. On contrary. That would lead into multiple
>>>>> approaches and different behavious in multiple drivers. That is plain
>>>>> wrong.
>>>>
>>>>If nothing else it doesn't hurt to do this in one driver in a generic
>>>>way, and once it has been proven to address all the needs of that one
>>>>driver we can then start moving other drivers to it. The current
>>>>solution is quite generic, that was my contribution to this patch set
>>>>as I didn't like how invasive it was being to virtio and thought it
>>>>would be best to keep this as minimally invasive as possible.
>>>>
>>>>My preference would be to give this a release or two in virtio to
>>>>mature before we start pushing it onto other drivers. It shouldn't
>>>>take much to cut/paste this into a new driver file once we decide it
>>>>is time to start extending it out to other drivers.
>>>
>>> I'm not talking about cut/paste and in fact that is what I'm worried
>>> about. I'm talking about common code in net/core/ or somewhere that
>>> would take care of this in-driver bonding. Each driver, like virtio_net,
>>> netvsc would just register some ops to it and the core would do all
>>> logic. I believe it is essential take this approach from the start.
>>
>>Sorry, I didn't mean cut/paste into another driver, I meant to make it
>>a driver of its own. My thought was to eventually create a shared/core
>>driver module that is then used by the other drivers.
>>
>>My concern right now is that Stephen has indicated he doesn't want
>>this approach taken with netvsc, and most of the community doesn't
>
> IIUC, he only does not like the extra netdev. Is there anything else?

Nope that is pretty much it. It doesn't seem like a big deal for
virtio, but for netvsc it is significant since they don't have any
"backup" bit feature differentiation, so they would likely be stuck
with 2 netdevs even in their basic setup.

>>want the netvsc approach applied to virtio. Until that impasse can be
>>resolved there isn't much value in trying to split this up so it is
>>available to other drivers. In addition I would imagine it would make
>>it a pain for others to back-port into distros since it would break
>>legacy netvsc driver behavior. Patches are always welcome. Once this
>>is in you are free to try fighting to get this made into a generic
>>module and applied to both drivers, but we have already spent close to
>>3 months on this and it seems like there has been significantly more
>
> Alex, time is never a good argument for poor design and shortcuts.

I'm not saying we should go with a poor design due to time. But
expecting us to implement something where the maintainer of said
driver has not agreed to is pointless, and I don't see it as a design
shortcut to implement something in one driver with the expectation
that we will then make it core later once it has proven itself and has
use elsewhere. In the meantime I would imagine it also makes it easier
for things like backports and such for us to do it this way since we
are only impacting one driver.

You are telling us to do something that not everyone has agreed to.
Currently we only have agreement from Michael on taking this code, as
such we are working with virtio only for now. When the time comes that
we can get other maintainers, specifically Stephen, to agree to it
then we can cut/paste this code into a core file or into a module of
its own. Alternatively I suppose we could take this up to Dave if you
can't get Stephen to agree. If you can get Dave to say we need to
change netvsc then we will go ahead with it, but generally I prefer to
respect when the maintainer of something says they don't want us
modifying their code in some way.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-04 18:24                       ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-04 18:24 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote:
>>On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>>>> >Yeah, this code essentially calls out the "shareable" code with a
>>>>>>> >comment at the start and end of the section what defines the
>>>>>>> >virtio_bypass functionality. It would just be a matter of mostly
>>>>>>> >cutting and pasting to put it into a separate driver module.
>>>>>>>
>>>>>>> Please put it there and unite the use of it with netvsc.
>>>>>>
>>>>>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>>>too?) can be left for a separate patchset. Let's get one device merged
>>>>>>first.
>>>>>
>>>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>>>> reason to rush into merging driver and only later, if ever, to convert
>>>>> it to generic solution. On contrary. That would lead into multiple
>>>>> approaches and different behavious in multiple drivers. That is plain
>>>>> wrong.
>>>>
>>>>If nothing else it doesn't hurt to do this in one driver in a generic
>>>>way, and once it has been proven to address all the needs of that one
>>>>driver we can then start moving other drivers to it. The current
>>>>solution is quite generic, that was my contribution to this patch set
>>>>as I didn't like how invasive it was being to virtio and thought it
>>>>would be best to keep this as minimally invasive as possible.
>>>>
>>>>My preference would be to give this a release or two in virtio to
>>>>mature before we start pushing it onto other drivers. It shouldn't
>>>>take much to cut/paste this into a new driver file once we decide it
>>>>is time to start extending it out to other drivers.
>>>
>>> I'm not talking about cut/paste and in fact that is what I'm worried
>>> about. I'm talking about common code in net/core/ or somewhere that
>>> would take care of this in-driver bonding. Each driver, like virtio_net,
>>> netvsc would just register some ops to it and the core would do all
>>> logic. I believe it is essential take this approach from the start.
>>
>>Sorry, I didn't mean cut/paste into another driver, I meant to make it
>>a driver of its own. My thought was to eventually create a shared/core
>>driver module that is then used by the other drivers.
>>
>>My concern right now is that Stephen has indicated he doesn't want
>>this approach taken with netvsc, and most of the community doesn't
>
> IIUC, he only does not like the extra netdev. Is there anything else?

Nope that is pretty much it. It doesn't seem like a big deal for
virtio, but for netvsc it is significant since they don't have any
"backup" bit feature differentiation, so they would likely be stuck
with 2 netdevs even in their basic setup.

>>want the netvsc approach applied to virtio. Until that impasse can be
>>resolved there isn't much value in trying to split this up so it is
>>available to other drivers. In addition I would imagine it would make
>>it a pain for others to back-port into distros since it would break
>>legacy netvsc driver behavior. Patches are always welcome. Once this
>>is in you are free to try fighting to get this made into a generic
>>module and applied to both drivers, but we have already spent close to
>>3 months on this and it seems like there has been significantly more
>
> Alex, time is never a good argument for poor design and shortcuts.

I'm not saying we should go with a poor design due to time. But
expecting us to implement something where the maintainer of said
driver has not agreed to is pointless, and I don't see it as a design
shortcut to implement something in one driver with the expectation
that we will then make it core later once it has proven itself and has
use elsewhere. In the meantime I would imagine it also makes it easier
for things like backports and such for us to do it this way since we
are only impacting one driver.

You are telling us to do something that not everyone has agreed to.
Currently we only have agreement from Michael on taking this code, as
such we are working with virtio only for now. When the time comes that
we can get other maintainers, specifically Stephen, to agree to it
then we can cut/paste this code into a core file or into a module of
its own. Alternatively I suppose we could take this up to Dave if you
can't get Stephen to agree. If you can get Dave to say we need to
change netvsc then we will go ahead with it, but generally I prefer to
respect when the maintainer of something says they don't want us
modifying their code in some way.

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-04 18:24                       ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-03-04 18:50                       ` Jiri Pirko
  2018-03-04 21:54                           ` [virtio-dev] " Samudrala, Sridhar
  2018-03-04 21:58                           ` [virtio-dev] " Alexander Duyck
  -1 siblings, 2 replies; 102+ messages in thread
From: Jiri Pirko @ 2018-03-04 18:50 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote:
>>>On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>>>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>>>>> >Yeah, this code essentially calls out the "shareable" code with a
>>>>>>>> >comment at the start and end of the section what defines the
>>>>>>>> >virtio_bypass functionality. It would just be a matter of mostly
>>>>>>>> >cutting and pasting to put it into a separate driver module.
>>>>>>>>
>>>>>>>> Please put it there and unite the use of it with netvsc.
>>>>>>>
>>>>>>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>>>>too?) can be left for a separate patchset. Let's get one device merged
>>>>>>>first.
>>>>>>
>>>>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>>>>> reason to rush into merging driver and only later, if ever, to convert
>>>>>> it to generic solution. On contrary. That would lead into multiple
>>>>>> approaches and different behavious in multiple drivers. That is plain
>>>>>> wrong.
>>>>>
>>>>>If nothing else it doesn't hurt to do this in one driver in a generic
>>>>>way, and once it has been proven to address all the needs of that one
>>>>>driver we can then start moving other drivers to it. The current
>>>>>solution is quite generic, that was my contribution to this patch set
>>>>>as I didn't like how invasive it was being to virtio and thought it
>>>>>would be best to keep this as minimally invasive as possible.
>>>>>
>>>>>My preference would be to give this a release or two in virtio to
>>>>>mature before we start pushing it onto other drivers. It shouldn't
>>>>>take much to cut/paste this into a new driver file once we decide it
>>>>>is time to start extending it out to other drivers.
>>>>
>>>> I'm not talking about cut/paste and in fact that is what I'm worried
>>>> about. I'm talking about common code in net/core/ or somewhere that
>>>> would take care of this in-driver bonding. Each driver, like virtio_net,
>>>> netvsc would just register some ops to it and the core would do all
>>>> logic. I believe it is essential take this approach from the start.
>>>
>>>Sorry, I didn't mean cut/paste into another driver, I meant to make it
>>>a driver of its own. My thought was to eventually create a shared/core
>>>driver module that is then used by the other drivers.
>>>
>>>My concern right now is that Stephen has indicated he doesn't want
>>>this approach taken with netvsc, and most of the community doesn't
>>
>> IIUC, he only does not like the extra netdev. Is there anything else?
>
>Nope that is pretty much it. It doesn't seem like a big deal for
>virtio, but for netvsc it is significant since they don't have any
>"backup" bit feature differentiation, so they would likely be stuck
>with 2 netdevs even in their basic setup.

Okay. If that is a strict "no-go" for netvsc, this should be
just a flag passed down to the in-driver bond code.


>
>>>want the netvsc approach applied to virtio. Until that impasse can be
>>>resolved there isn't much value in trying to split this up so it is
>>>available to other drivers. In addition I would imagine it would make
>>>it a pain for others to back-port into distros since it would break
>>>legacy netvsc driver behavior. Patches are always welcome. Once this
>>>is in you are free to try fighting to get this made into a generic
>>>module and applied to both drivers, but we have already spent close to
>>>3 months on this and it seems like there has been significantly more
>>
>> Alex, time is never a good argument for poor design and shortcuts.
>
>I'm not saying we should go with a poor design due to time. But
>expecting us to implement something where the maintainer of said
>driver has not agreed to is pointless, and I don't see it as a design

He just does not like the third netdev, not the fact that the code
for in-driver bonding would be shared.


>shortcut to implement something in one driver with the expectation
>that we will then make it core later once it has proven itself and has
>use elsewhere. In the meantime I would imagine it also makes it easier
>for things like backports and such for us to do it this way since we
>are only impacting one driver.

When you are working on upstream kernel, you should not care about
backports. That leads to poor design. Not an argument.


>
>You are telling us to do something that not everyone has agreed to.

Who did not?


>Currently we only have agreement from Michael on taking this code, as
>such we are working with virtio only for now. When the time comes that

If you do duplication of netvsc in-driver bonding in virtio_net, it will
stay there forever. So what you say is: "We will do it halfway now
and promise to fix it later". That later will never happen, I'm pretty
sure. That is why I push for in-driver bonding shared code as a part of
this patchset.

+ if you would be pushing first driver to do this, I would understand.
But the first driver is already in. You are pushing second. This is the
time to do the sharing, unification of behaviour. Next time is too late.


>we can get other maintainers, specifically Stephen, to agree to it
>then we can cut/paste this code into a core file or into a module of
>its own. Alternatively I suppose we could take this up to Dave if you
>can't get Stephen to agree. If you can get Dave to say we need to
>change netvsc then we will go ahead with it, but generally I prefer to
>respect when the maintainer of something says they don't want us
>modifying their code in some way.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-04 18:50                       ` Jiri Pirko
@ 2018-03-04 21:54                           ` Samudrala, Sridhar
  2018-03-04 21:58                           ` [virtio-dev] " Alexander Duyck
  1 sibling, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-04 21:54 UTC (permalink / raw)
  To: Jiri Pirko, Alexander Duyck
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski



On 3/4/2018 10:50 AM, Jiri Pirko wrote:
> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>> On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote:
>>>> On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>>>>> On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>>>>> On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>>>>>>> Yeah, this code essentially calls out the "shareable" code with a
>>>>>>>>>> comment at the start and end of the section what defines the
>>>>>>>>>> virtio_bypass functionality. It would just be a matter of mostly
>>>>>>>>>> cutting and pasting to put it into a separate driver module.
>>>>>>>>> Please put it there and unite the use of it with netvsc.
>>>>>>>> Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>>>>> too?) can be left for a separate patchset. Let's get one device merged
>>>>>>>> first.
>>>>>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>>>>>> reason to rush into merging driver and only later, if ever, to convert
>>>>>>> it to generic solution. On contrary. That would lead into multiple
>>>>>>> approaches and different behavious in multiple drivers. That is plain
>>>>>>> wrong.
>>>>>> If nothing else it doesn't hurt to do this in one driver in a generic
>>>>>> way, and once it has been proven to address all the needs of that one
>>>>>> driver we can then start moving other drivers to it. The current
>>>>>> solution is quite generic, that was my contribution to this patch set
>>>>>> as I didn't like how invasive it was being to virtio and thought it
>>>>>> would be best to keep this as minimally invasive as possible.
>>>>>>
>>>>>> My preference would be to give this a release or two in virtio to
>>>>>> mature before we start pushing it onto other drivers. It shouldn't
>>>>>> take much to cut/paste this into a new driver file once we decide it
>>>>>> is time to start extending it out to other drivers.
>>>>> I'm not talking about cut/paste and in fact that is what I'm worried
>>>>> about. I'm talking about common code in net/core/ or somewhere that
>>>>> would take care of this in-driver bonding. Each driver, like virtio_net,
>>>>> netvsc would just register some ops to it and the core would do all
>>>>> logic. I believe it is essential take this approach from the start.
>>>> Sorry, I didn't mean cut/paste into another driver, I meant to make it
>>>> a driver of its own. My thought was to eventually create a shared/core
>>>> driver module that is then used by the other drivers.
>>>>
>>>> My concern right now is that Stephen has indicated he doesn't want
>>>> this approach taken with netvsc, and most of the community doesn't
>>> IIUC, he only does not like the extra netdev. Is there anything else?
>> Nope that is pretty much it. It doesn't seem like a big deal for
>> virtio, but for netvsc it is significant since they don't have any
>> "backup" bit feature differentiation, so they would likely be stuck
>> with 2 netdevs even in their basic setup.
> Okay. If that is a strict "no-go" for netvsc, this should be
> just a flag passed down to the in-driver bond code.

This results in a 3 driver model (virtio/netvsc, vf & bypass) with 2 netdevs
created when bypass is based on netvsc and 3 netdevs created when the bypass
is based on virtio_net.

Unless we agree on a common netdev model between netvsc and virtio_net,
i am not sure if it is useful to commonize the code into a separate driver.

-Sridhar

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-04 21:54                           ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-04 21:54 UTC (permalink / raw)
  To: Jiri Pirko, Alexander Duyck
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski



On 3/4/2018 10:50 AM, Jiri Pirko wrote:
> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>> On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote:
>>>> On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>>>>> On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>>>>> On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>>>>>>> Yeah, this code essentially calls out the "shareable" code with a
>>>>>>>>>> comment at the start and end of the section what defines the
>>>>>>>>>> virtio_bypass functionality. It would just be a matter of mostly
>>>>>>>>>> cutting and pasting to put it into a separate driver module.
>>>>>>>>> Please put it there and unite the use of it with netvsc.
>>>>>>>> Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>>>>> too?) can be left for a separate patchset. Let's get one device merged
>>>>>>>> first.
>>>>>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>>>>>> reason to rush into merging driver and only later, if ever, to convert
>>>>>>> it to generic solution. On contrary. That would lead into multiple
>>>>>>> approaches and different behavious in multiple drivers. That is plain
>>>>>>> wrong.
>>>>>> If nothing else it doesn't hurt to do this in one driver in a generic
>>>>>> way, and once it has been proven to address all the needs of that one
>>>>>> driver we can then start moving other drivers to it. The current
>>>>>> solution is quite generic, that was my contribution to this patch set
>>>>>> as I didn't like how invasive it was being to virtio and thought it
>>>>>> would be best to keep this as minimally invasive as possible.
>>>>>>
>>>>>> My preference would be to give this a release or two in virtio to
>>>>>> mature before we start pushing it onto other drivers. It shouldn't
>>>>>> take much to cut/paste this into a new driver file once we decide it
>>>>>> is time to start extending it out to other drivers.
>>>>> I'm not talking about cut/paste and in fact that is what I'm worried
>>>>> about. I'm talking about common code in net/core/ or somewhere that
>>>>> would take care of this in-driver bonding. Each driver, like virtio_net,
>>>>> netvsc would just register some ops to it and the core would do all
>>>>> logic. I believe it is essential take this approach from the start.
>>>> Sorry, I didn't mean cut/paste into another driver, I meant to make it
>>>> a driver of its own. My thought was to eventually create a shared/core
>>>> driver module that is then used by the other drivers.
>>>>
>>>> My concern right now is that Stephen has indicated he doesn't want
>>>> this approach taken with netvsc, and most of the community doesn't
>>> IIUC, he only does not like the extra netdev. Is there anything else?
>> Nope that is pretty much it. It doesn't seem like a big deal for
>> virtio, but for netvsc it is significant since they don't have any
>> "backup" bit feature differentiation, so they would likely be stuck
>> with 2 netdevs even in their basic setup.
> Okay. If that is a strict "no-go" for netvsc, this should be
> just a flag passed down to the in-driver bond code.

This results in a 3 driver model (virtio/netvsc, vf & bypass) with 2 netdevs
created when bypass is based on netvsc and 3 netdevs created when the bypass
is based on virtio_net.

Unless we agree on a common netdev model between netvsc and virtio_net,
i am not sure if it is useful to commonize the code into a separate driver.

-Sridhar

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-04 18:50                       ` Jiri Pirko
@ 2018-03-04 21:58                           ` Alexander Duyck
  2018-03-04 21:58                           ` [virtio-dev] " Alexander Duyck
  1 sibling, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-04 21:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote:
>>>>On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>>>>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>>>>>> >Yeah, this code essentially calls out the "shareable" code with a
>>>>>>>>> >comment at the start and end of the section what defines the
>>>>>>>>> >virtio_bypass functionality. It would just be a matter of mostly
>>>>>>>>> >cutting and pasting to put it into a separate driver module.
>>>>>>>>>
>>>>>>>>> Please put it there and unite the use of it with netvsc.
>>>>>>>>
>>>>>>>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>>>>>too?) can be left for a separate patchset. Let's get one device merged
>>>>>>>>first.
>>>>>>>
>>>>>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>>>>>> reason to rush into merging driver and only later, if ever, to convert
>>>>>>> it to generic solution. On contrary. That would lead into multiple
>>>>>>> approaches and different behavious in multiple drivers. That is plain
>>>>>>> wrong.
>>>>>>
>>>>>>If nothing else it doesn't hurt to do this in one driver in a generic
>>>>>>way, and once it has been proven to address all the needs of that one
>>>>>>driver we can then start moving other drivers to it. The current
>>>>>>solution is quite generic, that was my contribution to this patch set
>>>>>>as I didn't like how invasive it was being to virtio and thought it
>>>>>>would be best to keep this as minimally invasive as possible.
>>>>>>
>>>>>>My preference would be to give this a release or two in virtio to
>>>>>>mature before we start pushing it onto other drivers. It shouldn't
>>>>>>take much to cut/paste this into a new driver file once we decide it
>>>>>>is time to start extending it out to other drivers.
>>>>>
>>>>> I'm not talking about cut/paste and in fact that is what I'm worried
>>>>> about. I'm talking about common code in net/core/ or somewhere that
>>>>> would take care of this in-driver bonding. Each driver, like virtio_net,
>>>>> netvsc would just register some ops to it and the core would do all
>>>>> logic. I believe it is essential take this approach from the start.
>>>>
>>>>Sorry, I didn't mean cut/paste into another driver, I meant to make it
>>>>a driver of its own. My thought was to eventually create a shared/core
>>>>driver module that is then used by the other drivers.
>>>>
>>>>My concern right now is that Stephen has indicated he doesn't want
>>>>this approach taken with netvsc, and most of the community doesn't
>>>
>>> IIUC, he only does not like the extra netdev. Is there anything else?
>>
>>Nope that is pretty much it. It doesn't seem like a big deal for
>>virtio, but for netvsc it is significant since they don't have any
>>"backup" bit feature differentiation, so they would likely be stuck
>>with 2 netdevs even in their basic setup.
>
> Okay. If that is a strict "no-go" for netvsc, this should be
> just a flag passed down to the in-driver bond code.

Are you serious? We might as well just do a per-driver bond then if
that is what you want. Once you go back to the "2 netdev" model for
this the bond becomes tightly woven into the driver and becomes a
separate beast entirely. At that point sharing kind of goes out the
window since you have to be tightly coupled into all of the per-driver
ops. I would argue there is no way to do the "2 netdev" model
generically. It is kind of inherent to the "2 netdev" model in the
first place since you can't have a third driver pop up so now
everything is pulled into the paravirtual interface unless you invert
everything and require the netvsc driver to provide the driver with a
set of function pointers allowing it to call back into it. In addition
you suddenly have to deal with all the qdisc and Tx queue locking
mess. So the 3 netdev model let the driver be lockless and run with no
queue disc. Are you telling us you expect our solution to run in both
modes or are you pushing the qdisc overhead and Tx queue locking into
the 3 netdev model?

What it ultimately comes down to is how do you create a new netdev
without exposing a new netdev? In the 3 netdev model this all makes
sense as we can leave the paravirtual interface in tact. Now you are
telling us that based on a flag we either have to embed ourselves into
the paravirtual interface without exposing our operations, or we have
to embed the paravirtual interface into our device without letting it
be visible. The sheer overhead of that will end up more then doubling
the code size for just the core bits of this patch.

>>
>>>>want the netvsc approach applied to virtio. Until that impasse can be
>>>>resolved there isn't much value in trying to split this up so it is
>>>>available to other drivers. In addition I would imagine it would make
>>>>it a pain for others to back-port into distros since it would break
>>>>legacy netvsc driver behavior. Patches are always welcome. Once this
>>>>is in you are free to try fighting to get this made into a generic
>>>>module and applied to both drivers, but we have already spent close to
>>>>3 months on this and it seems like there has been significantly more
>>>
>>> Alex, time is never a good argument for poor design and shortcuts.
>>
>>I'm not saying we should go with a poor design due to time. But
>>expecting us to implement something where the maintainer of said
>>driver has not agreed to is pointless, and I don't see it as a design
>
> He just does not like the third netdev, not the fact that the code
> for in-driver bonding would be shared.

Yes, but losing the third/master netdev suddenly steps up the
complexity several fold. I would rather not support it at all. In
addition Jakub and others were strongly opposed the "abomination" as I
believe you guys called it, and quite frankly I think it is a bad
design decision made for cosmetic reasons.

>>shortcut to implement something in one driver with the expectation
>>that we will then make it core later once it has proven itself and has
>>use elsewhere. In the meantime I would imagine it also makes it easier
>>for things like backports and such for us to do it this way since we
>>are only impacting one driver.
>
> When you are working on upstream kernel, you should not care about
> backports. That leads to poor design. Not an argument.

I disagree. When you have to think about maintenance of things it is
quite useful to think about backports. Otherwise older versions of
your code become unmaintainable.

>>
>>You are telling us to do something that not everyone has agreed to.
>
> Who did not?

As far as I know you are the only person asking that this has to
support both models. At least with us only supporting 3 netdevs, it
was only the 2 netdev crowd that was unhappy with it. I don't know who
is going to be happy with us doing an either/or setup that will
ultimately more confusing from a user experience perspective since now
you are saying this thing is either a hybrid slave/master, or just a
master depending on who is using it.

>>Currently we only have agreement from Michael on taking this code, as
>>such we are working with virtio only for now. When the time comes that
>
> If you do duplication of netvsc in-driver bonding in virtio_net, it will
> stay there forever. So what you say is: "We will do it halfway now
> and promise to fix it later". That later will never happen, I'm pretty
> sure. That is why I push for in-driver bonding shared code as a part of
> this patchset.

You want this new approach and a copy of netvsc moved into either core
or some module of its own. I say pick an architecture. We are looking
at either 2 netdevs or 3. We are not going to support both because
that will ultimately lead to a terrible user experience and make
things quite confusing.

> + if you would be pushing first driver to do this, I would understand.
> But the first driver is already in. You are pushing second. This is the
> time to do the sharing, unification of behaviour. Next time is too late.

That is great, if we want to share then lets share. But what you are
essentially telling us is that we need to fork this solution and
maintain two code paths, one for 2 netdevs, and another for 3. At that
point what is the point in merging them together?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-04 21:58                           ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-04 21:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote:
>>>>On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote:
>>>>>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote:
>>>>>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>>>>>>>>> >Yeah, this code essentially calls out the "shareable" code with a
>>>>>>>>> >comment at the start and end of the section what defines the
>>>>>>>>> >virtio_bypass functionality. It would just be a matter of mostly
>>>>>>>>> >cutting and pasting to put it into a separate driver module.
>>>>>>>>>
>>>>>>>>> Please put it there and unite the use of it with netvsc.
>>>>>>>>
>>>>>>>>Surely, adding this to other drivers (e.g. might this be handy for xen
>>>>>>>>too?) can be left for a separate patchset. Let's get one device merged
>>>>>>>>first.
>>>>>>>
>>>>>>> Why? Let's do the generic infra alongside with the driver. I see no good
>>>>>>> reason to rush into merging driver and only later, if ever, to convert
>>>>>>> it to generic solution. On contrary. That would lead into multiple
>>>>>>> approaches and different behavious in multiple drivers. That is plain
>>>>>>> wrong.
>>>>>>
>>>>>>If nothing else it doesn't hurt to do this in one driver in a generic
>>>>>>way, and once it has been proven to address all the needs of that one
>>>>>>driver we can then start moving other drivers to it. The current
>>>>>>solution is quite generic, that was my contribution to this patch set
>>>>>>as I didn't like how invasive it was being to virtio and thought it
>>>>>>would be best to keep this as minimally invasive as possible.
>>>>>>
>>>>>>My preference would be to give this a release or two in virtio to
>>>>>>mature before we start pushing it onto other drivers. It shouldn't
>>>>>>take much to cut/paste this into a new driver file once we decide it
>>>>>>is time to start extending it out to other drivers.
>>>>>
>>>>> I'm not talking about cut/paste and in fact that is what I'm worried
>>>>> about. I'm talking about common code in net/core/ or somewhere that
>>>>> would take care of this in-driver bonding. Each driver, like virtio_net,
>>>>> netvsc would just register some ops to it and the core would do all
>>>>> logic. I believe it is essential take this approach from the start.
>>>>
>>>>Sorry, I didn't mean cut/paste into another driver, I meant to make it
>>>>a driver of its own. My thought was to eventually create a shared/core
>>>>driver module that is then used by the other drivers.
>>>>
>>>>My concern right now is that Stephen has indicated he doesn't want
>>>>this approach taken with netvsc, and most of the community doesn't
>>>
>>> IIUC, he only does not like the extra netdev. Is there anything else?
>>
>>Nope that is pretty much it. It doesn't seem like a big deal for
>>virtio, but for netvsc it is significant since they don't have any
>>"backup" bit feature differentiation, so they would likely be stuck
>>with 2 netdevs even in their basic setup.
>
> Okay. If that is a strict "no-go" for netvsc, this should be
> just a flag passed down to the in-driver bond code.

Are you serious? We might as well just do a per-driver bond then if
that is what you want. Once you go back to the "2 netdev" model for
this the bond becomes tightly woven into the driver and becomes a
separate beast entirely. At that point sharing kind of goes out the
window since you have to be tightly coupled into all of the per-driver
ops. I would argue there is no way to do the "2 netdev" model
generically. It is kind of inherent to the "2 netdev" model in the
first place since you can't have a third driver pop up so now
everything is pulled into the paravirtual interface unless you invert
everything and require the netvsc driver to provide the driver with a
set of function pointers allowing it to call back into it. In addition
you suddenly have to deal with all the qdisc and Tx queue locking
mess. So the 3 netdev model let the driver be lockless and run with no
queue disc. Are you telling us you expect our solution to run in both
modes or are you pushing the qdisc overhead and Tx queue locking into
the 3 netdev model?

What it ultimately comes down to is how do you create a new netdev
without exposing a new netdev? In the 3 netdev model this all makes
sense as we can leave the paravirtual interface in tact. Now you are
telling us that based on a flag we either have to embed ourselves into
the paravirtual interface without exposing our operations, or we have
to embed the paravirtual interface into our device without letting it
be visible. The sheer overhead of that will end up more then doubling
the code size for just the core bits of this patch.

>>
>>>>want the netvsc approach applied to virtio. Until that impasse can be
>>>>resolved there isn't much value in trying to split this up so it is
>>>>available to other drivers. In addition I would imagine it would make
>>>>it a pain for others to back-port into distros since it would break
>>>>legacy netvsc driver behavior. Patches are always welcome. Once this
>>>>is in you are free to try fighting to get this made into a generic
>>>>module and applied to both drivers, but we have already spent close to
>>>>3 months on this and it seems like there has been significantly more
>>>
>>> Alex, time is never a good argument for poor design and shortcuts.
>>
>>I'm not saying we should go with a poor design due to time. But
>>expecting us to implement something where the maintainer of said
>>driver has not agreed to is pointless, and I don't see it as a design
>
> He just does not like the third netdev, not the fact that the code
> for in-driver bonding would be shared.

Yes, but losing the third/master netdev suddenly steps up the
complexity several fold. I would rather not support it at all. In
addition Jakub and others were strongly opposed the "abomination" as I
believe you guys called it, and quite frankly I think it is a bad
design decision made for cosmetic reasons.

>>shortcut to implement something in one driver with the expectation
>>that we will then make it core later once it has proven itself and has
>>use elsewhere. In the meantime I would imagine it also makes it easier
>>for things like backports and such for us to do it this way since we
>>are only impacting one driver.
>
> When you are working on upstream kernel, you should not care about
> backports. That leads to poor design. Not an argument.

I disagree. When you have to think about maintenance of things it is
quite useful to think about backports. Otherwise older versions of
your code become unmaintainable.

>>
>>You are telling us to do something that not everyone has agreed to.
>
> Who did not?

As far as I know you are the only person asking that this has to
support both models. At least with us only supporting 3 netdevs, it
was only the 2 netdev crowd that was unhappy with it. I don't know who
is going to be happy with us doing an either/or setup that will
ultimately more confusing from a user experience perspective since now
you are saying this thing is either a hybrid slave/master, or just a
master depending on who is using it.

>>Currently we only have agreement from Michael on taking this code, as
>>such we are working with virtio only for now. When the time comes that
>
> If you do duplication of netvsc in-driver bonding in virtio_net, it will
> stay there forever. So what you say is: "We will do it halfway now
> and promise to fix it later". That later will never happen, I'm pretty
> sure. That is why I push for in-driver bonding shared code as a part of
> this patchset.

You want this new approach and a copy of netvsc moved into either core
or some module of its own. I say pick an architecture. We are looking
at either 2 netdevs or 3. We are not going to support both because
that will ultimately lead to a terrible user experience and make
things quite confusing.

> + if you would be pushing first driver to do this, I would understand.
> But the first driver is already in. You are pushing second. This is the
> time to do the sharing, unification of behaviour. Next time is too late.

That is great, if we want to share then lets share. But what you are
essentially telling us is that we need to fork this solution and
maintain two code paths, one for 2 netdevs, and another for 3. At that
point what is the point in merging them together?

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-04 21:58                           ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-03-05  9:21                           ` Jiri Pirko
  2018-03-05 16:11                             ` Stephen Hemminger
  -1 siblings, 1 reply; 102+ messages in thread
From: Jiri Pirko @ 2018-03-05  9:21 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
>On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:

[...]

>
>>>Currently we only have agreement from Michael on taking this code, as
>>>such we are working with virtio only for now. When the time comes that
>>
>> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>> stay there forever. So what you say is: "We will do it halfway now
>> and promise to fix it later". That later will never happen, I'm pretty
>> sure. That is why I push for in-driver bonding shared code as a part of
>> this patchset.
>
>You want this new approach and a copy of netvsc moved into either core
>or some module of its own. I say pick an architecture. We are looking
>at either 2 netdevs or 3. We are not going to support both because
>that will ultimately lead to a terrible user experience and make
>things quite confusing.
>
>> + if you would be pushing first driver to do this, I would understand.
>> But the first driver is already in. You are pushing second. This is the
>> time to do the sharing, unification of behaviour. Next time is too late.
>
>That is great, if we want to share then lets share. But what you are
>essentially telling us is that we need to fork this solution and
>maintain two code paths, one for 2 netdevs, and another for 3. At that
>point what is the point in merging them together?

Of course, I vote for the same behaviour for netvsc and virtio_net. That
is my point from the very beginning.

Stephen, what do you think? Could we please make virtio_net and netvsc
behave the same and to use a single code with well-defined checks and
restrictions for this feature?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-05  9:21                           ` Jiri Pirko
@ 2018-03-05 16:11                             ` Stephen Hemminger
  2018-03-05 22:30                               ` Jiri Pirko
  0 siblings, 1 reply; 102+ messages in thread
From: Stephen Hemminger @ 2018-03-05 16:11 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexander Duyck, Michael S. Tsirkin, Sridhar Samudrala,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Mon, 5 Mar 2018 10:21:18 +0100
Jiri Pirko <jiri@resnulli.us> wrote:

> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:  
> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:  
> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:  
> 
> [...]
> 
> >  
> >>>Currently we only have agreement from Michael on taking this code, as
> >>>such we are working with virtio only for now. When the time comes that  
> >>
> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
> >> stay there forever. So what you say is: "We will do it halfway now
> >> and promise to fix it later". That later will never happen, I'm pretty
> >> sure. That is why I push for in-driver bonding shared code as a part of
> >> this patchset.  
> >
> >You want this new approach and a copy of netvsc moved into either core
> >or some module of its own. I say pick an architecture. We are looking
> >at either 2 netdevs or 3. We are not going to support both because
> >that will ultimately lead to a terrible user experience and make
> >things quite confusing.
> >  
> >> + if you would be pushing first driver to do this, I would understand.
> >> But the first driver is already in. You are pushing second. This is the
> >> time to do the sharing, unification of behaviour. Next time is too late.  
> >
> >That is great, if we want to share then lets share. But what you are
> >essentially telling us is that we need to fork this solution and
> >maintain two code paths, one for 2 netdevs, and another for 3. At that
> >point what is the point in merging them together?  
> 
> Of course, I vote for the same behaviour for netvsc and virtio_net. That
> is my point from the very beginning.
> 
> Stephen, what do you think? Could we please make virtio_net and netvsc
> behave the same and to use a single code with well-defined checks and
> restrictions for this feature?

Eventually, yes both could share common code routines. In reality,
the failover stuff is only a very small part of either driver so
it is not worth stretching to try and cover too much. If you look,
the failover code is just using routines that already exist for
use by bonding, teaming, etc.

There will always be two drivers, the ring buffers and buffering
are very different between vmbus and virtio. It would help to address
some of the awkward stuff like queue selection and offload handling
in a common way.

Don't worry too much about backports. The backport can use the
old code if necessary.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-05 16:11                             ` Stephen Hemminger
@ 2018-03-05 22:30                               ` Jiri Pirko
  2018-03-05 22:47                                   ` [virtio-dev] " Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Jiri Pirko @ 2018-03-05 22:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Alexander Duyck, Michael S. Tsirkin, Sridhar Samudrala,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote:
>On Mon, 5 Mar 2018 10:21:18 +0100
>Jiri Pirko <jiri@resnulli.us> wrote:
>
>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:  
>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:  
>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:  
>> 
>> [...]
>> 
>> >  
>> >>>Currently we only have agreement from Michael on taking this code, as
>> >>>such we are working with virtio only for now. When the time comes that  
>> >>
>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>> >> stay there forever. So what you say is: "We will do it halfway now
>> >> and promise to fix it later". That later will never happen, I'm pretty
>> >> sure. That is why I push for in-driver bonding shared code as a part of
>> >> this patchset.  
>> >
>> >You want this new approach and a copy of netvsc moved into either core
>> >or some module of its own. I say pick an architecture. We are looking
>> >at either 2 netdevs or 3. We are not going to support both because
>> >that will ultimately lead to a terrible user experience and make
>> >things quite confusing.
>> >  
>> >> + if you would be pushing first driver to do this, I would understand.
>> >> But the first driver is already in. You are pushing second. This is the
>> >> time to do the sharing, unification of behaviour. Next time is too late.  
>> >
>> >That is great, if we want to share then lets share. But what you are
>> >essentially telling us is that we need to fork this solution and
>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>> >point what is the point in merging them together?  
>> 
>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>> is my point from the very beginning.
>> 
>> Stephen, what do you think? Could we please make virtio_net and netvsc
>> behave the same and to use a single code with well-defined checks and
>> restrictions for this feature?
>
>Eventually, yes both could share common code routines. In reality,
>the failover stuff is only a very small part of either driver so
>it is not worth stretching to try and cover too much. If you look,
>the failover code is just using routines that already exist for
>use by bonding, teaming, etc.

Yeah, we consern was also about the code that processes the netdev
notifications and does auto-enslave and all related stuff.


>
>There will always be two drivers, the ring buffers and buffering
>are very different between vmbus and virtio. It would help to address
>some of the awkward stuff like queue selection and offload handling
>in a common way.

Agreed.


>
>Don't worry too much about backports. The backport can use the
>old code if necessary.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-05 22:30                               ` Jiri Pirko
@ 2018-03-05 22:47                                   ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-05 22:47 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Stephen Hemminger, Michael S. Tsirkin, Sridhar Samudrala,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote:
>>On Mon, 5 Mar 2018 10:21:18 +0100
>>Jiri Pirko <jiri@resnulli.us> wrote:
>>
>>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
>>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>
>>> [...]
>>>
>>> >
>>> >>>Currently we only have agreement from Michael on taking this code, as
>>> >>>such we are working with virtio only for now. When the time comes that
>>> >>
>>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>>> >> stay there forever. So what you say is: "We will do it halfway now
>>> >> and promise to fix it later". That later will never happen, I'm pretty
>>> >> sure. That is why I push for in-driver bonding shared code as a part of
>>> >> this patchset.
>>> >
>>> >You want this new approach and a copy of netvsc moved into either core
>>> >or some module of its own. I say pick an architecture. We are looking
>>> >at either 2 netdevs or 3. We are not going to support both because
>>> >that will ultimately lead to a terrible user experience and make
>>> >things quite confusing.
>>> >
>>> >> + if you would be pushing first driver to do this, I would understand.
>>> >> But the first driver is already in. You are pushing second. This is the
>>> >> time to do the sharing, unification of behaviour. Next time is too late.
>>> >
>>> >That is great, if we want to share then lets share. But what you are
>>> >essentially telling us is that we need to fork this solution and
>>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>>> >point what is the point in merging them together?
>>>
>>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>>> is my point from the very beginning.
>>>
>>> Stephen, what do you think? Could we please make virtio_net and netvsc
>>> behave the same and to use a single code with well-defined checks and
>>> restrictions for this feature?
>>
>>Eventually, yes both could share common code routines. In reality,
>>the failover stuff is only a very small part of either driver so
>>it is not worth stretching to try and cover too much. If you look,
>>the failover code is just using routines that already exist for
>>use by bonding, teaming, etc.
>
> Yeah, we consern was also about the code that processes the netdev
> notifications and does auto-enslave and all related stuff.

The concern was the driver model. If we expose 3 netdevs or 2 with the
VF driver present. Somehow this is turning into a "merge netvsc into
virtio" think and that isn't the subject that was being asked.

Ideally we want one model for this. Either 3 netdevs or 2. The problem
is 2 causes issues in terms of performance and will limit features of
virtio, but 2 is the precedent set by netvsc. We need to figure out
the path forward for this. There is talk about "sharing" but it is
hard to make these two approaches share code when they are doing two
very different setups and end up presenting themselves as two very
different driver models.

>>
>>There will always be two drivers, the ring buffers and buffering
>>are very different between vmbus and virtio. It would help to address
>>some of the awkward stuff like queue selection and offload handling
>>in a common way.
>
> Agreed.

There are going to end up being three drivers by the time we are done.
We will end up with netvsc, virtio, and some shared block of
functionality that is used between the two of them. At least that is
the assumption if the two are going to share code. I don't know if
everyone will want to take on the extra overhead for the code shared
between these two drivers being a part of the core net code.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-05 22:47                                   ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-05 22:47 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Stephen Hemminger, Michael S. Tsirkin, Sridhar Samudrala,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote:
>>On Mon, 5 Mar 2018 10:21:18 +0100
>>Jiri Pirko <jiri@resnulli.us> wrote:
>>
>>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
>>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>
>>> [...]
>>>
>>> >
>>> >>>Currently we only have agreement from Michael on taking this code, as
>>> >>>such we are working with virtio only for now. When the time comes that
>>> >>
>>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>>> >> stay there forever. So what you say is: "We will do it halfway now
>>> >> and promise to fix it later". That later will never happen, I'm pretty
>>> >> sure. That is why I push for in-driver bonding shared code as a part of
>>> >> this patchset.
>>> >
>>> >You want this new approach and a copy of netvsc moved into either core
>>> >or some module of its own. I say pick an architecture. We are looking
>>> >at either 2 netdevs or 3. We are not going to support both because
>>> >that will ultimately lead to a terrible user experience and make
>>> >things quite confusing.
>>> >
>>> >> + if you would be pushing first driver to do this, I would understand.
>>> >> But the first driver is already in. You are pushing second. This is the
>>> >> time to do the sharing, unification of behaviour. Next time is too late.
>>> >
>>> >That is great, if we want to share then lets share. But what you are
>>> >essentially telling us is that we need to fork this solution and
>>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>>> >point what is the point in merging them together?
>>>
>>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>>> is my point from the very beginning.
>>>
>>> Stephen, what do you think? Could we please make virtio_net and netvsc
>>> behave the same and to use a single code with well-defined checks and
>>> restrictions for this feature?
>>
>>Eventually, yes both could share common code routines. In reality,
>>the failover stuff is only a very small part of either driver so
>>it is not worth stretching to try and cover too much. If you look,
>>the failover code is just using routines that already exist for
>>use by bonding, teaming, etc.
>
> Yeah, we consern was also about the code that processes the netdev
> notifications and does auto-enslave and all related stuff.

The concern was the driver model. If we expose 3 netdevs or 2 with the
VF driver present. Somehow this is turning into a "merge netvsc into
virtio" think and that isn't the subject that was being asked.

Ideally we want one model for this. Either 3 netdevs or 2. The problem
is 2 causes issues in terms of performance and will limit features of
virtio, but 2 is the precedent set by netvsc. We need to figure out
the path forward for this. There is talk about "sharing" but it is
hard to make these two approaches share code when they are doing two
very different setups and end up presenting themselves as two very
different driver models.

>>
>>There will always be two drivers, the ring buffers and buffering
>>are very different between vmbus and virtio. It would help to address
>>some of the awkward stuff like queue selection and offload handling
>>in a common way.
>
> Agreed.

There are going to end up being three drivers by the time we are done.
We will end up with netvsc, virtio, and some shared block of
functionality that is used between the two of them. At least that is
the assumption if the two are going to share code. I don't know if
everyone will want to take on the extra overhead for the code shared
between these two drivers being a part of the core net code.

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-05 22:47                                   ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-03-06  3:15                                   ` Stephen Hemminger
  2018-03-06 19:08                                       ` [virtio-dev] " Alexander Duyck
  -1 siblings, 1 reply; 102+ messages in thread
From: Stephen Hemminger @ 2018-03-06  3:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiri Pirko, Michael S. Tsirkin, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Mon, 5 Mar 2018 14:47:20 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote:  
> >>On Mon, 5 Mar 2018 10:21:18 +0100
> >>Jiri Pirko <jiri@resnulli.us> wrote:
> >>  
> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:  
> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:  
> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:  
> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:  
> >>>
> >>> [...]
> >>>  
> >>> >  
> >>> >>>Currently we only have agreement from Michael on taking this code, as
> >>> >>>such we are working with virtio only for now. When the time comes that  
> >>> >>
> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
> >>> >> stay there forever. So what you say is: "We will do it halfway now
> >>> >> and promise to fix it later". That later will never happen, I'm pretty
> >>> >> sure. That is why I push for in-driver bonding shared code as a part of
> >>> >> this patchset.  
> >>> >
> >>> >You want this new approach and a copy of netvsc moved into either core
> >>> >or some module of its own. I say pick an architecture. We are looking
> >>> >at either 2 netdevs or 3. We are not going to support both because
> >>> >that will ultimately lead to a terrible user experience and make
> >>> >things quite confusing.
> >>> >  
> >>> >> + if you would be pushing first driver to do this, I would understand.
> >>> >> But the first driver is already in. You are pushing second. This is the
> >>> >> time to do the sharing, unification of behaviour. Next time is too late.  
> >>> >
> >>> >That is great, if we want to share then lets share. But what you are
> >>> >essentially telling us is that we need to fork this solution and
> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
> >>> >point what is the point in merging them together?  
> >>>
> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
> >>> is my point from the very beginning.
> >>>
> >>> Stephen, what do you think? Could we please make virtio_net and netvsc
> >>> behave the same and to use a single code with well-defined checks and
> >>> restrictions for this feature?  
> >>
> >>Eventually, yes both could share common code routines. In reality,
> >>the failover stuff is only a very small part of either driver so
> >>it is not worth stretching to try and cover too much. If you look,
> >>the failover code is just using routines that already exist for
> >>use by bonding, teaming, etc.  
> >
> > Yeah, we consern was also about the code that processes the netdev
> > notifications and does auto-enslave and all related stuff.  
> 
> The concern was the driver model. If we expose 3 netdevs or 2 with the
> VF driver present. Somehow this is turning into a "merge netvsc into
> virtio" think and that isn't the subject that was being asked.
> 
> Ideally we want one model for this. Either 3 netdevs or 2. The problem
> is 2 causes issues in terms of performance and will limit features of
> virtio, but 2 is the precedent set by netvsc. We need to figure out
> the path forward for this. There is talk about "sharing" but it is
> hard to make these two approaches share code when they are doing two
> very different setups and end up presenting themselves as two very
> different driver models.

I appreciate this discussion, and it has helped a lot.

Netvsc is stuck with 2 netdev model for the foreseeable future.
We already failed once with the bonding model, and that created a lot of
pain. The current model is working well and have convinced the major distros
to support the two netdev model and don't want to back.

Very open to optimizations and ways to smooth out the rough edges.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-06  3:15                                   ` Stephen Hemminger
@ 2018-03-06 19:08                                       ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-06 19:08 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jiri Pirko, Michael S. Tsirkin, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Mon, 5 Mar 2018 14:47:20 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote:
>> >>On Mon, 5 Mar 2018 10:21:18 +0100
>> >>Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
>> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> >>>
>> >>> [...]
>> >>>
>> >>> >
>> >>> >>>Currently we only have agreement from Michael on taking this code, as
>> >>> >>>such we are working with virtio only for now. When the time comes that
>> >>> >>
>> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>> >>> >> stay there forever. So what you say is: "We will do it halfway now
>> >>> >> and promise to fix it later". That later will never happen, I'm pretty
>> >>> >> sure. That is why I push for in-driver bonding shared code as a part of
>> >>> >> this patchset.
>> >>> >
>> >>> >You want this new approach and a copy of netvsc moved into either core
>> >>> >or some module of its own. I say pick an architecture. We are looking
>> >>> >at either 2 netdevs or 3. We are not going to support both because
>> >>> >that will ultimately lead to a terrible user experience and make
>> >>> >things quite confusing.
>> >>> >
>> >>> >> + if you would be pushing first driver to do this, I would understand.
>> >>> >> But the first driver is already in. You are pushing second. This is the
>> >>> >> time to do the sharing, unification of behaviour. Next time is too late.
>> >>> >
>> >>> >That is great, if we want to share then lets share. But what you are
>> >>> >essentially telling us is that we need to fork this solution and
>> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>> >>> >point what is the point in merging them together?
>> >>>
>> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>> >>> is my point from the very beginning.
>> >>>
>> >>> Stephen, what do you think? Could we please make virtio_net and netvsc
>> >>> behave the same and to use a single code with well-defined checks and
>> >>> restrictions for this feature?
>> >>
>> >>Eventually, yes both could share common code routines. In reality,
>> >>the failover stuff is only a very small part of either driver so
>> >>it is not worth stretching to try and cover too much. If you look,
>> >>the failover code is just using routines that already exist for
>> >>use by bonding, teaming, etc.
>> >
>> > Yeah, we consern was also about the code that processes the netdev
>> > notifications and does auto-enslave and all related stuff.
>>
>> The concern was the driver model. If we expose 3 netdevs or 2 with the
>> VF driver present. Somehow this is turning into a "merge netvsc into
>> virtio" think and that isn't the subject that was being asked.
>>
>> Ideally we want one model for this. Either 3 netdevs or 2. The problem
>> is 2 causes issues in terms of performance and will limit features of
>> virtio, but 2 is the precedent set by netvsc. We need to figure out
>> the path forward for this. There is talk about "sharing" but it is
>> hard to make these two approaches share code when they are doing two
>> very different setups and end up presenting themselves as two very
>> different driver models.
>
> I appreciate this discussion, and it has helped a lot.
>
> Netvsc is stuck with 2 netdev model for the foreseeable future.
> We already failed once with the bonding model, and that created a lot of
> pain. The current model is working well and have convinced the major distros
> to support the two netdev model and don't want to back.
>
> Very open to optimizations and ways to smooth out the rough edges.

Thank you for clarifying this Stephen.

Okay. So with things defined such that we are doing a 2 netdev model
for netvsc, and a 3 netdev model for virtio, is it still in our
interest for us to try making a shared library between the two? In my
mind, the virtnet_bypass becomes the way we go forward for any future
solutions. I say we treat the netvsc approach as a "legacy" approach
and avoid creating any new libraries or drivers to support it, and
instead just focus on the 3 netdev approach as the way this is to be
done going forward. That way we avoid anyone else trying to implement
something like the 2 netdev solution in the future.

So getting back to the code here. Should we split the virtnet_bypass
code out into a separate module? My preference would be to let this
incubate as a part of virtio_net until either there is another user,
or it becomes big enough that it needs to be moved. That said, there
can be arguments for either keeping it in, or making it a module on
its own depending on where we see this going in the future.

- Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-06 19:08                                       ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-06 19:08 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jiri Pirko, Michael S. Tsirkin, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Mon, 5 Mar 2018 14:47:20 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote:
>> >>On Mon, 5 Mar 2018 10:21:18 +0100
>> >>Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
>> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> >>>
>> >>> [...]
>> >>>
>> >>> >
>> >>> >>>Currently we only have agreement from Michael on taking this code, as
>> >>> >>>such we are working with virtio only for now. When the time comes that
>> >>> >>
>> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>> >>> >> stay there forever. So what you say is: "We will do it halfway now
>> >>> >> and promise to fix it later". That later will never happen, I'm pretty
>> >>> >> sure. That is why I push for in-driver bonding shared code as a part of
>> >>> >> this patchset.
>> >>> >
>> >>> >You want this new approach and a copy of netvsc moved into either core
>> >>> >or some module of its own. I say pick an architecture. We are looking
>> >>> >at either 2 netdevs or 3. We are not going to support both because
>> >>> >that will ultimately lead to a terrible user experience and make
>> >>> >things quite confusing.
>> >>> >
>> >>> >> + if you would be pushing first driver to do this, I would understand.
>> >>> >> But the first driver is already in. You are pushing second. This is the
>> >>> >> time to do the sharing, unification of behaviour. Next time is too late.
>> >>> >
>> >>> >That is great, if we want to share then lets share. But what you are
>> >>> >essentially telling us is that we need to fork this solution and
>> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>> >>> >point what is the point in merging them together?
>> >>>
>> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>> >>> is my point from the very beginning.
>> >>>
>> >>> Stephen, what do you think? Could we please make virtio_net and netvsc
>> >>> behave the same and to use a single code with well-defined checks and
>> >>> restrictions for this feature?
>> >>
>> >>Eventually, yes both could share common code routines. In reality,
>> >>the failover stuff is only a very small part of either driver so
>> >>it is not worth stretching to try and cover too much. If you look,
>> >>the failover code is just using routines that already exist for
>> >>use by bonding, teaming, etc.
>> >
>> > Yeah, we consern was also about the code that processes the netdev
>> > notifications and does auto-enslave and all related stuff.
>>
>> The concern was the driver model. If we expose 3 netdevs or 2 with the
>> VF driver present. Somehow this is turning into a "merge netvsc into
>> virtio" think and that isn't the subject that was being asked.
>>
>> Ideally we want one model for this. Either 3 netdevs or 2. The problem
>> is 2 causes issues in terms of performance and will limit features of
>> virtio, but 2 is the precedent set by netvsc. We need to figure out
>> the path forward for this. There is talk about "sharing" but it is
>> hard to make these two approaches share code when they are doing two
>> very different setups and end up presenting themselves as two very
>> different driver models.
>
> I appreciate this discussion, and it has helped a lot.
>
> Netvsc is stuck with 2 netdev model for the foreseeable future.
> We already failed once with the bonding model, and that created a lot of
> pain. The current model is working well and have convinced the major distros
> to support the two netdev model and don't want to back.
>
> Very open to optimizations and ways to smooth out the rough edges.

Thank you for clarifying this Stephen.

Okay. So with things defined such that we are doing a 2 netdev model
for netvsc, and a 3 netdev model for virtio, is it still in our
interest for us to try making a shared library between the two? In my
mind, the virtnet_bypass becomes the way we go forward for any future
solutions. I say we treat the netvsc approach as a "legacy" approach
and avoid creating any new libraries or drivers to support it, and
instead just focus on the 3 netdev approach as the way this is to be
done going forward. That way we avoid anyone else trying to implement
something like the 2 netdev solution in the future.

So getting back to the code here. Should we split the virtnet_bypass
code out into a separate module? My preference would be to let this
incubate as a part of virtio_net until either there is another user,
or it becomes big enough that it needs to be moved. That said, there
can be arguments for either keeping it in, or making it a module on
its own depending on where we see this going in the future.

- Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-06 19:08                                       ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-03-06 22:59                                       ` Jiri Pirko
  2018-03-06 23:27                                           ` [virtio-dev] " Alexander Duyck
  -1 siblings, 1 reply; 102+ messages in thread
From: Jiri Pirko @ 2018-03-06 22:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Stephen Hemminger, Michael S. Tsirkin, Sridhar Samudrala,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

Tue, Mar 06, 2018 at 08:08:21PM CET, alexander.duyck@gmail.com wrote:
>On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger
><stephen@networkplumber.org> wrote:
>> On Mon, 5 Mar 2018 14:47:20 -0800
>> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>
>>> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote:
>>> >>On Mon, 5 Mar 2018 10:21:18 +0100
>>> >>Jiri Pirko <jiri@resnulli.us> wrote:
>>> >>
>>> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
>>> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>>> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> >>>
>>> >>> [...]
>>> >>>
>>> >>> >
>>> >>> >>>Currently we only have agreement from Michael on taking this code, as
>>> >>> >>>such we are working with virtio only for now. When the time comes that
>>> >>> >>
>>> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>>> >>> >> stay there forever. So what you say is: "We will do it halfway now
>>> >>> >> and promise to fix it later". That later will never happen, I'm pretty
>>> >>> >> sure. That is why I push for in-driver bonding shared code as a part of
>>> >>> >> this patchset.
>>> >>> >
>>> >>> >You want this new approach and a copy of netvsc moved into either core
>>> >>> >or some module of its own. I say pick an architecture. We are looking
>>> >>> >at either 2 netdevs or 3. We are not going to support both because
>>> >>> >that will ultimately lead to a terrible user experience and make
>>> >>> >things quite confusing.
>>> >>> >
>>> >>> >> + if you would be pushing first driver to do this, I would understand.
>>> >>> >> But the first driver is already in. You are pushing second. This is the
>>> >>> >> time to do the sharing, unification of behaviour. Next time is too late.
>>> >>> >
>>> >>> >That is great, if we want to share then lets share. But what you are
>>> >>> >essentially telling us is that we need to fork this solution and
>>> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>>> >>> >point what is the point in merging them together?
>>> >>>
>>> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>>> >>> is my point from the very beginning.
>>> >>>
>>> >>> Stephen, what do you think? Could we please make virtio_net and netvsc
>>> >>> behave the same and to use a single code with well-defined checks and
>>> >>> restrictions for this feature?
>>> >>
>>> >>Eventually, yes both could share common code routines. In reality,
>>> >>the failover stuff is only a very small part of either driver so
>>> >>it is not worth stretching to try and cover too much. If you look,
>>> >>the failover code is just using routines that already exist for
>>> >>use by bonding, teaming, etc.
>>> >
>>> > Yeah, we consern was also about the code that processes the netdev
>>> > notifications and does auto-enslave and all related stuff.
>>>
>>> The concern was the driver model. If we expose 3 netdevs or 2 with the
>>> VF driver present. Somehow this is turning into a "merge netvsc into
>>> virtio" think and that isn't the subject that was being asked.
>>>
>>> Ideally we want one model for this. Either 3 netdevs or 2. The problem
>>> is 2 causes issues in terms of performance and will limit features of
>>> virtio, but 2 is the precedent set by netvsc. We need to figure out
>>> the path forward for this. There is talk about "sharing" but it is
>>> hard to make these two approaches share code when they are doing two
>>> very different setups and end up presenting themselves as two very
>>> different driver models.
>>
>> I appreciate this discussion, and it has helped a lot.
>>
>> Netvsc is stuck with 2 netdev model for the foreseeable future.
>> We already failed once with the bonding model, and that created a lot of
>> pain. The current model is working well and have convinced the major distros
>> to support the two netdev model and don't want to back.
>>
>> Very open to optimizations and ways to smooth out the rough edges.
>
>Thank you for clarifying this Stephen.
>
>Okay. So with things defined such that we are doing a 2 netdev model
>for netvsc, and a 3 netdev model for virtio, is it still in our
>interest for us to try making a shared library between the two? In my
>mind, the virtnet_bypass becomes the way we go forward for any future
>solutions. I say we treat the netvsc approach as a "legacy" approach
>and avoid creating any new libraries or drivers to support it, and
>instead just focus on the 3 netdev approach as the way this is to be
>done going forward. That way we avoid anyone else trying to implement
>something like the 2 netdev solution in the future.
>
>So getting back to the code here. Should we split the virtnet_bypass
>code out into a separate module? My preference would be to let this
>incubate as a part of virtio_net until either there is another user,
>or it becomes big enough that it needs to be moved. That said, there

What do you mean by "big enough"? I hope that is will never be extended
by anything. If you need to do anything more complex, you should use
bong/team.

I definitelly vote for a separate common shared code for both netvsc and
virtio_net - even if you use 2 and 3 netdev model, you could share the
common code. Strict checks and limitation should be in place.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-06 22:59                                       ` Jiri Pirko
@ 2018-03-06 23:27                                           ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-06 23:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Stephen Hemminger, Michael S. Tsirkin, Sridhar Samudrala,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Tue, Mar 6, 2018 at 2:59 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Mar 06, 2018 at 08:08:21PM CET, alexander.duyck@gmail.com wrote:
>>On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger
>><stephen@networkplumber.org> wrote:
>>> On Mon, 5 Mar 2018 14:47:20 -0800
>>> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>>
>>>> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote:
>>>> >>On Mon, 5 Mar 2018 10:21:18 +0100
>>>> >>Jiri Pirko <jiri@resnulli.us> wrote:
>>>> >>
>>>> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
>>>> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>>>> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> >>>
>>>> >>> [...]
>>>> >>>
>>>> >>> >
>>>> >>> >>>Currently we only have agreement from Michael on taking this code, as
>>>> >>> >>>such we are working with virtio only for now. When the time comes that
>>>> >>> >>
>>>> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>>>> >>> >> stay there forever. So what you say is: "We will do it halfway now
>>>> >>> >> and promise to fix it later". That later will never happen, I'm pretty
>>>> >>> >> sure. That is why I push for in-driver bonding shared code as a part of
>>>> >>> >> this patchset.
>>>> >>> >
>>>> >>> >You want this new approach and a copy of netvsc moved into either core
>>>> >>> >or some module of its own. I say pick an architecture. We are looking
>>>> >>> >at either 2 netdevs or 3. We are not going to support both because
>>>> >>> >that will ultimately lead to a terrible user experience and make
>>>> >>> >things quite confusing.
>>>> >>> >
>>>> >>> >> + if you would be pushing first driver to do this, I would understand.
>>>> >>> >> But the first driver is already in. You are pushing second. This is the
>>>> >>> >> time to do the sharing, unification of behaviour. Next time is too late.
>>>> >>> >
>>>> >>> >That is great, if we want to share then lets share. But what you are
>>>> >>> >essentially telling us is that we need to fork this solution and
>>>> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>>>> >>> >point what is the point in merging them together?
>>>> >>>
>>>> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>>>> >>> is my point from the very beginning.
>>>> >>>
>>>> >>> Stephen, what do you think? Could we please make virtio_net and netvsc
>>>> >>> behave the same and to use a single code with well-defined checks and
>>>> >>> restrictions for this feature?
>>>> >>
>>>> >>Eventually, yes both could share common code routines. In reality,
>>>> >>the failover stuff is only a very small part of either driver so
>>>> >>it is not worth stretching to try and cover too much. If you look,
>>>> >>the failover code is just using routines that already exist for
>>>> >>use by bonding, teaming, etc.
>>>> >
>>>> > Yeah, we consern was also about the code that processes the netdev
>>>> > notifications and does auto-enslave and all related stuff.
>>>>
>>>> The concern was the driver model. If we expose 3 netdevs or 2 with the
>>>> VF driver present. Somehow this is turning into a "merge netvsc into
>>>> virtio" think and that isn't the subject that was being asked.
>>>>
>>>> Ideally we want one model for this. Either 3 netdevs or 2. The problem
>>>> is 2 causes issues in terms of performance and will limit features of
>>>> virtio, but 2 is the precedent set by netvsc. We need to figure out
>>>> the path forward for this. There is talk about "sharing" but it is
>>>> hard to make these two approaches share code when they are doing two
>>>> very different setups and end up presenting themselves as two very
>>>> different driver models.
>>>
>>> I appreciate this discussion, and it has helped a lot.
>>>
>>> Netvsc is stuck with 2 netdev model for the foreseeable future.
>>> We already failed once with the bonding model, and that created a lot of
>>> pain. The current model is working well and have convinced the major distros
>>> to support the two netdev model and don't want to back.
>>>
>>> Very open to optimizations and ways to smooth out the rough edges.
>>
>>Thank you for clarifying this Stephen.
>>
>>Okay. So with things defined such that we are doing a 2 netdev model
>>for netvsc, and a 3 netdev model for virtio, is it still in our
>>interest for us to try making a shared library between the two? In my
>>mind, the virtnet_bypass becomes the way we go forward for any future
>>solutions. I say we treat the netvsc approach as a "legacy" approach
>>and avoid creating any new libraries or drivers to support it, and
>>instead just focus on the 3 netdev approach as the way this is to be
>>done going forward. That way we avoid anyone else trying to implement
>>something like the 2 netdev solution in the future.
>>
>>So getting back to the code here. Should we split the virtnet_bypass
>>code out into a separate module? My preference would be to let this
>>incubate as a part of virtio_net until either there is another user,
>>or it becomes big enough that it needs to be moved. That said, there
>
> What do you mean by "big enough"? I hope that is will never be extended
> by anything. If you need to do anything more complex, you should use
> bong/team.

I wasn't thinking so much extra features as enough workaround for
discovered issues. Really this should be more-or-less feature locked.

> I definitelly vote for a separate common shared code for both netvsc and
> virtio_net - even if you use 2 and 3 netdev model, you could share the
> common code. Strict checks and limitation should be in place.

Noted. But as I also mentioned there isn't that much "common" code
between the two models. I think if anything we could probably look at
peeling out a few bits such as "get_<iface>_bymac" which really would
become dev_get_by_mac_and_ops in order to find the device for the
notifiers. I probably wouldn't even put that in our driver and would
instead put it in the core code since it almost makes more sense
there. Beyond that sharing becomes much more challenging due to the
differences in the Rx and Tx paths that build out of the difference
between the 2 driver and 3 driver models.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-06 23:27                                           ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-06 23:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Stephen Hemminger, Michael S. Tsirkin, Sridhar Samudrala,
	David Miller, Netdev, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski

On Tue, Mar 6, 2018 at 2:59 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Mar 06, 2018 at 08:08:21PM CET, alexander.duyck@gmail.com wrote:
>>On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger
>><stephen@networkplumber.org> wrote:
>>> On Mon, 5 Mar 2018 14:47:20 -0800
>>> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>>
>>>> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote:
>>>> >>On Mon, 5 Mar 2018 10:21:18 +0100
>>>> >>Jiri Pirko <jiri@resnulli.us> wrote:
>>>> >>
>>>> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote:
>>>> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote:
>>>> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> >>>
>>>> >>> [...]
>>>> >>>
>>>> >>> >
>>>> >>> >>>Currently we only have agreement from Michael on taking this code, as
>>>> >>> >>>such we are working with virtio only for now. When the time comes that
>>>> >>> >>
>>>> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will
>>>> >>> >> stay there forever. So what you say is: "We will do it halfway now
>>>> >>> >> and promise to fix it later". That later will never happen, I'm pretty
>>>> >>> >> sure. That is why I push for in-driver bonding shared code as a part of
>>>> >>> >> this patchset.
>>>> >>> >
>>>> >>> >You want this new approach and a copy of netvsc moved into either core
>>>> >>> >or some module of its own. I say pick an architecture. We are looking
>>>> >>> >at either 2 netdevs or 3. We are not going to support both because
>>>> >>> >that will ultimately lead to a terrible user experience and make
>>>> >>> >things quite confusing.
>>>> >>> >
>>>> >>> >> + if you would be pushing first driver to do this, I would understand.
>>>> >>> >> But the first driver is already in. You are pushing second. This is the
>>>> >>> >> time to do the sharing, unification of behaviour. Next time is too late.
>>>> >>> >
>>>> >>> >That is great, if we want to share then lets share. But what you are
>>>> >>> >essentially telling us is that we need to fork this solution and
>>>> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that
>>>> >>> >point what is the point in merging them together?
>>>> >>>
>>>> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That
>>>> >>> is my point from the very beginning.
>>>> >>>
>>>> >>> Stephen, what do you think? Could we please make virtio_net and netvsc
>>>> >>> behave the same and to use a single code with well-defined checks and
>>>> >>> restrictions for this feature?
>>>> >>
>>>> >>Eventually, yes both could share common code routines. In reality,
>>>> >>the failover stuff is only a very small part of either driver so
>>>> >>it is not worth stretching to try and cover too much. If you look,
>>>> >>the failover code is just using routines that already exist for
>>>> >>use by bonding, teaming, etc.
>>>> >
>>>> > Yeah, we consern was also about the code that processes the netdev
>>>> > notifications and does auto-enslave and all related stuff.
>>>>
>>>> The concern was the driver model. If we expose 3 netdevs or 2 with the
>>>> VF driver present. Somehow this is turning into a "merge netvsc into
>>>> virtio" think and that isn't the subject that was being asked.
>>>>
>>>> Ideally we want one model for this. Either 3 netdevs or 2. The problem
>>>> is 2 causes issues in terms of performance and will limit features of
>>>> virtio, but 2 is the precedent set by netvsc. We need to figure out
>>>> the path forward for this. There is talk about "sharing" but it is
>>>> hard to make these two approaches share code when they are doing two
>>>> very different setups and end up presenting themselves as two very
>>>> different driver models.
>>>
>>> I appreciate this discussion, and it has helped a lot.
>>>
>>> Netvsc is stuck with 2 netdev model for the foreseeable future.
>>> We already failed once with the bonding model, and that created a lot of
>>> pain. The current model is working well and have convinced the major distros
>>> to support the two netdev model and don't want to back.
>>>
>>> Very open to optimizations and ways to smooth out the rough edges.
>>
>>Thank you for clarifying this Stephen.
>>
>>Okay. So with things defined such that we are doing a 2 netdev model
>>for netvsc, and a 3 netdev model for virtio, is it still in our
>>interest for us to try making a shared library between the two? In my
>>mind, the virtnet_bypass becomes the way we go forward for any future
>>solutions. I say we treat the netvsc approach as a "legacy" approach
>>and avoid creating any new libraries or drivers to support it, and
>>instead just focus on the 3 netdev approach as the way this is to be
>>done going forward. That way we avoid anyone else trying to implement
>>something like the 2 netdev solution in the future.
>>
>>So getting back to the code here. Should we split the virtnet_bypass
>>code out into a separate module? My preference would be to let this
>>incubate as a part of virtio_net until either there is another user,
>>or it becomes big enough that it needs to be moved. That said, there
>
> What do you mean by "big enough"? I hope that is will never be extended
> by anything. If you need to do anything more complex, you should use
> bong/team.

I wasn't thinking so much extra features as enough workaround for
discovered issues. Really this should be more-or-less feature locked.

> I definitelly vote for a separate common shared code for both netvsc and
> virtio_net - even if you use 2 and 3 netdev model, you could share the
> common code. Strict checks and limitation should be in place.

Noted. But as I also mentioned there isn't that much "common" code
between the two models. I think if anything we could probably look at
peeling out a few bits such as "get_<iface>_bymac" which really would
become dev_get_by_mac_and_ops in order to find the device for the
notifiers. I probably wouldn't even put that in our driver and would
instead put it in the core code since it almost makes more sense
there. Beyond that sharing becomes much more challenging due to the
differences in the Rx and Tx paths that build out of the difference
between the 2 driver and 3 driver models.

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-06 23:27                                           ` [virtio-dev] " Alexander Duyck
@ 2018-03-07  2:38                                             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-07  2:38 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiri Pirko, Stephen Hemminger, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:
> > I definitelly vote for a separate common shared code for both netvsc and
> > virtio_net - even if you use 2 and 3 netdev model, you could share the
> > common code. Strict checks and limitation should be in place.
> 
> Noted. But as I also mentioned there isn't that much "common" code
> between the two models. I think if anything we could probably look at
> peeling out a few bits such as "get_<iface>_bymac" which really would
> become dev_get_by_mac_and_ops in order to find the device for the
> notifiers. I probably wouldn't even put that in our driver and would
> instead put it in the core code since it almost makes more sense
> there. Beyond that sharing becomes much more challenging due to the
> differences in the Rx and Tx paths that build out of the difference
> between the 2 driver and 3 driver models.

At this point it might be worth it to articulate the advantages
of the 3 netdev model.

If they are compelling, why wouldn't netvsc users want them?

Alex, I think you were one of the strongest proponents of this model,
you should be well placed to provide a summary.

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-07  2:38                                             ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-07  2:38 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiri Pirko, Stephen Hemminger, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:
> > I definitelly vote for a separate common shared code for both netvsc and
> > virtio_net - even if you use 2 and 3 netdev model, you could share the
> > common code. Strict checks and limitation should be in place.
> 
> Noted. But as I also mentioned there isn't that much "common" code
> between the two models. I think if anything we could probably look at
> peeling out a few bits such as "get_<iface>_bymac" which really would
> become dev_get_by_mac_and_ops in order to find the device for the
> notifiers. I probably wouldn't even put that in our driver and would
> instead put it in the core code since it almost makes more sense
> there. Beyond that sharing becomes much more challenging due to the
> differences in the Rx and Tx paths that build out of the difference
> between the 2 driver and 3 driver models.

At this point it might be worth it to articulate the advantages
of the 3 netdev model.

If they are compelling, why wouldn't netvsc users want them?

Alex, I think you were one of the strongest proponents of this model,
you should be well placed to provide a summary.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-07  2:38                                             ` [virtio-dev] " Michael S. Tsirkin
@ 2018-03-07 17:50                                               ` Alexander Duyck
  -1 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-07 17:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jiri Pirko, Stephen Hemminger, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:
>> > I definitelly vote for a separate common shared code for both netvsc and
>> > virtio_net - even if you use 2 and 3 netdev model, you could share the
>> > common code. Strict checks and limitation should be in place.
>>
>> Noted. But as I also mentioned there isn't that much "common" code
>> between the two models. I think if anything we could probably look at
>> peeling out a few bits such as "get_<iface>_bymac" which really would
>> become dev_get_by_mac_and_ops in order to find the device for the
>> notifiers. I probably wouldn't even put that in our driver and would
>> instead put it in the core code since it almost makes more sense
>> there. Beyond that sharing becomes much more challenging due to the
>> differences in the Rx and Tx paths that build out of the difference
>> between the 2 driver and 3 driver models.
>
> At this point it might be worth it to articulate the advantages
> of the 3 netdev model.

The main advantages are performance on the lower devices. Specifically
we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF
doesn't take a performance hit when we start transmitting through it.
In addition the paravirtual device is able to fully expose it's
features, so for example virtio_net can maintain in-driver XDP, and we
can provide generic XDP on the upper device. We can also have an
asymmetric configuration where the number of queues or feature sets
can be different between the ports.

Basically what it comes down to is in the 3 netdev model we never have
to deal with the issues of going from being a standard netdev to being
a master netdev. The role of each netdev is clearly defined.

As far as doing a generic solution it is the preferred way to go as
well since we don't have to rewrite functionality of the lower device.
Currently the only changes really needed are to add a phys_port_name
operation so that you can distinguish between the bypass interface and
the original paravirtual one. As such you have no confusion about what
you are running.

> If they are compelling, why wouldn't netvsc users want them?

I believe part of the logic for Stephen's choice is that netvsc
doesn't have a "BACKUP" bit like what we have virtio_net. As a result
they have no way of knowing if a VF will ever show up or not. That
makes the 3 netdev solution less appealing as they always end up in
the bonding mode. In addition they have intentionally limited
themselves to avoid features such as XDP that would cause issues with
the 2 netdev model.

> Alex, I think you were one of the strongest proponents of this model,
> you should be well placed to provide a summary.

Hopefully the information provided helps. In my mind the issue is that
netvsc was not designed correctly, and as such it is using the 2
netdev model as a bit of a crutch to handle the case where they wanted
to add a VF bypass. At this point it has become a part of their
architecture so it would be confusing for their user base to deal with
having 2 netdevs spawn every time their driver sees a new device. In
the case of virtio we only spawn 2 netdevs if the backup bit is set
which implies a specific use case. When the bit isn't set we are only
spawning one device, and as a result we can get much better
performance out of the resultant combination of devices, and can
expose functionality such as in-driver XDP in virtio.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-07 17:50                                               ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-07 17:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jiri Pirko, Stephen Hemminger, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:
>> > I definitelly vote for a separate common shared code for both netvsc and
>> > virtio_net - even if you use 2 and 3 netdev model, you could share the
>> > common code. Strict checks and limitation should be in place.
>>
>> Noted. But as I also mentioned there isn't that much "common" code
>> between the two models. I think if anything we could probably look at
>> peeling out a few bits such as "get_<iface>_bymac" which really would
>> become dev_get_by_mac_and_ops in order to find the device for the
>> notifiers. I probably wouldn't even put that in our driver and would
>> instead put it in the core code since it almost makes more sense
>> there. Beyond that sharing becomes much more challenging due to the
>> differences in the Rx and Tx paths that build out of the difference
>> between the 2 driver and 3 driver models.
>
> At this point it might be worth it to articulate the advantages
> of the 3 netdev model.

The main advantages are performance on the lower devices. Specifically
we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF
doesn't take a performance hit when we start transmitting through it.
In addition the paravirtual device is able to fully expose it's
features, so for example virtio_net can maintain in-driver XDP, and we
can provide generic XDP on the upper device. We can also have an
asymmetric configuration where the number of queues or feature sets
can be different between the ports.

Basically what it comes down to is in the 3 netdev model we never have
to deal with the issues of going from being a standard netdev to being
a master netdev. The role of each netdev is clearly defined.

As far as doing a generic solution it is the preferred way to go as
well since we don't have to rewrite functionality of the lower device.
Currently the only changes really needed are to add a phys_port_name
operation so that you can distinguish between the bypass interface and
the original paravirtual one. As such you have no confusion about what
you are running.

> If they are compelling, why wouldn't netvsc users want them?

I believe part of the logic for Stephen's choice is that netvsc
doesn't have a "BACKUP" bit like what we have virtio_net. As a result
they have no way of knowing if a VF will ever show up or not. That
makes the 3 netdev solution less appealing as they always end up in
the bonding mode. In addition they have intentionally limited
themselves to avoid features such as XDP that would cause issues with
the 2 netdev model.

> Alex, I think you were one of the strongest proponents of this model,
> you should be well placed to provide a summary.

Hopefully the information provided helps. In my mind the issue is that
netvsc was not designed correctly, and as such it is using the 2
netdev model as a bit of a crutch to handle the case where they wanted
to add a VF bypass. At this point it has become a part of their
architecture so it would be confusing for their user base to deal with
having 2 netdevs spawn every time their driver sees a new device. In
the case of virtio we only spawn 2 netdevs if the backup bit is set
which implies a specific use case. When the bit isn't set we are only
spawning one device, and as a result we can get much better
performance out of the resultant combination of devices, and can
expose functionality such as in-driver XDP in virtio.

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-07 17:50                                               ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-03-07 18:06                                               ` Stephen Hemminger
  2018-03-07 18:55                                                   ` [virtio-dev] " Alexander Duyck
  2018-03-07 20:11                                                   ` [virtio-dev] " Michael S. Tsirkin
  -1 siblings, 2 replies; 102+ messages in thread
From: Stephen Hemminger @ 2018-03-07 18:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael S. Tsirkin, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Wed, 7 Mar 2018 09:50:50 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:  
> >> > I definitelly vote for a separate common shared code for both netvsc and
> >> > virtio_net - even if you use 2 and 3 netdev model, you could share the
> >> > common code. Strict checks and limitation should be in place.  
> >>
> >> Noted. But as I also mentioned there isn't that much "common" code
> >> between the two models. I think if anything we could probably look at
> >> peeling out a few bits such as "get_<iface>_bymac" which really would
> >> become dev_get_by_mac_and_ops in order to find the device for the
> >> notifiers. I probably wouldn't even put that in our driver and would
> >> instead put it in the core code since it almost makes more sense
> >> there. Beyond that sharing becomes much more challenging due to the
> >> differences in the Rx and Tx paths that build out of the difference
> >> between the 2 driver and 3 driver models.  
> >
> > At this point it might be worth it to articulate the advantages
> > of the 3 netdev model.  
> 
> The main advantages are performance on the lower devices. Specifically
> we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF
> doesn't take a performance hit when we start transmitting through it.
> In addition the paravirtual device is able to fully expose it's
> features, so for example virtio_net can maintain in-driver XDP, and we
> can provide generic XDP on the upper device. We can also have an
> asymmetric configuration where the number of queues or feature sets
> can be different between the ports.
> 
> Basically what it comes down to is in the 3 netdev model we never have
> to deal with the issues of going from being a standard netdev to being
> a master netdev. The role of each netdev is clearly defined.
> 
> As far as doing a generic solution it is the preferred way to go as
> well since we don't have to rewrite functionality of the lower device.
> Currently the only changes really needed are to add a phys_port_name
> operation so that you can distinguish between the bypass interface and
> the original paravirtual one. As such you have no confusion about what
> you are running.
> 
> > If they are compelling, why wouldn't netvsc users want them?  
> 
> I believe part of the logic for Stephen's choice is that netvsc
> doesn't have a "BACKUP" bit like what we have virtio_net. As a result
> they have no way of knowing if a VF will ever show up or not. That
> makes the 3 netdev solution less appealing as they always end up in
> the bonding mode. In addition they have intentionally limited
> themselves to avoid features such as XDP that would cause issues with
> the 2 netdev model.
> 
> > Alex, I think you were one of the strongest proponents of this model,
> > you should be well placed to provide a summary.  
> 
> Hopefully the information provided helps. In my mind the issue is that
> netvsc was not designed correctly, and as such it is using the 2
> netdev model as a bit of a crutch to handle the case where they wanted
> to add a VF bypass. At this point it has become a part of their
> architecture so it would be confusing for their user base to deal with
> having 2 netdevs spawn every time their driver sees a new device. In
> the case of virtio we only spawn 2 netdevs if the backup bit is set
> which implies a specific use case. When the bit isn't set we are only
> spawning one device, and as a result we can get much better
> performance out of the resultant combination of devices, and can
> expose functionality such as in-driver XDP in virtio.


I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc
device when doing passthrough. It didn't help performance much and there
are corner cases that make it painful, like what if qdisc was placed
by user on the netvsc device?  Should the qdisc then be moved back
and forth to the VF device when it arrives or is removed?

As far as XDP support, it is on the plan to support XDP on the netvsc
device since the receive buffers do have to be copied. But there has
been no customer demand for it; and the VF model on Azure has limitations
which make a transparent XDP model pretty much impossible.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-07 18:06                                               ` Stephen Hemminger
@ 2018-03-07 18:55                                                   ` Alexander Duyck
  2018-03-07 20:11                                                   ` [virtio-dev] " Michael S. Tsirkin
  1 sibling, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-07 18:55 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael S. Tsirkin, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Wed, Mar 7, 2018 at 10:06 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Wed, 7 Mar 2018 09:50:50 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:
>> >> > I definitelly vote for a separate common shared code for both netvsc and
>> >> > virtio_net - even if you use 2 and 3 netdev model, you could share the
>> >> > common code. Strict checks and limitation should be in place.
>> >>
>> >> Noted. But as I also mentioned there isn't that much "common" code
>> >> between the two models. I think if anything we could probably look at
>> >> peeling out a few bits such as "get_<iface>_bymac" which really would
>> >> become dev_get_by_mac_and_ops in order to find the device for the
>> >> notifiers. I probably wouldn't even put that in our driver and would
>> >> instead put it in the core code since it almost makes more sense
>> >> there. Beyond that sharing becomes much more challenging due to the
>> >> differences in the Rx and Tx paths that build out of the difference
>> >> between the 2 driver and 3 driver models.
>> >
>> > At this point it might be worth it to articulate the advantages
>> > of the 3 netdev model.
>>
>> The main advantages are performance on the lower devices. Specifically
>> we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF
>> doesn't take a performance hit when we start transmitting through it.
>> In addition the paravirtual device is able to fully expose it's
>> features, so for example virtio_net can maintain in-driver XDP, and we
>> can provide generic XDP on the upper device. We can also have an
>> asymmetric configuration where the number of queues or feature sets
>> can be different between the ports.
>>
>> Basically what it comes down to is in the 3 netdev model we never have
>> to deal with the issues of going from being a standard netdev to being
>> a master netdev. The role of each netdev is clearly defined.
>>
>> As far as doing a generic solution it is the preferred way to go as
>> well since we don't have to rewrite functionality of the lower device.
>> Currently the only changes really needed are to add a phys_port_name
>> operation so that you can distinguish between the bypass interface and
>> the original paravirtual one. As such you have no confusion about what
>> you are running.
>>
>> > If they are compelling, why wouldn't netvsc users want them?
>>
>> I believe part of the logic for Stephen's choice is that netvsc
>> doesn't have a "BACKUP" bit like what we have virtio_net. As a result
>> they have no way of knowing if a VF will ever show up or not. That
>> makes the 3 netdev solution less appealing as they always end up in
>> the bonding mode. In addition they have intentionally limited
>> themselves to avoid features such as XDP that would cause issues with
>> the 2 netdev model.
>>
>> > Alex, I think you were one of the strongest proponents of this model,
>> > you should be well placed to provide a summary.
>>
>> Hopefully the information provided helps. In my mind the issue is that
>> netvsc was not designed correctly, and as such it is using the 2
>> netdev model as a bit of a crutch to handle the case where they wanted
>> to add a VF bypass. At this point it has become a part of their
>> architecture so it would be confusing for their user base to deal with
>> having 2 netdevs spawn every time their driver sees a new device. In
>> the case of virtio we only spawn 2 netdevs if the backup bit is set
>> which implies a specific use case. When the bit isn't set we are only
>> spawning one device, and as a result we can get much better
>> performance out of the resultant combination of devices, and can
>> expose functionality such as in-driver XDP in virtio.
>
>
> I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc
> device when doing passthrough. It didn't help performance much and there
> are corner cases that make it painful, like what if qdisc was placed
> by user on the netvsc device?  Should the qdisc then be moved back
> and forth to the VF device when it arrives or is removed?

The advantages of IFF_NOQUEUE and LLTX would mostly be seen on small
packets, and I wouldn't risk toggling them on/off as doing so with
packets in flight might lead to odd results. That was one of the
reasons why I wanted to just stick with the 3 netdev model for us, and
leave you guys running with the 2 netdev model.

> As far as XDP support, it is on the plan to support XDP on the netvsc
> device since the receive buffers do have to be copied. But there has
> been no customer demand for it; and the VF model on Azure has limitations
> which make a transparent XDP model pretty much impossible.

I wasn't saying you needed to implement XDP in netvsc. My point was
that in virtio we already have it and it would be confusing for us to
have a virtio driver that had to drop support for it in order to
support the 2 netdev model.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-07 18:55                                                   ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2018-03-07 18:55 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael S. Tsirkin, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Wed, Mar 7, 2018 at 10:06 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Wed, 7 Mar 2018 09:50:50 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:
>> >> > I definitelly vote for a separate common shared code for both netvsc and
>> >> > virtio_net - even if you use 2 and 3 netdev model, you could share the
>> >> > common code. Strict checks and limitation should be in place.
>> >>
>> >> Noted. But as I also mentioned there isn't that much "common" code
>> >> between the two models. I think if anything we could probably look at
>> >> peeling out a few bits such as "get_<iface>_bymac" which really would
>> >> become dev_get_by_mac_and_ops in order to find the device for the
>> >> notifiers. I probably wouldn't even put that in our driver and would
>> >> instead put it in the core code since it almost makes more sense
>> >> there. Beyond that sharing becomes much more challenging due to the
>> >> differences in the Rx and Tx paths that build out of the difference
>> >> between the 2 driver and 3 driver models.
>> >
>> > At this point it might be worth it to articulate the advantages
>> > of the 3 netdev model.
>>
>> The main advantages are performance on the lower devices. Specifically
>> we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF
>> doesn't take a performance hit when we start transmitting through it.
>> In addition the paravirtual device is able to fully expose it's
>> features, so for example virtio_net can maintain in-driver XDP, and we
>> can provide generic XDP on the upper device. We can also have an
>> asymmetric configuration where the number of queues or feature sets
>> can be different between the ports.
>>
>> Basically what it comes down to is in the 3 netdev model we never have
>> to deal with the issues of going from being a standard netdev to being
>> a master netdev. The role of each netdev is clearly defined.
>>
>> As far as doing a generic solution it is the preferred way to go as
>> well since we don't have to rewrite functionality of the lower device.
>> Currently the only changes really needed are to add a phys_port_name
>> operation so that you can distinguish between the bypass interface and
>> the original paravirtual one. As such you have no confusion about what
>> you are running.
>>
>> > If they are compelling, why wouldn't netvsc users want them?
>>
>> I believe part of the logic for Stephen's choice is that netvsc
>> doesn't have a "BACKUP" bit like what we have virtio_net. As a result
>> they have no way of knowing if a VF will ever show up or not. That
>> makes the 3 netdev solution less appealing as they always end up in
>> the bonding mode. In addition they have intentionally limited
>> themselves to avoid features such as XDP that would cause issues with
>> the 2 netdev model.
>>
>> > Alex, I think you were one of the strongest proponents of this model,
>> > you should be well placed to provide a summary.
>>
>> Hopefully the information provided helps. In my mind the issue is that
>> netvsc was not designed correctly, and as such it is using the 2
>> netdev model as a bit of a crutch to handle the case where they wanted
>> to add a VF bypass. At this point it has become a part of their
>> architecture so it would be confusing for their user base to deal with
>> having 2 netdevs spawn every time their driver sees a new device. In
>> the case of virtio we only spawn 2 netdevs if the backup bit is set
>> which implies a specific use case. When the bit isn't set we are only
>> spawning one device, and as a result we can get much better
>> performance out of the resultant combination of devices, and can
>> expose functionality such as in-driver XDP in virtio.
>
>
> I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc
> device when doing passthrough. It didn't help performance much and there
> are corner cases that make it painful, like what if qdisc was placed
> by user on the netvsc device?  Should the qdisc then be moved back
> and forth to the VF device when it arrives or is removed?

The advantages of IFF_NOQUEUE and LLTX would mostly be seen on small
packets, and I wouldn't risk toggling them on/off as doing so with
packets in flight might lead to odd results. That was one of the
reasons why I wanted to just stick with the 3 netdev model for us, and
leave you guys running with the 2 netdev model.

> As far as XDP support, it is on the plan to support XDP on the netvsc
> device since the receive buffers do have to be copied. But there has
> been no customer demand for it; and the VF model on Azure has limitations
> which make a transparent XDP model pretty much impossible.

I wasn't saying you needed to implement XDP in netvsc. My point was
that in virtio we already have it and it would be confusing for us to
have a virtio driver that had to drop support for it in order to
support the 2 netdev model.

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-07 18:06                                               ` Stephen Hemminger
@ 2018-03-07 20:11                                                   ` Michael S. Tsirkin
  2018-03-07 20:11                                                   ` [virtio-dev] " Michael S. Tsirkin
  1 sibling, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-07 20:11 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Alexander Duyck, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Wed, Mar 07, 2018 at 10:06:30AM -0800, Stephen Hemminger wrote:
> On Wed, 7 Mar 2018 09:50:50 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
> 
> > On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:  
> > >> > I definitelly vote for a separate common shared code for both netvsc and
> > >> > virtio_net - even if you use 2 and 3 netdev model, you could share the
> > >> > common code. Strict checks and limitation should be in place.  
> > >>
> > >> Noted. But as I also mentioned there isn't that much "common" code
> > >> between the two models. I think if anything we could probably look at
> > >> peeling out a few bits such as "get_<iface>_bymac" which really would
> > >> become dev_get_by_mac_and_ops in order to find the device for the
> > >> notifiers. I probably wouldn't even put that in our driver and would
> > >> instead put it in the core code since it almost makes more sense
> > >> there. Beyond that sharing becomes much more challenging due to the
> > >> differences in the Rx and Tx paths that build out of the difference
> > >> between the 2 driver and 3 driver models.  
> > >
> > > At this point it might be worth it to articulate the advantages
> > > of the 3 netdev model.  
> > 
> > The main advantages are performance on the lower devices. Specifically
> > we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF
> > doesn't take a performance hit when we start transmitting through it.
> > In addition the paravirtual device is able to fully expose it's
> > features, so for example virtio_net can maintain in-driver XDP, and we
> > can provide generic XDP on the upper device. We can also have an
> > asymmetric configuration where the number of queues or feature sets
> > can be different between the ports.
> > 
> > Basically what it comes down to is in the 3 netdev model we never have
> > to deal with the issues of going from being a standard netdev to being
> > a master netdev. The role of each netdev is clearly defined.
> > 
> > As far as doing a generic solution it is the preferred way to go as
> > well since we don't have to rewrite functionality of the lower device.
> > Currently the only changes really needed are to add a phys_port_name
> > operation so that you can distinguish between the bypass interface and
> > the original paravirtual one. As such you have no confusion about what
> > you are running.
> > 
> > > If they are compelling, why wouldn't netvsc users want them?  
> > 
> > I believe part of the logic for Stephen's choice is that netvsc
> > doesn't have a "BACKUP" bit like what we have virtio_net. As a result
> > they have no way of knowing if a VF will ever show up or not. That
> > makes the 3 netdev solution less appealing as they always end up in
> > the bonding mode. In addition they have intentionally limited
> > themselves to avoid features such as XDP that would cause issues with
> > the 2 netdev model.
> > 
> > > Alex, I think you were one of the strongest proponents of this model,
> > > you should be well placed to provide a summary.  
> > 
> > Hopefully the information provided helps. In my mind the issue is that
> > netvsc was not designed correctly, and as such it is using the 2
> > netdev model as a bit of a crutch to handle the case where they wanted
> > to add a VF bypass. At this point it has become a part of their
> > architecture so it would be confusing for their user base to deal with
> > having 2 netdevs spawn every time their driver sees a new device. In
> > the case of virtio we only spawn 2 netdevs if the backup bit is set
> > which implies a specific use case. When the bit isn't set we are only
> > spawning one device, and as a result we can get much better
> > performance out of the resultant combination of devices, and can
> > expose functionality such as in-driver XDP in virtio.
> 
> 
> I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc
> device when doing passthrough. It didn't help performance much and there
> are corner cases that make it painful, like what if qdisc was placed
> by user on the netvsc device?  Should the qdisc then be moved back
> and forth to the VF device when it arrives or is removed?
> 
> As far as XDP support, it is on the plan to support XDP on the netvsc
> device since the receive buffers do have to be copied. But there has
> been no customer demand for it; and the VF model on Azure has limitations
> which make a transparent XDP model pretty much impossible.


Jiri, does that answer your question? Even though there is some
similarity, the needs of netvsc and virtio appear significantly
different.

We do not yet know whether other PV devices will be virtio-like
or netvsc-like.

Do you agree?

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-07 20:11                                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-07 20:11 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Alexander Duyck, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtio-dev, Brandeburg, Jesse, Duyck, Alexander H,
	Jakub Kicinski

On Wed, Mar 07, 2018 at 10:06:30AM -0800, Stephen Hemminger wrote:
> On Wed, 7 Mar 2018 09:50:50 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
> 
> > On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:  
> > >> > I definitelly vote for a separate common shared code for both netvsc and
> > >> > virtio_net - even if you use 2 and 3 netdev model, you could share the
> > >> > common code. Strict checks and limitation should be in place.  
> > >>
> > >> Noted. But as I also mentioned there isn't that much "common" code
> > >> between the two models. I think if anything we could probably look at
> > >> peeling out a few bits such as "get_<iface>_bymac" which really would
> > >> become dev_get_by_mac_and_ops in order to find the device for the
> > >> notifiers. I probably wouldn't even put that in our driver and would
> > >> instead put it in the core code since it almost makes more sense
> > >> there. Beyond that sharing becomes much more challenging due to the
> > >> differences in the Rx and Tx paths that build out of the difference
> > >> between the 2 driver and 3 driver models.  
> > >
> > > At this point it might be worth it to articulate the advantages
> > > of the 3 netdev model.  
> > 
> > The main advantages are performance on the lower devices. Specifically
> > we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF
> > doesn't take a performance hit when we start transmitting through it.
> > In addition the paravirtual device is able to fully expose it's
> > features, so for example virtio_net can maintain in-driver XDP, and we
> > can provide generic XDP on the upper device. We can also have an
> > asymmetric configuration where the number of queues or feature sets
> > can be different between the ports.
> > 
> > Basically what it comes down to is in the 3 netdev model we never have
> > to deal with the issues of going from being a standard netdev to being
> > a master netdev. The role of each netdev is clearly defined.
> > 
> > As far as doing a generic solution it is the preferred way to go as
> > well since we don't have to rewrite functionality of the lower device.
> > Currently the only changes really needed are to add a phys_port_name
> > operation so that you can distinguish between the bypass interface and
> > the original paravirtual one. As such you have no confusion about what
> > you are running.
> > 
> > > If they are compelling, why wouldn't netvsc users want them?  
> > 
> > I believe part of the logic for Stephen's choice is that netvsc
> > doesn't have a "BACKUP" bit like what we have virtio_net. As a result
> > they have no way of knowing if a VF will ever show up or not. That
> > makes the 3 netdev solution less appealing as they always end up in
> > the bonding mode. In addition they have intentionally limited
> > themselves to avoid features such as XDP that would cause issues with
> > the 2 netdev model.
> > 
> > > Alex, I think you were one of the strongest proponents of this model,
> > > you should be well placed to provide a summary.  
> > 
> > Hopefully the information provided helps. In my mind the issue is that
> > netvsc was not designed correctly, and as such it is using the 2
> > netdev model as a bit of a crutch to handle the case where they wanted
> > to add a VF bypass. At this point it has become a part of their
> > architecture so it would be confusing for their user base to deal with
> > having 2 netdevs spawn every time their driver sees a new device. In
> > the case of virtio we only spawn 2 netdevs if the backup bit is set
> > which implies a specific use case. When the bit isn't set we are only
> > spawning one device, and as a result we can get much better
> > performance out of the resultant combination of devices, and can
> > expose functionality such as in-driver XDP in virtio.
> 
> 
> I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc
> device when doing passthrough. It didn't help performance much and there
> are corner cases that make it painful, like what if qdisc was placed
> by user on the netvsc device?  Should the qdisc then be moved back
> and forth to the VF device when it arrives or is removed?
> 
> As far as XDP support, it is on the plan to support XDP on the netvsc
> device since the receive buffers do have to be copied. But there has
> been no customer demand for it; and the VF model on Azure has limitations
> which make a transparent XDP model pretty much impossible.


Jiri, does that answer your question? Even though there is some
similarity, the needs of netvsc and virtio appear significantly
different.

We do not yet know whether other PV devices will be virtio-like
or netvsc-like.

Do you agree?

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-07 20:11                                                   ` [virtio-dev] " Michael S. Tsirkin
@ 2018-03-12 18:47                                                     ` Samudrala, Sridhar
  -1 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-12 18:47 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stephen Hemminger
  Cc: Alexander Duyck, Jiri Pirko, David Miller, Netdev, virtio-dev,
	Brandeburg, Jesse, Duyck, Alexander H, Jakub Kicinski

On 3/7/2018 12:11 PM, Michael S. Tsirkin wrote:
> On Wed, Mar 07, 2018 at 10:06:30AM -0800, Stephen Hemminger wrote:
>> On Wed, 7 Mar 2018 09:50:50 -0800
>> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>
>>> On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:
>>>>>> I definitelly vote for a separate common shared code for both netvsc and
>>>>>> virtio_net - even if you use 2 and 3 netdev model, you could share the
>>>>>> common code. Strict checks and limitation should be in place.
>>>>> Noted. But as I also mentioned there isn't that much "common" code
>>>>> between the two models. I think if anything we could probably look at
>>>>> peeling out a few bits such as "get_<iface>_bymac" which really would
>>>>> become dev_get_by_mac_and_ops in order to find the device for the
>>>>> notifiers. I probably wouldn't even put that in our driver and would
>>>>> instead put it in the core code since it almost makes more sense
>>>>> there. Beyond that sharing becomes much more challenging due to the
>>>>> differences in the Rx and Tx paths that build out of the difference
>>>>> between the 2 driver and 3 driver models.
>>>> At this point it might be worth it to articulate the advantages
>>>> of the 3 netdev model.
>>> The main advantages are performance on the lower devices. Specifically
>>> we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF
>>> doesn't take a performance hit when we start transmitting through it.
>>> In addition the paravirtual device is able to fully expose it's
>>> features, so for example virtio_net can maintain in-driver XDP, and we
>>> can provide generic XDP on the upper device. We can also have an
>>> asymmetric configuration where the number of queues or feature sets
>>> can be different between the ports.
>>>
>>> Basically what it comes down to is in the 3 netdev model we never have
>>> to deal with the issues of going from being a standard netdev to being
>>> a master netdev. The role of each netdev is clearly defined.
>>>
>>> As far as doing a generic solution it is the preferred way to go as
>>> well since we don't have to rewrite functionality of the lower device.
>>> Currently the only changes really needed are to add a phys_port_name
>>> operation so that you can distinguish between the bypass interface and
>>> the original paravirtual one. As such you have no confusion about what
>>> you are running.
>>>
>>>> If they are compelling, why wouldn't netvsc users want them?
>>> I believe part of the logic for Stephen's choice is that netvsc
>>> doesn't have a "BACKUP" bit like what we have virtio_net. As a result
>>> they have no way of knowing if a VF will ever show up or not. That
>>> makes the 3 netdev solution less appealing as they always end up in
>>> the bonding mode. In addition they have intentionally limited
>>> themselves to avoid features such as XDP that would cause issues with
>>> the 2 netdev model.
>>>
>>>> Alex, I think you were one of the strongest proponents of this model,
>>>> you should be well placed to provide a summary.
>>> Hopefully the information provided helps. In my mind the issue is that
>>> netvsc was not designed correctly, and as such it is using the 2
>>> netdev model as a bit of a crutch to handle the case where they wanted
>>> to add a VF bypass. At this point it has become a part of their
>>> architecture so it would be confusing for their user base to deal with
>>> having 2 netdevs spawn every time their driver sees a new device. In
>>> the case of virtio we only spawn 2 netdevs if the backup bit is set
>>> which implies a specific use case. When the bit isn't set we are only
>>> spawning one device, and as a result we can get much better
>>> performance out of the resultant combination of devices, and can
>>> expose functionality such as in-driver XDP in virtio.
>>
>> I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc
>> device when doing passthrough. It didn't help performance much and there
>> are corner cases that make it painful, like what if qdisc was placed
>> by user on the netvsc device?  Should the qdisc then be moved back
>> and forth to the VF device when it arrives or is removed?
>>
>> As far as XDP support, it is on the plan to support XDP on the netvsc
>> device since the receive buffers do have to be copied. But there has
>> been no customer demand for it; and the VF model on Azure has limitations
>> which make a transparent XDP model pretty much impossible.
>
> Jiri, does that answer your question? Even though there is some
> similarity, the needs of netvsc and virtio appear significantly
> different.
>
> We do not yet know whether other PV devices will be virtio-like
> or netvsc-like.
>
> Do you agree?

Michael,

At this point can we agree to go with 3 netdev model for virtio and as 
this is
not compatible with netvsc's 2 netdev model,  the scope for any common code
between virtio and netvsc is very limited.

Should i submit a v5 version of this patch series with some minor 
fixes?  Or can
you pull in this series and i can submit patches on top of that?

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-12 18:47                                                     ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-12 18:47 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stephen Hemminger
  Cc: Alexander Duyck, Jiri Pirko, David Miller, Netdev, virtio-dev,
	Brandeburg, Jesse, Duyck, Alexander H, Jakub Kicinski

On 3/7/2018 12:11 PM, Michael S. Tsirkin wrote:
> On Wed, Mar 07, 2018 at 10:06:30AM -0800, Stephen Hemminger wrote:
>> On Wed, 7 Mar 2018 09:50:50 -0800
>> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>
>>> On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote:
>>>>>> I definitelly vote for a separate common shared code for both netvsc and
>>>>>> virtio_net - even if you use 2 and 3 netdev model, you could share the
>>>>>> common code. Strict checks and limitation should be in place.
>>>>> Noted. But as I also mentioned there isn't that much "common" code
>>>>> between the two models. I think if anything we could probably look at
>>>>> peeling out a few bits such as "get_<iface>_bymac" which really would
>>>>> become dev_get_by_mac_and_ops in order to find the device for the
>>>>> notifiers. I probably wouldn't even put that in our driver and would
>>>>> instead put it in the core code since it almost makes more sense
>>>>> there. Beyond that sharing becomes much more challenging due to the
>>>>> differences in the Rx and Tx paths that build out of the difference
>>>>> between the 2 driver and 3 driver models.
>>>> At this point it might be worth it to articulate the advantages
>>>> of the 3 netdev model.
>>> The main advantages are performance on the lower devices. Specifically
>>> we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF
>>> doesn't take a performance hit when we start transmitting through it.
>>> In addition the paravirtual device is able to fully expose it's
>>> features, so for example virtio_net can maintain in-driver XDP, and we
>>> can provide generic XDP on the upper device. We can also have an
>>> asymmetric configuration where the number of queues or feature sets
>>> can be different between the ports.
>>>
>>> Basically what it comes down to is in the 3 netdev model we never have
>>> to deal with the issues of going from being a standard netdev to being
>>> a master netdev. The role of each netdev is clearly defined.
>>>
>>> As far as doing a generic solution it is the preferred way to go as
>>> well since we don't have to rewrite functionality of the lower device.
>>> Currently the only changes really needed are to add a phys_port_name
>>> operation so that you can distinguish between the bypass interface and
>>> the original paravirtual one. As such you have no confusion about what
>>> you are running.
>>>
>>>> If they are compelling, why wouldn't netvsc users want them?
>>> I believe part of the logic for Stephen's choice is that netvsc
>>> doesn't have a "BACKUP" bit like what we have virtio_net. As a result
>>> they have no way of knowing if a VF will ever show up or not. That
>>> makes the 3 netdev solution less appealing as they always end up in
>>> the bonding mode. In addition they have intentionally limited
>>> themselves to avoid features such as XDP that would cause issues with
>>> the 2 netdev model.
>>>
>>>> Alex, I think you were one of the strongest proponents of this model,
>>>> you should be well placed to provide a summary.
>>> Hopefully the information provided helps. In my mind the issue is that
>>> netvsc was not designed correctly, and as such it is using the 2
>>> netdev model as a bit of a crutch to handle the case where they wanted
>>> to add a VF bypass. At this point it has become a part of their
>>> architecture so it would be confusing for their user base to deal with
>>> having 2 netdevs spawn every time their driver sees a new device. In
>>> the case of virtio we only spawn 2 netdevs if the backup bit is set
>>> which implies a specific use case. When the bit isn't set we are only
>>> spawning one device, and as a result we can get much better
>>> performance out of the resultant combination of devices, and can
>>> expose functionality such as in-driver XDP in virtio.
>>
>> I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc
>> device when doing passthrough. It didn't help performance much and there
>> are corner cases that make it painful, like what if qdisc was placed
>> by user on the netvsc device?  Should the qdisc then be moved back
>> and forth to the VF device when it arrives or is removed?
>>
>> As far as XDP support, it is on the plan to support XDP on the netvsc
>> device since the receive buffers do have to be copied. But there has
>> been no customer demand for it; and the VF model on Azure has limitations
>> which make a transparent XDP model pretty much impossible.
>
> Jiri, does that answer your question? Even though there is some
> similarity, the needs of netvsc and virtio appear significantly
> different.
>
> We do not yet know whether other PV devices will be virtio-like
> or netvsc-like.
>
> Do you agree?

Michael,

At this point can we agree to go with 3 netdev model for virtio and as 
this is
not compatible with netvsc's 2 netdev model,  the scope for any common code
between virtio and netvsc is very limited.

Should i submit a v5 version of this patch series with some minor 
fixes?  Or can
you pull in this series and i can submit patches on top of that?

Thanks
Sridhar

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-01 20:08   ` [virtio-dev] " Sridhar Samudrala
                     ` (2 preceding siblings ...)
  (?)
@ 2018-03-12 20:12   ` Jiri Pirko
  2018-03-12 20:58       ` [virtio-dev] " Samudrala, Sridhar
  -1 siblings, 1 reply; 102+ messages in thread
From: Jiri Pirko @ 2018-03-12 20:12 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici

Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF
>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to
>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>created that acts as a master device and tracks the state of the 2 lower
>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>passthru device with the same MAC is registered as 'active' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

[...]


>+static int virtnet_bypass_register_child(struct net_device *child_netdev)
>+{
>+	struct virtnet_bypass_info *vbi;
>+	struct net_device *dev;
>+	bool backup;
>+	int ret;
>+
>+	if (child_netdev->addr_len != ETH_ALEN)
>+		return NOTIFY_DONE;
>+
>+	/* We will use the MAC address to locate the virtnet_bypass netdev
>+	 * to associate with the child netdev. If we don't find a matching
>+	 * bypass netdev, move on.
>+	 */
>+	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>+				       child_netdev->perm_addr);
>+	if (!dev)
>+		return NOTIFY_DONE;
>+
>+	vbi = netdev_priv(dev);
>+	backup = (child_netdev->dev.parent == dev->dev.parent);
>+	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>+			rtnl_dereference(vbi->active_netdev)) {
>+		netdev_info(dev,
>+			    "%s attempting to join bypass dev when %s already present\n",
>+			    child_netdev->name, backup ? "backup" : "active");
>+		return NOTIFY_DONE;
>+	}
>+
>+	/* Avoid non pci devices as active netdev */
>+	if (!backup && (!child_netdev->dev.parent ||
>+			!dev_is_pci(child_netdev->dev.parent)))
>+		return NOTIFY_DONE;
>+
>+	ret = netdev_rx_handler_register(child_netdev,
>+					 virtnet_bypass_handle_frame, dev);
>+	if (ret != 0) {
>+		netdev_err(child_netdev,
>+			   "can not register bypass receive handler (err = %d)\n",
>+			   ret);
>+		goto rx_handler_failed;
>+	}
>+
>+	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>+	if (ret != 0) {
>+		netdev_err(child_netdev,
>+			   "can not set master device %s (err = %d)\n",
>+			   dev->name, ret);
>+		goto upper_link_failed;
>+	}
>+
>+	child_netdev->flags |= IFF_SLAVE;
>+
>+	if (netif_running(dev)) {
>+		ret = dev_open(child_netdev);
>+		if (ret && (ret != -EBUSY)) {
>+			netdev_err(dev, "Opening child %s failed ret:%d\n",
>+				   child_netdev->name, ret);
>+			goto err_interface_up;
>+		}
>+	}

Much of this function is copy of netvsc_vf_join, should be shared with
netvsc.


>+
>+	/* Align MTU of child with master */
>+	ret = dev_set_mtu(child_netdev, dev->mtu);
>+	if (ret) {
>+		netdev_err(dev,
>+			   "unable to change mtu of %s to %u register failed\n",
>+			   child_netdev->name, dev->mtu);
>+		goto err_set_mtu;
>+	}
>+
>+	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>+
>+	netdev_info(dev, "registering %s\n", child_netdev->name);
>+
>+	dev_hold(child_netdev);
>+	if (backup) {
>+		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>+		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>+	} else {
>+		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>+		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>+		dev->min_mtu = child_netdev->min_mtu;
>+		dev->max_mtu = child_netdev->max_mtu;
>+	}
>+
>+	return NOTIFY_OK;
>+
>+err_set_mtu:
>+	dev_close(child_netdev);
>+err_interface_up:
>+	netdev_upper_dev_unlink(child_netdev, dev);
>+	child_netdev->flags &= ~IFF_SLAVE;
>+upper_link_failed:
>+	netdev_rx_handler_unregister(child_netdev);
>+rx_handler_failed:
>+	return NOTIFY_DONE;
>+}
>+
>+static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>+{
>+	struct virtnet_bypass_info *vbi;
>+	struct net_device *dev, *backup;
>+
>+	dev = get_virtnet_bypass_byref(child_netdev);
>+	if (!dev)
>+		return NOTIFY_DONE;
>+
>+	vbi = netdev_priv(dev);
>+
>+	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>+
>+	netdev_rx_handler_unregister(child_netdev);
>+	netdev_upper_dev_unlink(child_netdev, dev);
>+	child_netdev->flags &= ~IFF_SLAVE;
>+
>+	if (child_netdev->dev.parent == dev->dev.parent) {
>+		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>+	} else {
>+		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>+		backup = rtnl_dereference(vbi->backup_netdev);
>+		if (backup) {
>+			dev->min_mtu = backup->min_mtu;
>+			dev->max_mtu = backup->max_mtu;
>+		}
>+	}
>+
>+	dev_put(child_netdev);
>+
>+	return NOTIFY_OK;
>+}
>+
>+static int virtnet_bypass_update_link(struct net_device *child_netdev)
>+{
>+	struct net_device *dev, *active, *backup;
>+	struct virtnet_bypass_info *vbi;
>+
>+	dev = get_virtnet_bypass_byref(child_netdev);
>+	if (!dev || !netif_running(dev))
>+		return NOTIFY_DONE;
>+
>+	vbi = netdev_priv(dev);
>+
>+	active = rtnl_dereference(vbi->active_netdev);
>+	backup = rtnl_dereference(vbi->backup_netdev);
>+
>+	if ((active && virtnet_bypass_xmit_ready(active)) ||
>+	    (backup && virtnet_bypass_xmit_ready(backup))) {
>+		netif_carrier_on(dev);
>+		netif_tx_wake_all_queues(dev);
>+	} else {
>+		netif_carrier_off(dev);
>+		netif_tx_stop_all_queues(dev);
>+	}
>+
>+	return NOTIFY_OK;
>+}
>+
>+static int virtnet_bypass_event(struct notifier_block *this,
>+				unsigned long event, void *ptr)
>+{
>+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>+
>+	/* Skip our own events */
>+	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>+		return NOTIFY_DONE;
>+
>+	/* Avoid non-Ethernet type devices */
>+	if (event_dev->type != ARPHRD_ETHER)
>+		return NOTIFY_DONE;
>+
>+	/* Avoid Vlan dev with same MAC registering as child dev */
>+	if (is_vlan_dev(event_dev))
>+		return NOTIFY_DONE;
>+
>+	/* Avoid Bonding master dev with same MAC registering as child dev */
>+	if ((event_dev->priv_flags & IFF_BONDING) &&
>+	    (event_dev->flags & IFF_MASTER))
>+		return NOTIFY_DONE;
>+
>+	switch (event) {
>+	case NETDEV_REGISTER:
>+		return virtnet_bypass_register_child(event_dev);
>+	case NETDEV_UNREGISTER:
>+		return virtnet_bypass_unregister_child(event_dev);
>+	case NETDEV_UP:
>+	case NETDEV_DOWN:
>+	case NETDEV_CHANGE:
>+		return virtnet_bypass_update_link(event_dev);
>+	default:
>+		return NOTIFY_DONE;
>+	}
>+}

For example this function is 1:1 copy of netvsc, even with comments
and bugs (like IFF_BODING check).

This is also something that should be shared with netvsc.


>+
>+static struct notifier_block virtnet_bypass_notifier = {
>+	.notifier_call = virtnet_bypass_event,
>+};
>+
>+static int virtnet_bypass_create(struct virtnet_info *vi)
>+{
>+	struct net_device *backup_netdev = vi->dev;
>+	struct device *dev = &vi->vdev->dev;
>+	struct net_device *bypass_netdev;
>+	int res;
>+
>+	/* Alloc at least 2 queues, for now we are going with 16 assuming
>+	 * that most devices being bonded won't have too many queues.
>+	 */
>+	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>+					  16);
>+	if (!bypass_netdev) {
>+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>+		return -ENOMEM;
>+	}
>+
>+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>+	SET_NETDEV_DEV(bypass_netdev, dev);
>+
>+	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>+	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>+
>+	/* Initialize the device options */
>+	bypass_netdev->flags |= IFF_MASTER;
>+	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |

No clue why you set IFF_BONDING here...

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-12 20:12   ` Jiri Pirko
@ 2018-03-12 20:58       ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-12 20:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici



On 3/12/2018 1:12 PM, Jiri Pirko wrote:
> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> [...]
>
>
>> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>> +{
>> +	struct virtnet_bypass_info *vbi;
>> +	struct net_device *dev;
>> +	bool backup;
>> +	int ret;
>> +
>> +	if (child_netdev->addr_len != ETH_ALEN)
>> +		return NOTIFY_DONE;
>> +
>> +	/* We will use the MAC address to locate the virtnet_bypass netdev
>> +	 * to associate with the child netdev. If we don't find a matching
>> +	 * bypass netdev, move on.
>> +	 */
>> +	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>> +				       child_netdev->perm_addr);
>> +	if (!dev)
>> +		return NOTIFY_DONE;
>> +
>> +	vbi = netdev_priv(dev);
>> +	backup = (child_netdev->dev.parent == dev->dev.parent);
>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> +			rtnl_dereference(vbi->active_netdev)) {
>> +		netdev_info(dev,
>> +			    "%s attempting to join bypass dev when %s already present\n",
>> +			    child_netdev->name, backup ? "backup" : "active");
>> +		return NOTIFY_DONE;
>> +	}
>> +
>> +	/* Avoid non pci devices as active netdev */
>> +	if (!backup && (!child_netdev->dev.parent ||
>> +			!dev_is_pci(child_netdev->dev.parent)))
>> +		return NOTIFY_DONE;
>> +
>> +	ret = netdev_rx_handler_register(child_netdev,
>> +					 virtnet_bypass_handle_frame, dev);
>> +	if (ret != 0) {
>> +		netdev_err(child_netdev,
>> +			   "can not register bypass receive handler (err = %d)\n",
>> +			   ret);
>> +		goto rx_handler_failed;
>> +	}
>> +
>> +	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>> +	if (ret != 0) {
>> +		netdev_err(child_netdev,
>> +			   "can not set master device %s (err = %d)\n",
>> +			   dev->name, ret);
>> +		goto upper_link_failed;
>> +	}
>> +
>> +	child_netdev->flags |= IFF_SLAVE;
>> +
>> +	if (netif_running(dev)) {
>> +		ret = dev_open(child_netdev);
>> +		if (ret && (ret != -EBUSY)) {
>> +			netdev_err(dev, "Opening child %s failed ret:%d\n",
>> +				   child_netdev->name, ret);
>> +			goto err_interface_up;
>> +		}
>> +	}
> Much of this function is copy of netvsc_vf_join, should be shared with
> netvsc.

Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified
to handle registration of 2 child netdevs(backup virtio and active vf).  In case
of netvsc, they only register VF. There is no backup netdev.
I think trying to make this code shareable with netvsc will introduce additional
checks and make it convoluted.


>
>
>> +
>> +	/* Align MTU of child with master */
>> +	ret = dev_set_mtu(child_netdev, dev->mtu);
>> +	if (ret) {
>> +		netdev_err(dev,
>> +			   "unable to change mtu of %s to %u register failed\n",
>> +			   child_netdev->name, dev->mtu);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>> +
>> +	netdev_info(dev, "registering %s\n", child_netdev->name);
>> +
>> +	dev_hold(child_netdev);
>> +	if (backup) {
>> +		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>> +		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>> +	} else {
>> +		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>> +		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>> +		dev->min_mtu = child_netdev->min_mtu;
>> +		dev->max_mtu = child_netdev->max_mtu;
>> +	}
>> +
>> +	return NOTIFY_OK;
>> +
>> +err_set_mtu:
>> +	dev_close(child_netdev);
>> +err_interface_up:
>> +	netdev_upper_dev_unlink(child_netdev, dev);
>> +	child_netdev->flags &= ~IFF_SLAVE;
>> +upper_link_failed:
>> +	netdev_rx_handler_unregister(child_netdev);
>> +rx_handler_failed:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>> +{
>> +	struct virtnet_bypass_info *vbi;
>> +	struct net_device *dev, *backup;
>> +
>> +	dev = get_virtnet_bypass_byref(child_netdev);
>> +	if (!dev)
>> +		return NOTIFY_DONE;
>> +
>> +	vbi = netdev_priv(dev);
>> +
>> +	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>> +
>> +	netdev_rx_handler_unregister(child_netdev);
>> +	netdev_upper_dev_unlink(child_netdev, dev);
>> +	child_netdev->flags &= ~IFF_SLAVE;
>> +
>> +	if (child_netdev->dev.parent == dev->dev.parent) {
>> +		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>> +	} else {
>> +		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>> +		backup = rtnl_dereference(vbi->backup_netdev);
>> +		if (backup) {
>> +			dev->min_mtu = backup->min_mtu;
>> +			dev->max_mtu = backup->max_mtu;
>> +		}
>> +	}
>> +
>> +	dev_put(child_netdev);
>> +
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static int virtnet_bypass_update_link(struct net_device *child_netdev)
>> +{
>> +	struct net_device *dev, *active, *backup;
>> +	struct virtnet_bypass_info *vbi;
>> +
>> +	dev = get_virtnet_bypass_byref(child_netdev);
>> +	if (!dev || !netif_running(dev))
>> +		return NOTIFY_DONE;
>> +
>> +	vbi = netdev_priv(dev);
>> +
>> +	active = rtnl_dereference(vbi->active_netdev);
>> +	backup = rtnl_dereference(vbi->backup_netdev);
>> +
>> +	if ((active && virtnet_bypass_xmit_ready(active)) ||
>> +	    (backup && virtnet_bypass_xmit_ready(backup))) {
>> +		netif_carrier_on(dev);
>> +		netif_tx_wake_all_queues(dev);
>> +	} else {
>> +		netif_carrier_off(dev);
>> +		netif_tx_stop_all_queues(dev);
>> +	}
>> +
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static int virtnet_bypass_event(struct notifier_block *this,
>> +				unsigned long event, void *ptr)
>> +{
>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	/* Skip our own events */
>> +	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>> +		return NOTIFY_DONE;
>> +
>> +	/* Avoid non-Ethernet type devices */
>> +	if (event_dev->type != ARPHRD_ETHER)
>> +		return NOTIFY_DONE;
>> +
>> +	/* Avoid Vlan dev with same MAC registering as child dev */
>> +	if (is_vlan_dev(event_dev))
>> +		return NOTIFY_DONE;
>> +
>> +	/* Avoid Bonding master dev with same MAC registering as child dev */
>> +	if ((event_dev->priv_flags & IFF_BONDING) &&
>> +	    (event_dev->flags & IFF_MASTER))
>> +		return NOTIFY_DONE;
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return virtnet_bypass_register_child(event_dev);
>> +	case NETDEV_UNREGISTER:
>> +		return virtnet_bypass_unregister_child(event_dev);
>> +	case NETDEV_UP:
>> +	case NETDEV_DOWN:
>> +	case NETDEV_CHANGE:
>> +		return virtnet_bypass_update_link(event_dev);
>> +	default:
>> +		return NOTIFY_DONE;
>> +	}
>> +}
> For example this function is 1:1 copy of netvsc, even with comments
> and bugs (like IFF_BODING check).
>
> This is also something that should be shared with netvsc.

The problem is with the calls that are made from this function.
Sharing would have been possible if both virtio and netvsc went with
same netdev model.

>
>
>> +
>> +static struct notifier_block virtnet_bypass_notifier = {
>> +	.notifier_call = virtnet_bypass_event,
>> +};
>> +
>> +static int virtnet_bypass_create(struct virtnet_info *vi)
>> +{
>> +	struct net_device *backup_netdev = vi->dev;
>> +	struct device *dev = &vi->vdev->dev;
>> +	struct net_device *bypass_netdev;
>> +	int res;
>> +
>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> +	 * that most devices being bonded won't have too many queues.
>> +	 */
>> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>> +					  16);
>> +	if (!bypass_netdev) {
>> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> +	SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> +	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>> +	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>> +
>> +	/* Initialize the device options */
>> +	bypass_netdev->flags |= IFF_MASTER;
>> +	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
> No clue why you set IFF_BONDING here...

This is to indicate to the userspace that this is a master bond device.
May be it is sufficient to just set IFF_MASTER.

>
>
>
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-12 20:58       ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-12 20:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici



On 3/12/2018 1:12 PM, Jiri Pirko wrote:
> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> [...]
>
>
>> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>> +{
>> +	struct virtnet_bypass_info *vbi;
>> +	struct net_device *dev;
>> +	bool backup;
>> +	int ret;
>> +
>> +	if (child_netdev->addr_len != ETH_ALEN)
>> +		return NOTIFY_DONE;
>> +
>> +	/* We will use the MAC address to locate the virtnet_bypass netdev
>> +	 * to associate with the child netdev. If we don't find a matching
>> +	 * bypass netdev, move on.
>> +	 */
>> +	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>> +				       child_netdev->perm_addr);
>> +	if (!dev)
>> +		return NOTIFY_DONE;
>> +
>> +	vbi = netdev_priv(dev);
>> +	backup = (child_netdev->dev.parent == dev->dev.parent);
>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> +			rtnl_dereference(vbi->active_netdev)) {
>> +		netdev_info(dev,
>> +			    "%s attempting to join bypass dev when %s already present\n",
>> +			    child_netdev->name, backup ? "backup" : "active");
>> +		return NOTIFY_DONE;
>> +	}
>> +
>> +	/* Avoid non pci devices as active netdev */
>> +	if (!backup && (!child_netdev->dev.parent ||
>> +			!dev_is_pci(child_netdev->dev.parent)))
>> +		return NOTIFY_DONE;
>> +
>> +	ret = netdev_rx_handler_register(child_netdev,
>> +					 virtnet_bypass_handle_frame, dev);
>> +	if (ret != 0) {
>> +		netdev_err(child_netdev,
>> +			   "can not register bypass receive handler (err = %d)\n",
>> +			   ret);
>> +		goto rx_handler_failed;
>> +	}
>> +
>> +	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>> +	if (ret != 0) {
>> +		netdev_err(child_netdev,
>> +			   "can not set master device %s (err = %d)\n",
>> +			   dev->name, ret);
>> +		goto upper_link_failed;
>> +	}
>> +
>> +	child_netdev->flags |= IFF_SLAVE;
>> +
>> +	if (netif_running(dev)) {
>> +		ret = dev_open(child_netdev);
>> +		if (ret && (ret != -EBUSY)) {
>> +			netdev_err(dev, "Opening child %s failed ret:%d\n",
>> +				   child_netdev->name, ret);
>> +			goto err_interface_up;
>> +		}
>> +	}
> Much of this function is copy of netvsc_vf_join, should be shared with
> netvsc.

Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified
to handle registration of 2 child netdevs(backup virtio and active vf).  In case
of netvsc, they only register VF. There is no backup netdev.
I think trying to make this code shareable with netvsc will introduce additional
checks and make it convoluted.


>
>
>> +
>> +	/* Align MTU of child with master */
>> +	ret = dev_set_mtu(child_netdev, dev->mtu);
>> +	if (ret) {
>> +		netdev_err(dev,
>> +			   "unable to change mtu of %s to %u register failed\n",
>> +			   child_netdev->name, dev->mtu);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>> +
>> +	netdev_info(dev, "registering %s\n", child_netdev->name);
>> +
>> +	dev_hold(child_netdev);
>> +	if (backup) {
>> +		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>> +		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>> +	} else {
>> +		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>> +		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>> +		dev->min_mtu = child_netdev->min_mtu;
>> +		dev->max_mtu = child_netdev->max_mtu;
>> +	}
>> +
>> +	return NOTIFY_OK;
>> +
>> +err_set_mtu:
>> +	dev_close(child_netdev);
>> +err_interface_up:
>> +	netdev_upper_dev_unlink(child_netdev, dev);
>> +	child_netdev->flags &= ~IFF_SLAVE;
>> +upper_link_failed:
>> +	netdev_rx_handler_unregister(child_netdev);
>> +rx_handler_failed:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>> +{
>> +	struct virtnet_bypass_info *vbi;
>> +	struct net_device *dev, *backup;
>> +
>> +	dev = get_virtnet_bypass_byref(child_netdev);
>> +	if (!dev)
>> +		return NOTIFY_DONE;
>> +
>> +	vbi = netdev_priv(dev);
>> +
>> +	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>> +
>> +	netdev_rx_handler_unregister(child_netdev);
>> +	netdev_upper_dev_unlink(child_netdev, dev);
>> +	child_netdev->flags &= ~IFF_SLAVE;
>> +
>> +	if (child_netdev->dev.parent == dev->dev.parent) {
>> +		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>> +	} else {
>> +		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>> +		backup = rtnl_dereference(vbi->backup_netdev);
>> +		if (backup) {
>> +			dev->min_mtu = backup->min_mtu;
>> +			dev->max_mtu = backup->max_mtu;
>> +		}
>> +	}
>> +
>> +	dev_put(child_netdev);
>> +
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static int virtnet_bypass_update_link(struct net_device *child_netdev)
>> +{
>> +	struct net_device *dev, *active, *backup;
>> +	struct virtnet_bypass_info *vbi;
>> +
>> +	dev = get_virtnet_bypass_byref(child_netdev);
>> +	if (!dev || !netif_running(dev))
>> +		return NOTIFY_DONE;
>> +
>> +	vbi = netdev_priv(dev);
>> +
>> +	active = rtnl_dereference(vbi->active_netdev);
>> +	backup = rtnl_dereference(vbi->backup_netdev);
>> +
>> +	if ((active && virtnet_bypass_xmit_ready(active)) ||
>> +	    (backup && virtnet_bypass_xmit_ready(backup))) {
>> +		netif_carrier_on(dev);
>> +		netif_tx_wake_all_queues(dev);
>> +	} else {
>> +		netif_carrier_off(dev);
>> +		netif_tx_stop_all_queues(dev);
>> +	}
>> +
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static int virtnet_bypass_event(struct notifier_block *this,
>> +				unsigned long event, void *ptr)
>> +{
>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	/* Skip our own events */
>> +	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>> +		return NOTIFY_DONE;
>> +
>> +	/* Avoid non-Ethernet type devices */
>> +	if (event_dev->type != ARPHRD_ETHER)
>> +		return NOTIFY_DONE;
>> +
>> +	/* Avoid Vlan dev with same MAC registering as child dev */
>> +	if (is_vlan_dev(event_dev))
>> +		return NOTIFY_DONE;
>> +
>> +	/* Avoid Bonding master dev with same MAC registering as child dev */
>> +	if ((event_dev->priv_flags & IFF_BONDING) &&
>> +	    (event_dev->flags & IFF_MASTER))
>> +		return NOTIFY_DONE;
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return virtnet_bypass_register_child(event_dev);
>> +	case NETDEV_UNREGISTER:
>> +		return virtnet_bypass_unregister_child(event_dev);
>> +	case NETDEV_UP:
>> +	case NETDEV_DOWN:
>> +	case NETDEV_CHANGE:
>> +		return virtnet_bypass_update_link(event_dev);
>> +	default:
>> +		return NOTIFY_DONE;
>> +	}
>> +}
> For example this function is 1:1 copy of netvsc, even with comments
> and bugs (like IFF_BODING check).
>
> This is also something that should be shared with netvsc.

The problem is with the calls that are made from this function.
Sharing would have been possible if both virtio and netvsc went with
same netdev model.

>
>
>> +
>> +static struct notifier_block virtnet_bypass_notifier = {
>> +	.notifier_call = virtnet_bypass_event,
>> +};
>> +
>> +static int virtnet_bypass_create(struct virtnet_info *vi)
>> +{
>> +	struct net_device *backup_netdev = vi->dev;
>> +	struct device *dev = &vi->vdev->dev;
>> +	struct net_device *bypass_netdev;
>> +	int res;
>> +
>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> +	 * that most devices being bonded won't have too many queues.
>> +	 */
>> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>> +					  16);
>> +	if (!bypass_netdev) {
>> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> +	SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> +	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>> +	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>> +
>> +	/* Initialize the device options */
>> +	bypass_netdev->flags |= IFF_MASTER;
>> +	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
> No clue why you set IFF_BONDING here...

This is to indicate to the userspace that this is a master bond device.
May be it is sufficient to just set IFF_MASTER.

>
>
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-12 20:58       ` [virtio-dev] " Samudrala, Sridhar
  (?)
@ 2018-03-12 21:08       ` Jiri Pirko
  2018-03-14  0:36           ` [virtio-dev] " Samudrala, Sridhar
  -1 siblings, 1 reply; 102+ messages in thread
From: Jiri Pirko @ 2018-03-12 21:08 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici

Mon, Mar 12, 2018 at 09:58:06PM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 3/12/2018 1:12 PM, Jiri Pirko wrote:
>> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > netdev is present with the same MAC address. It allows live migration
>> > of a VM with a direct attached VF without the need to setup a bond/team
>> > between a VF and virtio net device in the guest.
>> > 
>> > The hypervisor needs to enable only one datapath at any time so that
>> > packets don't get looped back to the VM over the other datapath. When a VF
>> > is plugged, the virtio datapath link state can be marked as down. The
>> > hypervisor needs to unplug the VF device from the guest on the source host
>> > and reset the MAC filter of the VF to initiate failover of datapath to
>> > virtio before starting the migration. After the migration is completed,
>> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> > to the guest to switch over to VF datapath.
>> > 
>> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> > created that acts as a master device and tracks the state of the 2 lower
>> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> > passthru device with the same MAC is registered as 'active' netdev.
>> > 
>> > This patch is based on the discussion initiated by Jesse on this thread.
>> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> [...]
>> 
>> 
>> > +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>> > +{
>> > +	struct virtnet_bypass_info *vbi;
>> > +	struct net_device *dev;
>> > +	bool backup;
>> > +	int ret;
>> > +
>> > +	if (child_netdev->addr_len != ETH_ALEN)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	/* We will use the MAC address to locate the virtnet_bypass netdev
>> > +	 * to associate with the child netdev. If we don't find a matching
>> > +	 * bypass netdev, move on.
>> > +	 */
>> > +	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>> > +				       child_netdev->perm_addr);
>> > +	if (!dev)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	vbi = netdev_priv(dev);
>> > +	backup = (child_netdev->dev.parent == dev->dev.parent);
>> > +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > +			rtnl_dereference(vbi->active_netdev)) {
>> > +		netdev_info(dev,
>> > +			    "%s attempting to join bypass dev when %s already present\n",
>> > +			    child_netdev->name, backup ? "backup" : "active");
>> > +		return NOTIFY_DONE;
>> > +	}
>> > +
>> > +	/* Avoid non pci devices as active netdev */
>> > +	if (!backup && (!child_netdev->dev.parent ||
>> > +			!dev_is_pci(child_netdev->dev.parent)))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	ret = netdev_rx_handler_register(child_netdev,
>> > +					 virtnet_bypass_handle_frame, dev);
>> > +	if (ret != 0) {
>> > +		netdev_err(child_netdev,
>> > +			   "can not register bypass receive handler (err = %d)\n",
>> > +			   ret);
>> > +		goto rx_handler_failed;
>> > +	}
>> > +
>> > +	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>> > +	if (ret != 0) {
>> > +		netdev_err(child_netdev,
>> > +			   "can not set master device %s (err = %d)\n",
>> > +			   dev->name, ret);
>> > +		goto upper_link_failed;
>> > +	}
>> > +
>> > +	child_netdev->flags |= IFF_SLAVE;
>> > +
>> > +	if (netif_running(dev)) {
>> > +		ret = dev_open(child_netdev);
>> > +		if (ret && (ret != -EBUSY)) {
>> > +			netdev_err(dev, "Opening child %s failed ret:%d\n",
>> > +				   child_netdev->name, ret);
>> > +			goto err_interface_up;
>> > +		}
>> > +	}
>> Much of this function is copy of netvsc_vf_join, should be shared with
>> netvsc.
>
>Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified
>to handle registration of 2 child netdevs(backup virtio and active vf).ï¿½ In case
>of netvsc, they only register VF. There is no backup netdev.
>I think trying to make this code shareable with netvsc will introduce additional
>checks and make it convoluted.

No problem.


>
>
>> 
>> 
>> > +
>> > +	/* Align MTU of child with master */
>> > +	ret = dev_set_mtu(child_netdev, dev->mtu);
>> > +	if (ret) {
>> > +		netdev_err(dev,
>> > +			   "unable to change mtu of %s to %u register failed\n",
>> > +			   child_netdev->name, dev->mtu);
>> > +		goto err_set_mtu;
>> > +	}
>> > +
>> > +	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>> > +
>> > +	netdev_info(dev, "registering %s\n", child_netdev->name);
>> > +
>> > +	dev_hold(child_netdev);
>> > +	if (backup) {
>> > +		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>> > +		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>> > +	} else {
>> > +		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>> > +		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>> > +		dev->min_mtu = child_netdev->min_mtu;
>> > +		dev->max_mtu = child_netdev->max_mtu;
>> > +	}
>> > +
>> > +	return NOTIFY_OK;
>> > +
>> > +err_set_mtu:
>> > +	dev_close(child_netdev);
>> > +err_interface_up:
>> > +	netdev_upper_dev_unlink(child_netdev, dev);
>> > +	child_netdev->flags &= ~IFF_SLAVE;
>> > +upper_link_failed:
>> > +	netdev_rx_handler_unregister(child_netdev);
>> > +rx_handler_failed:
>> > +	return NOTIFY_DONE;
>> > +}
>> > +
>> > +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>> > +{
>> > +	struct virtnet_bypass_info *vbi;
>> > +	struct net_device *dev, *backup;
>> > +
>> > +	dev = get_virtnet_bypass_byref(child_netdev);
>> > +	if (!dev)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	vbi = netdev_priv(dev);
>> > +
>> > +	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>> > +
>> > +	netdev_rx_handler_unregister(child_netdev);
>> > +	netdev_upper_dev_unlink(child_netdev, dev);
>> > +	child_netdev->flags &= ~IFF_SLAVE;
>> > +
>> > +	if (child_netdev->dev.parent == dev->dev.parent) {
>> > +		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>> > +	} else {
>> > +		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>> > +		backup = rtnl_dereference(vbi->backup_netdev);
>> > +		if (backup) {
>> > +			dev->min_mtu = backup->min_mtu;
>> > +			dev->max_mtu = backup->max_mtu;
>> > +		}
>> > +	}
>> > +
>> > +	dev_put(child_netdev);
>> > +
>> > +	return NOTIFY_OK;
>> > +}
>> > +
>> > +static int virtnet_bypass_update_link(struct net_device *child_netdev)
>> > +{
>> > +	struct net_device *dev, *active, *backup;
>> > +	struct virtnet_bypass_info *vbi;
>> > +
>> > +	dev = get_virtnet_bypass_byref(child_netdev);
>> > +	if (!dev || !netif_running(dev))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	vbi = netdev_priv(dev);
>> > +
>> > +	active = rtnl_dereference(vbi->active_netdev);
>> > +	backup = rtnl_dereference(vbi->backup_netdev);
>> > +
>> > +	if ((active && virtnet_bypass_xmit_ready(active)) ||
>> > +	    (backup && virtnet_bypass_xmit_ready(backup))) {
>> > +		netif_carrier_on(dev);
>> > +		netif_tx_wake_all_queues(dev);
>> > +	} else {
>> > +		netif_carrier_off(dev);
>> > +		netif_tx_stop_all_queues(dev);
>> > +	}
>> > +
>> > +	return NOTIFY_OK;
>> > +}
>> > +
>> > +static int virtnet_bypass_event(struct notifier_block *this,
>> > +				unsigned long event, void *ptr)
>> > +{
>> > +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> > +
>> > +	/* Skip our own events */
>> > +	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	/* Avoid non-Ethernet type devices */
>> > +	if (event_dev->type != ARPHRD_ETHER)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	/* Avoid Vlan dev with same MAC registering as child dev */
>> > +	if (is_vlan_dev(event_dev))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	/* Avoid Bonding master dev with same MAC registering as child dev */
>> > +	if ((event_dev->priv_flags & IFF_BONDING) &&
>> > +	    (event_dev->flags & IFF_MASTER))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	switch (event) {
>> > +	case NETDEV_REGISTER:
>> > +		return virtnet_bypass_register_child(event_dev);
>> > +	case NETDEV_UNREGISTER:
>> > +		return virtnet_bypass_unregister_child(event_dev);
>> > +	case NETDEV_UP:
>> > +	case NETDEV_DOWN:
>> > +	case NETDEV_CHANGE:
>> > +		return virtnet_bypass_update_link(event_dev);
>> > +	default:
>> > +		return NOTIFY_DONE;
>> > +	}
>> > +}
>> For example this function is 1:1 copy of netvsc, even with comments
>> and bugs (like IFF_BODING check).
>> 
>> This is also something that should be shared with netvsc.
>
>The problem is with the calls that are made from this function.
>Sharing would have been possible if both virtio and netvsc went with
>same netdev model.

ops? You can still share.


>
>> 
>> 
>> > +
>> > +static struct notifier_block virtnet_bypass_notifier = {
>> > +	.notifier_call = virtnet_bypass_event,
>> > +};
>> > +
>> > +static int virtnet_bypass_create(struct virtnet_info *vi)
>> > +{
>> > +	struct net_device *backup_netdev = vi->dev;
>> > +	struct device *dev = &vi->vdev->dev;
>> > +	struct net_device *bypass_netdev;
>> > +	int res;
>> > +
>> > +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> > +	 * that most devices being bonded won't have too many queues.
>> > +	 */
>> > +	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>> > +					  16);
>> > +	if (!bypass_netdev) {
>> > +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> > +		return -ENOMEM;
>> > +	}
>> > +
>> > +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> > +	SET_NETDEV_DEV(bypass_netdev, dev);
>> > +
>> > +	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>> > +	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>> > +
>> > +	/* Initialize the device options */
>> > +	bypass_netdev->flags |= IFF_MASTER;
>> > +	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
>> No clue why you set IFF_BONDING here...
>
>This is to indicate to the userspace that this is a master bond device.

This is not a master bond device! If you say this, it makes me really
worried....



>May be it is sufficient to just set IFF_MASTER.
>
>> 
>> 
>> 
>> 
>> 
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-04  4:04           ` [virtio-dev] " Michael S. Tsirkin
@ 2018-03-12 21:53             ` Siwei Liu
  -1 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-12 21:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Sat, Mar 3, 2018 at 8:04 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 03:56:31PM -0800, Siwei Liu wrote:
>> On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
>> >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> >> <sridhar.samudrala@intel.com> wrote:
>> >> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> >> > netdev is present with the same MAC address. It allows live migration
>> >> > of a VM with a direct attached VF without the need to setup a bond/team
>> >> > between a VF and virtio net device in the guest.
>> >> >
>> >> > The hypervisor needs to enable only one datapath at any time so that
>> >> > packets don't get looped back to the VM over the other datapath. When a VF
>> >> > is plugged, the virtio datapath link state can be marked as down. The
>> >> > hypervisor needs to unplug the VF device from the guest on the source host
>> >> > and reset the MAC filter of the VF to initiate failover of datapath to
>> >> > virtio before starting the migration. After the migration is completed,
>> >> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> >> > to the guest to switch over to VF datapath.
>> >> >
>> >> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> >> > created that acts as a master device and tracks the state of the 2 lower
>> >> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> >> > passthru device with the same MAC is registered as 'active' netdev.
>> >> >
>> >> > This patch is based on the discussion initiated by Jesse on this thread.
>> >> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> >> >
>> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> >> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> >> > ---
>> >> >  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>> >> >  1 file changed, 682 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> >> > index bcd13fe906ca..f2860d86c952 100644
>> >> > --- a/drivers/net/virtio_net.c
>> >> > +++ b/drivers/net/virtio_net.c
>> >> > @@ -30,6 +30,8 @@
>> >> >  #include <linux/cpu.h>
>> >> >  #include <linux/average.h>
>> >> >  #include <linux/filter.h>
>> >> > +#include <linux/netdevice.h>
>> >> > +#include <linux/pci.h>
>> >> >  #include <net/route.h>
>> >> >  #include <net/xdp.h>
>> >> >
>> >> > @@ -206,6 +208,9 @@ struct virtnet_info {
>> >> >         u32 speed;
>> >> >
>> >> >         unsigned long guest_offloads;
>> >> > +
>> >> > +       /* upper netdev created when BACKUP feature enabled */
>> >> > +       struct net_device *bypass_netdev;
>> >> >  };
>> >> >
>> >> >  struct padded_vnet_hdr {
>> >> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>> >> >         }
>> >> >  }
>> >> >
>> >> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> >> > +                                     size_t len)
>> >> > +{
>> >> > +       struct virtnet_info *vi = netdev_priv(dev);
>> >> > +       int ret;
>> >> > +
>> >> > +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> >> > +               return -EOPNOTSUPP;
>> >> > +
>> >> > +       ret = snprintf(buf, len, "_bkup");
>> >> > +       if (ret >= len)
>> >> > +               return -EOPNOTSUPP;
>> >> > +
>> >> > +       return 0;
>> >> > +}
>> >> > +
>> >>
>> >> What if the systemd/udevd is not new enough to enforce the
>> >> n<phys_port_name> naming? Would virtio_bypass get a different name
>> >> than the original virtio_net?
>> >
>> > You mean people using ethX names? Any hardware config change breaks
>> > these, I don't think that can be helped.
>>
>> I don't like the way to rely on .ndo_get_phys_port_name - it's fragile
>> and it does not completely solve the problem it tries to address.
>> Imagine what can end up with if getting an old udevd, or users already
>> have exsiting explicit udev rules around phys_port_name. It does not
>> give you the an ack in saying "yes, I know you're the bypass and
>> you're the backup, please continue and I will give you both correct
>> names", or an unacknowlegment saying "no, I don't know what these
>> extra interfaces are, please go back and leave the VF device alone".
>> We need new udev API for both feature negotiation and naming, or may
>> even completely hide the lower interfaces.
>
> Go ahead and try to make this happen, but I won't hold my
> breath.

Heads-up: I have some working code to hide the lower devices, upon
which the new udev/sysfs interace for feature negotiation and naming
will be built. I will get it posted for review in the next few days
once ready. Hopefully this could end the fight between the 3-netdev
and 2-netdev driver model, as I see most of the problems being argued
in this thread can be addressed by just hiding the lower devices from
the 3-netdev model while keeping its merits still.

Regards,
-Siwei


>
>> >
>> >> Should we detect this earlier and fall
>> >> back to legacy mode without creating the bypass netdev and ensalving
>> >> the VF?
>> >
>> > I don't think we can do this with existing kernel/userspace APIs.
>>
>> That's why I ever said to make udev aware of this new type of combined
>> device instead of doing hacks here and there around.
>>
>> Regards,
>> -Siwei
>
> We can add new interfaces on top but the main purpose here is to
> make old userspace do new tricks.
>
>> >
>> > --
>> > MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-12 21:53             ` Siwei Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-12 21:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Sat, Mar 3, 2018 at 8:04 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 03:56:31PM -0800, Siwei Liu wrote:
>> On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
>> >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> >> <sridhar.samudrala@intel.com> wrote:
>> >> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> >> > netdev is present with the same MAC address. It allows live migration
>> >> > of a VM with a direct attached VF without the need to setup a bond/team
>> >> > between a VF and virtio net device in the guest.
>> >> >
>> >> > The hypervisor needs to enable only one datapath at any time so that
>> >> > packets don't get looped back to the VM over the other datapath. When a VF
>> >> > is plugged, the virtio datapath link state can be marked as down. The
>> >> > hypervisor needs to unplug the VF device from the guest on the source host
>> >> > and reset the MAC filter of the VF to initiate failover of datapath to
>> >> > virtio before starting the migration. After the migration is completed,
>> >> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> >> > to the guest to switch over to VF datapath.
>> >> >
>> >> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> >> > created that acts as a master device and tracks the state of the 2 lower
>> >> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> >> > passthru device with the same MAC is registered as 'active' netdev.
>> >> >
>> >> > This patch is based on the discussion initiated by Jesse on this thread.
>> >> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> >> >
>> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> >> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> >> > ---
>> >> >  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>> >> >  1 file changed, 682 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> >> > index bcd13fe906ca..f2860d86c952 100644
>> >> > --- a/drivers/net/virtio_net.c
>> >> > +++ b/drivers/net/virtio_net.c
>> >> > @@ -30,6 +30,8 @@
>> >> >  #include <linux/cpu.h>
>> >> >  #include <linux/average.h>
>> >> >  #include <linux/filter.h>
>> >> > +#include <linux/netdevice.h>
>> >> > +#include <linux/pci.h>
>> >> >  #include <net/route.h>
>> >> >  #include <net/xdp.h>
>> >> >
>> >> > @@ -206,6 +208,9 @@ struct virtnet_info {
>> >> >         u32 speed;
>> >> >
>> >> >         unsigned long guest_offloads;
>> >> > +
>> >> > +       /* upper netdev created when BACKUP feature enabled */
>> >> > +       struct net_device *bypass_netdev;
>> >> >  };
>> >> >
>> >> >  struct padded_vnet_hdr {
>> >> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>> >> >         }
>> >> >  }
>> >> >
>> >> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> >> > +                                     size_t len)
>> >> > +{
>> >> > +       struct virtnet_info *vi = netdev_priv(dev);
>> >> > +       int ret;
>> >> > +
>> >> > +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> >> > +               return -EOPNOTSUPP;
>> >> > +
>> >> > +       ret = snprintf(buf, len, "_bkup");
>> >> > +       if (ret >= len)
>> >> > +               return -EOPNOTSUPP;
>> >> > +
>> >> > +       return 0;
>> >> > +}
>> >> > +
>> >>
>> >> What if the systemd/udevd is not new enough to enforce the
>> >> n<phys_port_name> naming? Would virtio_bypass get a different name
>> >> than the original virtio_net?
>> >
>> > You mean people using ethX names? Any hardware config change breaks
>> > these, I don't think that can be helped.
>>
>> I don't like the way to rely on .ndo_get_phys_port_name - it's fragile
>> and it does not completely solve the problem it tries to address.
>> Imagine what can end up with if getting an old udevd, or users already
>> have exsiting explicit udev rules around phys_port_name. It does not
>> give you the an ack in saying "yes, I know you're the bypass and
>> you're the backup, please continue and I will give you both correct
>> names", or an unacknowlegment saying "no, I don't know what these
>> extra interfaces are, please go back and leave the VF device alone".
>> We need new udev API for both feature negotiation and naming, or may
>> even completely hide the lower interfaces.
>
> Go ahead and try to make this happen, but I won't hold my
> breath.

Heads-up: I have some working code to hide the lower devices, upon
which the new udev/sysfs interace for feature negotiation and naming
will be built. I will get it posted for review in the next few days
once ready. Hopefully this could end the fight between the 3-netdev
and 2-netdev driver model, as I see most of the problems being argued
in this thread can be addressed by just hiding the lower devices from
the 3-netdev model while keeping its merits still.

Regards,
-Siwei


>
>> >
>> >> Should we detect this earlier and fall
>> >> back to legacy mode without creating the bypass netdev and ensalving
>> >> the VF?
>> >
>> > I don't think we can do this with existing kernel/userspace APIs.
>>
>> That's why I ever said to make udev aware of this new type of combined
>> device instead of doing hacks here and there around.
>>
>> Regards,
>> -Siwei
>
> We can add new interfaces on top but the main purpose here is to
> make old userspace do new tricks.
>
>> >
>> > --
>> > MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-01 20:08   ` [virtio-dev] " Sridhar Samudrala
@ 2018-03-12 22:44     ` Siwei Liu
  -1 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-12 22:44 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

Apologies, still some comments going. Please see inline.

On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
<sridhar.samudrala@intel.com> wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
>  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 682 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bcd13fe906ca..f2860d86c952 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
>  #include <linux/cpu.h>
>  #include <linux/average.h>
>  #include <linux/filter.h>
> +#include <linux/netdevice.h>
> +#include <linux/pci.h>
>  #include <net/route.h>
>  #include <net/xdp.h>
>
> @@ -206,6 +208,9 @@ struct virtnet_info {
>         u32 speed;
>
>         unsigned long guest_offloads;
> +
> +       /* upper netdev created when BACKUP feature enabled */
> +       struct net_device *bypass_netdev;
>  };
>
>  struct padded_vnet_hdr {
> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>         }
>  }
>
> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> +                                     size_t len)
> +{
> +       struct virtnet_info *vi = netdev_priv(dev);
> +       int ret;
> +
> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> +               return -EOPNOTSUPP;
> +
> +       ret = snprintf(buf, len, "_bkup");
> +       if (ret >= len)
> +               return -EOPNOTSUPP;
> +
> +       return 0;
> +}
> +
>  static const struct net_device_ops virtnet_netdev = {
>         .ndo_open            = virtnet_open,
>         .ndo_stop            = virtnet_close,
> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>         .ndo_xdp_xmit           = virtnet_xdp_xmit,
>         .ndo_xdp_flush          = virtnet_xdp_flush,
>         .ndo_features_check     = passthru_features_check,
> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>  };
>
>  static void virtnet_config_changed_work(struct work_struct *work)
> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>         return 0;
>  }
>
> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
> + * is created that acts as a master device and tracks the state of the
> + * 2 lower netdevs. The original virtio_net netdev is registered as
> + * 'backup' netdev and a passthru device with the same MAC is registered
> + * as 'active' netdev.
> + */
> +
> +/* bypass state maintained when BACKUP feature is enabled */
> +struct virtnet_bypass_info {
> +       /* passthru netdev with same MAC */
> +       struct net_device __rcu *active_netdev;
> +
> +       /* virtio_net netdev */
> +       struct net_device __rcu *backup_netdev;
> +
> +       /* active netdev stats */
> +       struct rtnl_link_stats64 active_stats;
> +
> +       /* backup netdev stats */
> +       struct rtnl_link_stats64 backup_stats;
> +
> +       /* aggregated stats */
> +       struct rtnl_link_stats64 bypass_stats;
> +
> +       /* spinlock while updating stats */
> +       spinlock_t stats_lock;
> +};
> +
> +static void virtnet_bypass_child_open(struct net_device *dev,
> +                                     struct net_device *child_netdev)
> +{
> +       int err = dev_open(child_netdev);
> +
> +       if (err)
> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
> +                           child_netdev->name, err);
> +}

Why ignoring the error?? i.e. virtnet_bypass_child_open should have
return value. I don't believe the caller is in a safe context to
guarantee the dev_open always succeeds.

> +
> +static int virtnet_bypass_open(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       netif_carrier_off(dev);
> +       netif_tx_wake_all_queues(dev);
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_child_open(dev, child_netdev);

Handle the error?

> +
> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_child_open(dev, child_netdev);
Handle the error and unwind?

> +
> +       return 0;
> +}
> +
> +static int virtnet_bypass_close(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       netif_tx_disable(dev);
> +
> +       child_netdev = rtnl_dereference(vi->active_netdev);
> +       if (child_netdev)
> +               dev_close(child_netdev);
> +
> +       child_netdev = rtnl_dereference(vi->backup_netdev);
> +       if (child_netdev)
> +               dev_close(child_netdev);
> +
> +       return 0;
> +}
> +
> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
> +                                           struct net_device *dev)
> +{
> +       atomic_long_inc(&dev->tx_dropped);
> +       dev_kfree_skb_any(skb);
> +       return NETDEV_TX_OK;
> +}
> +
> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
> +{
> +       return netif_running(dev) && netif_carrier_ok(dev);
> +}
> +
> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
> +                                            struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *xmit_dev;
> +
> +       /* Try xmit via active netdev followed by backup netdev */
> +       xmit_dev = rcu_dereference_bh(vbi->active_netdev);
> +       if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
> +               xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
> +               if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
> +                       return virtnet_bypass_drop_xmit(skb, dev);
> +       }
> +
> +       skb->dev = xmit_dev;
> +       skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> +       return dev_queue_xmit(skb);
> +}
> +
> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
> +                                      struct sk_buff *skb, void *accel_priv,
> +                                      select_queue_fallback_t fallback)
> +{
> +       /* This helper function exists to help dev_pick_tx get the correct
> +        * destination queue.  Using a helper function skips a call to
> +        * skb_tx_hash and will put the skbs in the queue we expect on their
> +        * way down to the bonding driver.
> +        */
> +       u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
> +
> +       /* Save the original txq to restore before passing to the driver */
> +       qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
> +
> +       if (unlikely(txq >= dev->real_num_tx_queues)) {
> +               do {
> +                       txq -= dev->real_num_tx_queues;
> +               } while (txq >= dev->real_num_tx_queues);
> +       }
> +
> +       return txq;
> +}
> +
> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
> + * that some drivers can provide 32bit values only.
> + */
> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
> +                                     const struct rtnl_link_stats64 *_new,
> +                                     const struct rtnl_link_stats64 *_old)
> +{
> +       const u64 *new = (const u64 *)_new;
> +       const u64 *old = (const u64 *)_old;
> +       u64 *res = (u64 *)_res;
> +       int i;
> +
> +       for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
> +               u64 nv = new[i];
> +               u64 ov = old[i];
> +               s64 delta = nv - ov;
> +
> +               /* detects if this particular field is 32bit only */
> +               if (((nv | ov) >> 32) == 0)
> +                       delta = (s64)(s32)((u32)nv - (u32)ov);
> +
> +               /* filter anomalies, some drivers reset their stats
> +                * at down/up events.
> +                */
> +               if (delta > 0)
> +                       res[i] += delta;
> +       }
> +}
> +
> +static void virtnet_bypass_get_stats(struct net_device *dev,
> +                                    struct rtnl_link_stats64 *stats)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       const struct rtnl_link_stats64 *new;
> +       struct rtnl_link_stats64 temp;
> +       struct net_device *child_netdev;
> +
> +       spin_lock(&vbi->stats_lock);
> +       memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
> +
> +       rcu_read_lock();
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               new = dev_get_stats(child_netdev, &temp);
> +               virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
> +               memcpy(&vbi->active_stats, new, sizeof(*new));
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               new = dev_get_stats(child_netdev, &temp);
> +               virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
> +               memcpy(&vbi->backup_stats, new, sizeof(*new));
> +       }
> +
> +       rcu_read_unlock();
> +
> +       memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
> +       spin_unlock(&vbi->stats_lock);
> +}
> +
> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +       int ret = 0;
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               ret = dev_set_mtu(child_netdev, new_mtu);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               ret = dev_set_mtu(child_netdev, new_mtu);
> +               if (ret)
> +                       netdev_err(child_netdev,
> +                                  "Unexpected failure to set mtu to %d\n",
> +                                  new_mtu);
> +       }
> +
> +       dev->mtu = new_mtu;
> +       return 0;
> +}
> +
> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       rcu_read_lock();
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               dev_uc_sync_multiple(child_netdev, dev);
> +               dev_mc_sync_multiple(child_netdev, dev);
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               dev_uc_sync_multiple(child_netdev, dev);
> +               dev_mc_sync_multiple(child_netdev, dev);
> +       }
> +
> +       rcu_read_unlock();
> +}
> +
> +static const struct net_device_ops virtnet_bypass_netdev_ops = {
> +       .ndo_open               = virtnet_bypass_open,
> +       .ndo_stop               = virtnet_bypass_close,
> +       .ndo_start_xmit         = virtnet_bypass_start_xmit,
> +       .ndo_select_queue       = virtnet_bypass_select_queue,
> +       .ndo_get_stats64        = virtnet_bypass_get_stats,
> +       .ndo_change_mtu         = virtnet_bypass_change_mtu,
> +       .ndo_set_rx_mode        = virtnet_bypass_set_rx_mode,
> +       .ndo_validate_addr      = eth_validate_addr,
> +       .ndo_features_check     = passthru_features_check,
> +};
> +
> +static int
> +virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
> +                                         struct ethtool_link_ksettings *cmd)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
> +               child_netdev = rtnl_dereference(vbi->backup_netdev);
> +               if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
> +                       cmd->base.duplex = DUPLEX_UNKNOWN;
> +                       cmd->base.port = PORT_OTHER;
> +                       cmd->base.speed = SPEED_UNKNOWN;
> +
> +                       return 0;
> +               }
> +       }
> +
> +       return __ethtool_get_link_ksettings(child_netdev, cmd);
> +}
> +
> +#define BYPASS_DRV_NAME "virtnet_bypass"
> +#define BYPASS_DRV_VERSION "0.1"
> +
> +static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
> +                                              struct ethtool_drvinfo *drvinfo)
> +{
> +       strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
> +       strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
> +}
> +
> +static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
> +       .get_drvinfo            = virtnet_bypass_ethtool_get_drvinfo,
> +       .get_link               = ethtool_op_get_link,
> +       .get_link_ksettings     = virtnet_bypass_ethtool_get_link_ksettings,
> +};
> +
> +static struct net_device *get_virtnet_bypass_bymac(struct net *net,
> +                                                  const u8 *mac)
> +{
> +       struct net_device *dev;
> +
> +       ASSERT_RTNL();
> +
> +       for_each_netdev(net, dev) {
> +               if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +                       continue;       /* not a virtnet_bypass device */
> +
> +               if (ether_addr_equal(mac, dev->perm_addr))
> +                       return dev;
> +       }
> +
> +       return NULL;
> +}
> +
> +static struct net_device *
> +get_virtnet_bypass_byref(struct net_device *child_netdev)
> +{
> +       struct net *net = dev_net(child_netdev);
> +       struct net_device *dev;
> +
> +       ASSERT_RTNL();
> +
> +       for_each_netdev(net, dev) {
> +               struct virtnet_bypass_info *vbi;
> +
> +               if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +                       continue;       /* not a virtnet_bypass device */
> +
> +               vbi = netdev_priv(dev);
> +
> +               if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
> +                   (rtnl_dereference(vbi->backup_netdev) == child_netdev))
> +                       return dev;     /* a match */
> +       }
> +
> +       return NULL;
> +}
> +
> +/* Called when child dev is injecting data into network stack.
> + * Change the associated network device from lower dev to virtio.
> + * note: already called with rcu_read_lock
> + */
> +static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
> +{
> +       struct sk_buff *skb = *pskb;
> +       struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
> +
> +       skb->dev = ndev;
> +
> +       return RX_HANDLER_ANOTHER;
> +}
> +
> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev;
> +       bool backup;
> +       int ret;
> +
> +       if (child_netdev->addr_len != ETH_ALEN)
> +               return NOTIFY_DONE;
> +
> +       /* We will use the MAC address to locate the virtnet_bypass netdev
> +        * to associate with the child netdev. If we don't find a matching
> +        * bypass netdev, move on.
> +        */
> +       dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
> +                                      child_netdev->perm_addr);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +       backup = (child_netdev->dev.parent == dev->dev.parent);
> +       if (backup ? rtnl_dereference(vbi->backup_netdev) :
> +                       rtnl_dereference(vbi->active_netdev)) {
> +               netdev_info(dev,
> +                           "%s attempting to join bypass dev when %s already present\n",
> +                           child_netdev->name, backup ? "backup" : "active");
> +               return NOTIFY_DONE;
> +       }
> +
> +       /* Avoid non pci devices as active netdev */
> +       if (!backup && (!child_netdev->dev.parent ||
> +                       !dev_is_pci(child_netdev->dev.parent)))
> +               return NOTIFY_DONE;
> +

There's a problem here in terms of error (particularly on the active
slave, e.g. VF), see below:

> +       ret = netdev_rx_handler_register(child_netdev,
> +                                        virtnet_bypass_handle_frame, dev);
> +       if (ret != 0) {
> +               netdev_err(child_netdev,
> +                          "can not register bypass receive handler (err = %d)\n",
> +                          ret);
> +               goto rx_handler_failed;
> +       }
> +
> +       ret = netdev_upper_dev_link(child_netdev, dev, NULL);
> +       if (ret != 0) {
> +               netdev_err(child_netdev,
> +                          "can not set master device %s (err = %d)\n",
> +                          dev->name, ret);
> +               goto upper_link_failed;
> +       }
> +
> +       child_netdev->flags |= IFF_SLAVE;
> +
> +       if (netif_running(dev)) {
> +               ret = dev_open(child_netdev);
> +               if (ret && (ret != -EBUSY)) {
> +                       netdev_err(dev, "Opening child %s failed ret:%d\n",
> +                                  child_netdev->name, ret);
> +                       goto err_interface_up;
> +               }
> +       }
> +
> +       /* Align MTU of child with master */
> +       ret = dev_set_mtu(child_netdev, dev->mtu);
> +       if (ret) {
> +               netdev_err(dev,
> +                          "unable to change mtu of %s to %u register failed\n",
> +                          child_netdev->name, dev->mtu);
> +               goto err_set_mtu;
> +       }

If any of the above calls returns non-zero, the current code steps
back to undo what's being done on that spefic slave previously. For
instance, if the netdev_rx_handler_register returns non-zero because
of busy rx handler, this register_child function would give up
enslaving the VF and leave the upper virtio_bypass interface behind
once it returns. I am not sure if it's a good idea to leave the
virtio_bypass around if running into failure: the guest is not
migratable as the VF doesn't have a backup path, And perhaps the worse
part is that, it now has two interfaces with identical MAC address but
one of them is invalid (user cannot use the virtio interface as it has
a dampened datapath). IMHO the virtio_bypass and its lower netdev
should be destroyed at all when it fails to bind the VF, and
technically, there should be some way to propogate the failure status
to the hypervisor/backend, indicating that the VM is not migratable
because of guest software errors (maybe by clearing out the backup
feature from the guest virtio driver so host can see/learn it).

Thanks,
-Siwei

> +
> +       call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
> +
> +       netdev_info(dev, "registering %s\n", child_netdev->name);
> +
> +       dev_hold(child_netdev);
> +       if (backup) {
> +               rcu_assign_pointer(vbi->backup_netdev, child_netdev);
> +               dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
> +       } else {
> +               rcu_assign_pointer(vbi->active_netdev, child_netdev);
> +               dev_get_stats(vbi->active_netdev, &vbi->active_stats);
> +               dev->min_mtu = child_netdev->min_mtu;
> +               dev->max_mtu = child_netdev->max_mtu;
> +       }
> +
> +       return NOTIFY_OK;
> +
> +err_set_mtu:
> +       dev_close(child_netdev);
> +err_interface_up:
> +       netdev_upper_dev_unlink(child_netdev, dev);
> +       child_netdev->flags &= ~IFF_SLAVE;
> +upper_link_failed:
> +       netdev_rx_handler_unregister(child_netdev);
> +rx_handler_failed:
> +       return NOTIFY_DONE;
> +}
> +
> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev, *backup;
> +
> +       dev = get_virtnet_bypass_byref(child_netdev);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +
> +       netdev_info(dev, "unregistering %s\n", child_netdev->name);
> +
> +       netdev_rx_handler_unregister(child_netdev);
> +       netdev_upper_dev_unlink(child_netdev, dev);
> +       child_netdev->flags &= ~IFF_SLAVE;
> +
> +       if (child_netdev->dev.parent == dev->dev.parent) {
> +               RCU_INIT_POINTER(vbi->backup_netdev, NULL);
> +       } else {
> +               RCU_INIT_POINTER(vbi->active_netdev, NULL);
> +               backup = rtnl_dereference(vbi->backup_netdev);
> +               if (backup) {
> +                       dev->min_mtu = backup->min_mtu;
> +                       dev->max_mtu = backup->max_mtu;
> +               }
> +       }
> +
> +       dev_put(child_netdev);
> +
> +       return NOTIFY_OK;
> +}
> +
> +static int virtnet_bypass_update_link(struct net_device *child_netdev)
> +{
> +       struct net_device *dev, *active, *backup;
> +       struct virtnet_bypass_info *vbi;
> +
> +       dev = get_virtnet_bypass_byref(child_netdev);
> +       if (!dev || !netif_running(dev))
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +
> +       active = rtnl_dereference(vbi->active_netdev);
> +       backup = rtnl_dereference(vbi->backup_netdev);
> +
> +       if ((active && virtnet_bypass_xmit_ready(active)) ||
> +           (backup && virtnet_bypass_xmit_ready(backup))) {
> +               netif_carrier_on(dev);
> +               netif_tx_wake_all_queues(dev);
> +       } else {
> +               netif_carrier_off(dev);
> +               netif_tx_stop_all_queues(dev);
> +       }
> +
> +       return NOTIFY_OK;
> +}
> +
> +static int virtnet_bypass_event(struct notifier_block *this,
> +                               unsigned long event, void *ptr)
> +{
> +       struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +       /* Skip our own events */
> +       if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
> +               return NOTIFY_DONE;
> +
> +       /* Avoid non-Ethernet type devices */
> +       if (event_dev->type != ARPHRD_ETHER)
> +               return NOTIFY_DONE;
> +
> +       /* Avoid Vlan dev with same MAC registering as child dev */
> +       if (is_vlan_dev(event_dev))
> +               return NOTIFY_DONE;
> +
> +       /* Avoid Bonding master dev with same MAC registering as child dev */
> +       if ((event_dev->priv_flags & IFF_BONDING) &&
> +           (event_dev->flags & IFF_MASTER))
> +               return NOTIFY_DONE;
> +
> +       switch (event) {
> +       case NETDEV_REGISTER:
> +               return virtnet_bypass_register_child(event_dev);
> +       case NETDEV_UNREGISTER:
> +               return virtnet_bypass_unregister_child(event_dev);
> +       case NETDEV_UP:
> +       case NETDEV_DOWN:
> +       case NETDEV_CHANGE:
> +               return virtnet_bypass_update_link(event_dev);
> +       default:
> +               return NOTIFY_DONE;
> +       }
> +}
> +
> +static struct notifier_block virtnet_bypass_notifier = {
> +       .notifier_call = virtnet_bypass_event,
> +};
> +
> +static int virtnet_bypass_create(struct virtnet_info *vi)
> +{
> +       struct net_device *backup_netdev = vi->dev;
> +       struct device *dev = &vi->vdev->dev;
> +       struct net_device *bypass_netdev;
> +       int res;
> +
> +       /* Alloc at least 2 queues, for now we are going with 16 assuming
> +        * that most devices being bonded won't have too many queues.
> +        */
> +       bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
> +                                         16);
> +       if (!bypass_netdev) {
> +               dev_err(dev, "Unable to allocate bypass_netdev!\n");
> +               return -ENOMEM;
> +       }
> +
> +       dev_net_set(bypass_netdev, dev_net(backup_netdev));
> +       SET_NETDEV_DEV(bypass_netdev, dev);
> +
> +       bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
> +       bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
> +
> +       /* Initialize the device options */
> +       bypass_netdev->flags |= IFF_MASTER;
> +       bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
> +                                    IFF_NO_QUEUE;
> +       bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
> +                                      IFF_TX_SKB_SHARING);
> +
> +       /* don't acquire bypass netdev's netif_tx_lock when transmitting */
> +       bypass_netdev->features |= NETIF_F_LLTX;
> +
> +       /* Don't allow bypass devices to change network namespaces. */
> +       bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
> +
> +       bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
> +                                    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
> +                                    NETIF_F_HIGHDMA | NETIF_F_LRO;
> +
> +       bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
> +       bypass_netdev->features |= bypass_netdev->hw_features;
> +
> +       /* For now treat bypass netdev as VLAN challenged since we
> +        * cannot assume VLAN functionality with a VF
> +        */
> +       bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
> +
> +       memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
> +              bypass_netdev->addr_len);
> +
> +       bypass_netdev->min_mtu = backup_netdev->min_mtu;
> +       bypass_netdev->max_mtu = backup_netdev->max_mtu;
> +
> +       res = register_netdev(bypass_netdev);
> +       if (res < 0) {
> +               dev_err(dev, "Unable to register bypass_netdev!\n");
> +               free_netdev(bypass_netdev);
> +               return res;
> +       }
> +
> +       netif_carrier_off(bypass_netdev);
> +
> +       vi->bypass_netdev = bypass_netdev;
> +
> +       return 0;
> +}
> +
> +static void virtnet_bypass_destroy(struct virtnet_info *vi)
> +{
> +       struct net_device *bypass_netdev = vi->bypass_netdev;
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *child_netdev;
> +
> +       /* no device found, nothing to free */
> +       if (!bypass_netdev)
> +               return;
> +
> +       vbi = netdev_priv(bypass_netdev);
> +
> +       netif_device_detach(bypass_netdev);
> +
> +       rtnl_lock();
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_unregister_child(child_netdev);
> +
> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_unregister_child(child_netdev);
> +
> +       unregister_netdevice(bypass_netdev);
> +
> +       rtnl_unlock();
> +
> +       free_netdev(bypass_netdev);
> +}
> +
> +/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>         int i, err = -ENOMEM;
> @@ -2797,10 +3466,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>
>         virtnet_init_settings(dev);
>
> +       if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
> +               if (virtnet_bypass_create(vi) != 0)
> +                       goto free_vqs;
> +       }
> +
>         err = register_netdev(dev);
>         if (err) {
>                 pr_debug("virtio_net: registering device failed\n");
> -               goto free_vqs;
> +               goto free_bypass;
>         }
>
>         virtio_device_ready(vdev);
> @@ -2837,6 +3511,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>         vi->vdev->config->reset(vdev);
>
>         unregister_netdev(dev);
> +free_bypass:
> +       virtnet_bypass_destroy(vi);
>  free_vqs:
>         cancel_delayed_work_sync(&vi->refill);
>         free_receive_page_frags(vi);
> @@ -2871,6 +3547,8 @@ static void virtnet_remove(struct virtio_device *vdev)
>
>         unregister_netdev(vi->dev);
>
> +       virtnet_bypass_destroy(vi);
> +
>         remove_vq_common(vi);
>
>         free_netdev(vi->dev);
> @@ -2968,6 +3646,8 @@ static __init int virtio_net_driver_init(void)
>          ret = register_virtio_driver(&virtio_net_driver);
>         if (ret)
>                 goto err_virtio;
> +
> +       register_netdevice_notifier(&virtnet_bypass_notifier);
>         return 0;
>  err_virtio:
>         cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> @@ -2980,6 +3660,7 @@ module_init(virtio_net_driver_init);
>
>  static __exit void virtio_net_driver_exit(void)
>  {
> +       unregister_netdevice_notifier(&virtnet_bypass_notifier);
>         unregister_virtio_driver(&virtio_net_driver);
>         cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
>         cpuhp_remove_multi_state(virtionet_online);
> --
> 2.14.3
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-12 22:44     ` Siwei Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-12 22:44 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

Apologies, still some comments going. Please see inline.

On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
<sridhar.samudrala@intel.com> wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
>  drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 682 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bcd13fe906ca..f2860d86c952 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
>  #include <linux/cpu.h>
>  #include <linux/average.h>
>  #include <linux/filter.h>
> +#include <linux/netdevice.h>
> +#include <linux/pci.h>
>  #include <net/route.h>
>  #include <net/xdp.h>
>
> @@ -206,6 +208,9 @@ struct virtnet_info {
>         u32 speed;
>
>         unsigned long guest_offloads;
> +
> +       /* upper netdev created when BACKUP feature enabled */
> +       struct net_device *bypass_netdev;
>  };
>
>  struct padded_vnet_hdr {
> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>         }
>  }
>
> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> +                                     size_t len)
> +{
> +       struct virtnet_info *vi = netdev_priv(dev);
> +       int ret;
> +
> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> +               return -EOPNOTSUPP;
> +
> +       ret = snprintf(buf, len, "_bkup");
> +       if (ret >= len)
> +               return -EOPNOTSUPP;
> +
> +       return 0;
> +}
> +
>  static const struct net_device_ops virtnet_netdev = {
>         .ndo_open            = virtnet_open,
>         .ndo_stop            = virtnet_close,
> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>         .ndo_xdp_xmit           = virtnet_xdp_xmit,
>         .ndo_xdp_flush          = virtnet_xdp_flush,
>         .ndo_features_check     = passthru_features_check,
> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>  };
>
>  static void virtnet_config_changed_work(struct work_struct *work)
> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>         return 0;
>  }
>
> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
> + * is created that acts as a master device and tracks the state of the
> + * 2 lower netdevs. The original virtio_net netdev is registered as
> + * 'backup' netdev and a passthru device with the same MAC is registered
> + * as 'active' netdev.
> + */
> +
> +/* bypass state maintained when BACKUP feature is enabled */
> +struct virtnet_bypass_info {
> +       /* passthru netdev with same MAC */
> +       struct net_device __rcu *active_netdev;
> +
> +       /* virtio_net netdev */
> +       struct net_device __rcu *backup_netdev;
> +
> +       /* active netdev stats */
> +       struct rtnl_link_stats64 active_stats;
> +
> +       /* backup netdev stats */
> +       struct rtnl_link_stats64 backup_stats;
> +
> +       /* aggregated stats */
> +       struct rtnl_link_stats64 bypass_stats;
> +
> +       /* spinlock while updating stats */
> +       spinlock_t stats_lock;
> +};
> +
> +static void virtnet_bypass_child_open(struct net_device *dev,
> +                                     struct net_device *child_netdev)
> +{
> +       int err = dev_open(child_netdev);
> +
> +       if (err)
> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
> +                           child_netdev->name, err);
> +}

Why ignoring the error?? i.e. virtnet_bypass_child_open should have
return value. I don't believe the caller is in a safe context to
guarantee the dev_open always succeeds.

> +
> +static int virtnet_bypass_open(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       netif_carrier_off(dev);
> +       netif_tx_wake_all_queues(dev);
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_child_open(dev, child_netdev);

Handle the error?

> +
> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_child_open(dev, child_netdev);
Handle the error and unwind?

> +
> +       return 0;
> +}
> +
> +static int virtnet_bypass_close(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       netif_tx_disable(dev);
> +
> +       child_netdev = rtnl_dereference(vi->active_netdev);
> +       if (child_netdev)
> +               dev_close(child_netdev);
> +
> +       child_netdev = rtnl_dereference(vi->backup_netdev);
> +       if (child_netdev)
> +               dev_close(child_netdev);
> +
> +       return 0;
> +}
> +
> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb,
> +                                           struct net_device *dev)
> +{
> +       atomic_long_inc(&dev->tx_dropped);
> +       dev_kfree_skb_any(skb);
> +       return NETDEV_TX_OK;
> +}
> +
> +static bool virtnet_bypass_xmit_ready(struct net_device *dev)
> +{
> +       return netif_running(dev) && netif_carrier_ok(dev);
> +}
> +
> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb,
> +                                            struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *xmit_dev;
> +
> +       /* Try xmit via active netdev followed by backup netdev */
> +       xmit_dev = rcu_dereference_bh(vbi->active_netdev);
> +       if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
> +               xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
> +               if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
> +                       return virtnet_bypass_drop_xmit(skb, dev);
> +       }
> +
> +       skb->dev = xmit_dev;
> +       skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> +       return dev_queue_xmit(skb);
> +}
> +
> +static u16 virtnet_bypass_select_queue(struct net_device *dev,
> +                                      struct sk_buff *skb, void *accel_priv,
> +                                      select_queue_fallback_t fallback)
> +{
> +       /* This helper function exists to help dev_pick_tx get the correct
> +        * destination queue.  Using a helper function skips a call to
> +        * skb_tx_hash and will put the skbs in the queue we expect on their
> +        * way down to the bonding driver.
> +        */
> +       u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
> +
> +       /* Save the original txq to restore before passing to the driver */
> +       qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
> +
> +       if (unlikely(txq >= dev->real_num_tx_queues)) {
> +               do {
> +                       txq -= dev->real_num_tx_queues;
> +               } while (txq >= dev->real_num_tx_queues);
> +       }
> +
> +       return txq;
> +}
> +
> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
> + * that some drivers can provide 32bit values only.
> + */
> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
> +                                     const struct rtnl_link_stats64 *_new,
> +                                     const struct rtnl_link_stats64 *_old)
> +{
> +       const u64 *new = (const u64 *)_new;
> +       const u64 *old = (const u64 *)_old;
> +       u64 *res = (u64 *)_res;
> +       int i;
> +
> +       for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
> +               u64 nv = new[i];
> +               u64 ov = old[i];
> +               s64 delta = nv - ov;
> +
> +               /* detects if this particular field is 32bit only */
> +               if (((nv | ov) >> 32) == 0)
> +                       delta = (s64)(s32)((u32)nv - (u32)ov);
> +
> +               /* filter anomalies, some drivers reset their stats
> +                * at down/up events.
> +                */
> +               if (delta > 0)
> +                       res[i] += delta;
> +       }
> +}
> +
> +static void virtnet_bypass_get_stats(struct net_device *dev,
> +                                    struct rtnl_link_stats64 *stats)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       const struct rtnl_link_stats64 *new;
> +       struct rtnl_link_stats64 temp;
> +       struct net_device *child_netdev;
> +
> +       spin_lock(&vbi->stats_lock);
> +       memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
> +
> +       rcu_read_lock();
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               new = dev_get_stats(child_netdev, &temp);
> +               virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
> +               memcpy(&vbi->active_stats, new, sizeof(*new));
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               new = dev_get_stats(child_netdev, &temp);
> +               virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
> +               memcpy(&vbi->backup_stats, new, sizeof(*new));
> +       }
> +
> +       rcu_read_unlock();
> +
> +       memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
> +       spin_unlock(&vbi->stats_lock);
> +}
> +
> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +       int ret = 0;
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               ret = dev_set_mtu(child_netdev, new_mtu);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               ret = dev_set_mtu(child_netdev, new_mtu);
> +               if (ret)
> +                       netdev_err(child_netdev,
> +                                  "Unexpected failure to set mtu to %d\n",
> +                                  new_mtu);
> +       }
> +
> +       dev->mtu = new_mtu;
> +       return 0;
> +}
> +
> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       rcu_read_lock();
> +
> +       child_netdev = rcu_dereference(vbi->active_netdev);
> +       if (child_netdev) {
> +               dev_uc_sync_multiple(child_netdev, dev);
> +               dev_mc_sync_multiple(child_netdev, dev);
> +       }
> +
> +       child_netdev = rcu_dereference(vbi->backup_netdev);
> +       if (child_netdev) {
> +               dev_uc_sync_multiple(child_netdev, dev);
> +               dev_mc_sync_multiple(child_netdev, dev);
> +       }
> +
> +       rcu_read_unlock();
> +}
> +
> +static const struct net_device_ops virtnet_bypass_netdev_ops = {
> +       .ndo_open               = virtnet_bypass_open,
> +       .ndo_stop               = virtnet_bypass_close,
> +       .ndo_start_xmit         = virtnet_bypass_start_xmit,
> +       .ndo_select_queue       = virtnet_bypass_select_queue,
> +       .ndo_get_stats64        = virtnet_bypass_get_stats,
> +       .ndo_change_mtu         = virtnet_bypass_change_mtu,
> +       .ndo_set_rx_mode        = virtnet_bypass_set_rx_mode,
> +       .ndo_validate_addr      = eth_validate_addr,
> +       .ndo_features_check     = passthru_features_check,
> +};
> +
> +static int
> +virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
> +                                         struct ethtool_link_ksettings *cmd)
> +{
> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +       struct net_device *child_netdev;
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
> +               child_netdev = rtnl_dereference(vbi->backup_netdev);
> +               if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
> +                       cmd->base.duplex = DUPLEX_UNKNOWN;
> +                       cmd->base.port = PORT_OTHER;
> +                       cmd->base.speed = SPEED_UNKNOWN;
> +
> +                       return 0;
> +               }
> +       }
> +
> +       return __ethtool_get_link_ksettings(child_netdev, cmd);
> +}
> +
> +#define BYPASS_DRV_NAME "virtnet_bypass"
> +#define BYPASS_DRV_VERSION "0.1"
> +
> +static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
> +                                              struct ethtool_drvinfo *drvinfo)
> +{
> +       strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
> +       strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
> +}
> +
> +static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
> +       .get_drvinfo            = virtnet_bypass_ethtool_get_drvinfo,
> +       .get_link               = ethtool_op_get_link,
> +       .get_link_ksettings     = virtnet_bypass_ethtool_get_link_ksettings,
> +};
> +
> +static struct net_device *get_virtnet_bypass_bymac(struct net *net,
> +                                                  const u8 *mac)
> +{
> +       struct net_device *dev;
> +
> +       ASSERT_RTNL();
> +
> +       for_each_netdev(net, dev) {
> +               if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +                       continue;       /* not a virtnet_bypass device */
> +
> +               if (ether_addr_equal(mac, dev->perm_addr))
> +                       return dev;
> +       }
> +
> +       return NULL;
> +}
> +
> +static struct net_device *
> +get_virtnet_bypass_byref(struct net_device *child_netdev)
> +{
> +       struct net *net = dev_net(child_netdev);
> +       struct net_device *dev;
> +
> +       ASSERT_RTNL();
> +
> +       for_each_netdev(net, dev) {
> +               struct virtnet_bypass_info *vbi;
> +
> +               if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +                       continue;       /* not a virtnet_bypass device */
> +
> +               vbi = netdev_priv(dev);
> +
> +               if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
> +                   (rtnl_dereference(vbi->backup_netdev) == child_netdev))
> +                       return dev;     /* a match */
> +       }
> +
> +       return NULL;
> +}
> +
> +/* Called when child dev is injecting data into network stack.
> + * Change the associated network device from lower dev to virtio.
> + * note: already called with rcu_read_lock
> + */
> +static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
> +{
> +       struct sk_buff *skb = *pskb;
> +       struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
> +
> +       skb->dev = ndev;
> +
> +       return RX_HANDLER_ANOTHER;
> +}
> +
> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev;
> +       bool backup;
> +       int ret;
> +
> +       if (child_netdev->addr_len != ETH_ALEN)
> +               return NOTIFY_DONE;
> +
> +       /* We will use the MAC address to locate the virtnet_bypass netdev
> +        * to associate with the child netdev. If we don't find a matching
> +        * bypass netdev, move on.
> +        */
> +       dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
> +                                      child_netdev->perm_addr);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +       backup = (child_netdev->dev.parent == dev->dev.parent);
> +       if (backup ? rtnl_dereference(vbi->backup_netdev) :
> +                       rtnl_dereference(vbi->active_netdev)) {
> +               netdev_info(dev,
> +                           "%s attempting to join bypass dev when %s already present\n",
> +                           child_netdev->name, backup ? "backup" : "active");
> +               return NOTIFY_DONE;
> +       }
> +
> +       /* Avoid non pci devices as active netdev */
> +       if (!backup && (!child_netdev->dev.parent ||
> +                       !dev_is_pci(child_netdev->dev.parent)))
> +               return NOTIFY_DONE;
> +

There's a problem here in terms of error (particularly on the active
slave, e.g. VF), see below:

> +       ret = netdev_rx_handler_register(child_netdev,
> +                                        virtnet_bypass_handle_frame, dev);
> +       if (ret != 0) {
> +               netdev_err(child_netdev,
> +                          "can not register bypass receive handler (err = %d)\n",
> +                          ret);
> +               goto rx_handler_failed;
> +       }
> +
> +       ret = netdev_upper_dev_link(child_netdev, dev, NULL);
> +       if (ret != 0) {
> +               netdev_err(child_netdev,
> +                          "can not set master device %s (err = %d)\n",
> +                          dev->name, ret);
> +               goto upper_link_failed;
> +       }
> +
> +       child_netdev->flags |= IFF_SLAVE;
> +
> +       if (netif_running(dev)) {
> +               ret = dev_open(child_netdev);
> +               if (ret && (ret != -EBUSY)) {
> +                       netdev_err(dev, "Opening child %s failed ret:%d\n",
> +                                  child_netdev->name, ret);
> +                       goto err_interface_up;
> +               }
> +       }
> +
> +       /* Align MTU of child with master */
> +       ret = dev_set_mtu(child_netdev, dev->mtu);
> +       if (ret) {
> +               netdev_err(dev,
> +                          "unable to change mtu of %s to %u register failed\n",
> +                          child_netdev->name, dev->mtu);
> +               goto err_set_mtu;
> +       }

If any of the above calls returns non-zero, the current code steps
back to undo what's being done on that spefic slave previously. For
instance, if the netdev_rx_handler_register returns non-zero because
of busy rx handler, this register_child function would give up
enslaving the VF and leave the upper virtio_bypass interface behind
once it returns. I am not sure if it's a good idea to leave the
virtio_bypass around if running into failure: the guest is not
migratable as the VF doesn't have a backup path, And perhaps the worse
part is that, it now has two interfaces with identical MAC address but
one of them is invalid (user cannot use the virtio interface as it has
a dampened datapath). IMHO the virtio_bypass and its lower netdev
should be destroyed at all when it fails to bind the VF, and
technically, there should be some way to propogate the failure status
to the hypervisor/backend, indicating that the VM is not migratable
because of guest software errors (maybe by clearing out the backup
feature from the guest virtio driver so host can see/learn it).

Thanks,
-Siwei

> +
> +       call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
> +
> +       netdev_info(dev, "registering %s\n", child_netdev->name);
> +
> +       dev_hold(child_netdev);
> +       if (backup) {
> +               rcu_assign_pointer(vbi->backup_netdev, child_netdev);
> +               dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
> +       } else {
> +               rcu_assign_pointer(vbi->active_netdev, child_netdev);
> +               dev_get_stats(vbi->active_netdev, &vbi->active_stats);
> +               dev->min_mtu = child_netdev->min_mtu;
> +               dev->max_mtu = child_netdev->max_mtu;
> +       }
> +
> +       return NOTIFY_OK;
> +
> +err_set_mtu:
> +       dev_close(child_netdev);
> +err_interface_up:
> +       netdev_upper_dev_unlink(child_netdev, dev);
> +       child_netdev->flags &= ~IFF_SLAVE;
> +upper_link_failed:
> +       netdev_rx_handler_unregister(child_netdev);
> +rx_handler_failed:
> +       return NOTIFY_DONE;
> +}
> +
> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev, *backup;
> +
> +       dev = get_virtnet_bypass_byref(child_netdev);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +
> +       netdev_info(dev, "unregistering %s\n", child_netdev->name);
> +
> +       netdev_rx_handler_unregister(child_netdev);
> +       netdev_upper_dev_unlink(child_netdev, dev);
> +       child_netdev->flags &= ~IFF_SLAVE;
> +
> +       if (child_netdev->dev.parent == dev->dev.parent) {
> +               RCU_INIT_POINTER(vbi->backup_netdev, NULL);
> +       } else {
> +               RCU_INIT_POINTER(vbi->active_netdev, NULL);
> +               backup = rtnl_dereference(vbi->backup_netdev);
> +               if (backup) {
> +                       dev->min_mtu = backup->min_mtu;
> +                       dev->max_mtu = backup->max_mtu;
> +               }
> +       }
> +
> +       dev_put(child_netdev);
> +
> +       return NOTIFY_OK;
> +}
> +
> +static int virtnet_bypass_update_link(struct net_device *child_netdev)
> +{
> +       struct net_device *dev, *active, *backup;
> +       struct virtnet_bypass_info *vbi;
> +
> +       dev = get_virtnet_bypass_byref(child_netdev);
> +       if (!dev || !netif_running(dev))
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +
> +       active = rtnl_dereference(vbi->active_netdev);
> +       backup = rtnl_dereference(vbi->backup_netdev);
> +
> +       if ((active && virtnet_bypass_xmit_ready(active)) ||
> +           (backup && virtnet_bypass_xmit_ready(backup))) {
> +               netif_carrier_on(dev);
> +               netif_tx_wake_all_queues(dev);
> +       } else {
> +               netif_carrier_off(dev);
> +               netif_tx_stop_all_queues(dev);
> +       }
> +
> +       return NOTIFY_OK;
> +}
> +
> +static int virtnet_bypass_event(struct notifier_block *this,
> +                               unsigned long event, void *ptr)
> +{
> +       struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +       /* Skip our own events */
> +       if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
> +               return NOTIFY_DONE;
> +
> +       /* Avoid non-Ethernet type devices */
> +       if (event_dev->type != ARPHRD_ETHER)
> +               return NOTIFY_DONE;
> +
> +       /* Avoid Vlan dev with same MAC registering as child dev */
> +       if (is_vlan_dev(event_dev))
> +               return NOTIFY_DONE;
> +
> +       /* Avoid Bonding master dev with same MAC registering as child dev */
> +       if ((event_dev->priv_flags & IFF_BONDING) &&
> +           (event_dev->flags & IFF_MASTER))
> +               return NOTIFY_DONE;
> +
> +       switch (event) {
> +       case NETDEV_REGISTER:
> +               return virtnet_bypass_register_child(event_dev);
> +       case NETDEV_UNREGISTER:
> +               return virtnet_bypass_unregister_child(event_dev);
> +       case NETDEV_UP:
> +       case NETDEV_DOWN:
> +       case NETDEV_CHANGE:
> +               return virtnet_bypass_update_link(event_dev);
> +       default:
> +               return NOTIFY_DONE;
> +       }
> +}
> +
> +static struct notifier_block virtnet_bypass_notifier = {
> +       .notifier_call = virtnet_bypass_event,
> +};
> +
> +static int virtnet_bypass_create(struct virtnet_info *vi)
> +{
> +       struct net_device *backup_netdev = vi->dev;
> +       struct device *dev = &vi->vdev->dev;
> +       struct net_device *bypass_netdev;
> +       int res;
> +
> +       /* Alloc at least 2 queues, for now we are going with 16 assuming
> +        * that most devices being bonded won't have too many queues.
> +        */
> +       bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
> +                                         16);
> +       if (!bypass_netdev) {
> +               dev_err(dev, "Unable to allocate bypass_netdev!\n");
> +               return -ENOMEM;
> +       }
> +
> +       dev_net_set(bypass_netdev, dev_net(backup_netdev));
> +       SET_NETDEV_DEV(bypass_netdev, dev);
> +
> +       bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
> +       bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
> +
> +       /* Initialize the device options */
> +       bypass_netdev->flags |= IFF_MASTER;
> +       bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
> +                                    IFF_NO_QUEUE;
> +       bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
> +                                      IFF_TX_SKB_SHARING);
> +
> +       /* don't acquire bypass netdev's netif_tx_lock when transmitting */
> +       bypass_netdev->features |= NETIF_F_LLTX;
> +
> +       /* Don't allow bypass devices to change network namespaces. */
> +       bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
> +
> +       bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
> +                                    NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
> +                                    NETIF_F_HIGHDMA | NETIF_F_LRO;
> +
> +       bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
> +       bypass_netdev->features |= bypass_netdev->hw_features;
> +
> +       /* For now treat bypass netdev as VLAN challenged since we
> +        * cannot assume VLAN functionality with a VF
> +        */
> +       bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
> +
> +       memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
> +              bypass_netdev->addr_len);
> +
> +       bypass_netdev->min_mtu = backup_netdev->min_mtu;
> +       bypass_netdev->max_mtu = backup_netdev->max_mtu;
> +
> +       res = register_netdev(bypass_netdev);
> +       if (res < 0) {
> +               dev_err(dev, "Unable to register bypass_netdev!\n");
> +               free_netdev(bypass_netdev);
> +               return res;
> +       }
> +
> +       netif_carrier_off(bypass_netdev);
> +
> +       vi->bypass_netdev = bypass_netdev;
> +
> +       return 0;
> +}
> +
> +static void virtnet_bypass_destroy(struct virtnet_info *vi)
> +{
> +       struct net_device *bypass_netdev = vi->bypass_netdev;
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *child_netdev;
> +
> +       /* no device found, nothing to free */
> +       if (!bypass_netdev)
> +               return;
> +
> +       vbi = netdev_priv(bypass_netdev);
> +
> +       netif_device_detach(bypass_netdev);
> +
> +       rtnl_lock();
> +
> +       child_netdev = rtnl_dereference(vbi->active_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_unregister_child(child_netdev);
> +
> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
> +       if (child_netdev)
> +               virtnet_bypass_unregister_child(child_netdev);
> +
> +       unregister_netdevice(bypass_netdev);
> +
> +       rtnl_unlock();
> +
> +       free_netdev(bypass_netdev);
> +}
> +
> +/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>         int i, err = -ENOMEM;
> @@ -2797,10 +3466,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>
>         virtnet_init_settings(dev);
>
> +       if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
> +               if (virtnet_bypass_create(vi) != 0)
> +                       goto free_vqs;
> +       }
> +
>         err = register_netdev(dev);
>         if (err) {
>                 pr_debug("virtio_net: registering device failed\n");
> -               goto free_vqs;
> +               goto free_bypass;
>         }
>
>         virtio_device_ready(vdev);
> @@ -2837,6 +3511,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>         vi->vdev->config->reset(vdev);
>
>         unregister_netdev(dev);
> +free_bypass:
> +       virtnet_bypass_destroy(vi);
>  free_vqs:
>         cancel_delayed_work_sync(&vi->refill);
>         free_receive_page_frags(vi);
> @@ -2871,6 +3547,8 @@ static void virtnet_remove(struct virtio_device *vdev)
>
>         unregister_netdev(vi->dev);
>
> +       virtnet_bypass_destroy(vi);
> +
>         remove_vq_common(vi);
>
>         free_netdev(vi->dev);
> @@ -2968,6 +3646,8 @@ static __init int virtio_net_driver_init(void)
>          ret = register_virtio_driver(&virtio_net_driver);
>         if (ret)
>                 goto err_virtio;
> +
> +       register_netdevice_notifier(&virtnet_bypass_notifier);
>         return 0;
>  err_virtio:
>         cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> @@ -2980,6 +3660,7 @@ module_init(virtio_net_driver_init);
>
>  static __exit void virtio_net_driver_exit(void)
>  {
> +       unregister_netdevice_notifier(&virtnet_bypass_notifier);
>         unregister_virtio_driver(&virtio_net_driver);
>         cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
>         cpuhp_remove_multi_state(virtionet_online);
> --
> 2.14.3
>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-12 22:44     ` [virtio-dev] " Siwei Liu
@ 2018-03-14  0:28       ` Samudrala, Sridhar
  -1 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-14  0:28 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On 3/12/2018 3:44 PM, Siwei Liu wrote:
> Apologies, still some comments going. Please see inline.
>
> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> ---
>>   drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index bcd13fe906ca..f2860d86c952 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -30,6 +30,8 @@
>>   #include <linux/cpu.h>
>>   #include <linux/average.h>
>>   #include <linux/filter.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/pci.h>
>>   #include <net/route.h>
>>   #include <net/xdp.h>
>>
>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>          u32 speed;
>>
>>          unsigned long guest_offloads;
>> +
>> +       /* upper netdev created when BACKUP feature enabled */
>> +       struct net_device *bypass_netdev;
>>   };
>>
>>   struct padded_vnet_hdr {
>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>>          }
>>   }
>>
>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> +                                     size_t len)
>> +{
>> +       struct virtnet_info *vi = netdev_priv(dev);
>> +       int ret;
>> +
>> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> +               return -EOPNOTSUPP;
>> +
>> +       ret = snprintf(buf, len, "_bkup");
>> +       if (ret >= len)
>> +               return -EOPNOTSUPP;
>> +
>> +       return 0;
>> +}
>> +
>>   static const struct net_device_ops virtnet_netdev = {
>>          .ndo_open            = virtnet_open,
>>          .ndo_stop            = virtnet_close,
>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>>          .ndo_xdp_xmit           = virtnet_xdp_xmit,
>>          .ndo_xdp_flush          = virtnet_xdp_flush,
>>          .ndo_features_check     = passthru_features_check,
>> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>   };
>>
>>   static void virtnet_config_changed_work(struct work_struct *work)
>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>>          return 0;
>>   }
>>
>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>> + * is created that acts as a master device and tracks the state of the
>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>> + * 'backup' netdev and a passthru device with the same MAC is registered
>> + * as 'active' netdev.
>> + */
>> +
>> +/* bypass state maintained when BACKUP feature is enabled */
>> +struct virtnet_bypass_info {
>> +       /* passthru netdev with same MAC */
>> +       struct net_device __rcu *active_netdev;
>> +
>> +       /* virtio_net netdev */
>> +       struct net_device __rcu *backup_netdev;
>> +
>> +       /* active netdev stats */
>> +       struct rtnl_link_stats64 active_stats;
>> +
>> +       /* backup netdev stats */
>> +       struct rtnl_link_stats64 backup_stats;
>> +
>> +       /* aggregated stats */
>> +       struct rtnl_link_stats64 bypass_stats;
>> +
>> +       /* spinlock while updating stats */
>> +       spinlock_t stats_lock;
>> +};
>> +
>> +static void virtnet_bypass_child_open(struct net_device *dev,
>> +                                     struct net_device *child_netdev)
>> +{
>> +       int err = dev_open(child_netdev);
>> +
>> +       if (err)
>> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
>> +                           child_netdev->name, err);
>> +}
> Why ignoring the error?? i.e. virtnet_bypass_child_open should have
> return value. I don't believe the caller is in a safe context to
> guarantee the dev_open always succeeds.

OK.  Will handle this in the next revision.


>
>> +
>> +static int virtnet_bypass_open(struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +
>> +       netif_carrier_off(dev);
>> +       netif_tx_wake_all_queues(dev);
>> +
>> +       child_netdev = rtnl_dereference(vbi->active_netdev);
>> +       if (child_netdev)
>> +               virtnet_bypass_child_open(dev, child_netdev);
> Handle the error?
>
>> +
>> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
>> +       if (child_netdev)
>> +               virtnet_bypass_child_open(dev, child_netdev);
> Handle the error and unwind?

Sure.

<snip>

> +
> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev;
> +       bool backup;
> +       int ret;
> +
> +       if (child_netdev->addr_len != ETH_ALEN)
> +               return NOTIFY_DONE;
> +
> +       /* We will use the MAC address to locate the virtnet_bypass netdev
> +        * to associate with the child netdev. If we don't find a matching
> +        * bypass netdev, move on.
> +        */
> +       dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
> +                                      child_netdev->perm_addr);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +       backup = (child_netdev->dev.parent == dev->dev.parent);
> +       if (backup ? rtnl_dereference(vbi->backup_netdev) :
> +                       rtnl_dereference(vbi->active_netdev)) {
> +               netdev_info(dev,
> +                           "%s attempting to join bypass dev when %s already present\n",
> +                           child_netdev->name, backup ? "backup" : "active");
> +               return NOTIFY_DONE;
> +       }
> +
> +       /* Avoid non pci devices as active netdev */
> +       if (!backup && (!child_netdev->dev.parent ||
> +                       !dev_is_pci(child_netdev->dev.parent)))
> +               return NOTIFY_DONE;
> +
> There's a problem here in terms of error (particularly on the active
> slave, e.g. VF), see below:
>
>> +       ret = netdev_rx_handler_register(child_netdev,
>> +                                        virtnet_bypass_handle_frame, dev);
>> +       if (ret != 0) {
>> +               netdev_err(child_netdev,
>> +                          "can not register bypass receive handler (err = %d)\n",
>> +                          ret);
>> +               goto rx_handler_failed;
>> +       }
>> +
>> +       ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>> +       if (ret != 0) {
>> +               netdev_err(child_netdev,
>> +                          "can not set master device %s (err = %d)\n",
>> +                          dev->name, ret);
>> +               goto upper_link_failed;
>> +       }
>> +
>> +       child_netdev->flags |= IFF_SLAVE;
>> +
>> +       if (netif_running(dev)) {
>> +               ret = dev_open(child_netdev);
>> +               if (ret && (ret != -EBUSY)) {
>> +                       netdev_err(dev, "Opening child %s failed ret:%d\n",
>> +                                  child_netdev->name, ret);
>> +                       goto err_interface_up;
>> +               }
>> +       }
>> +
>> +       /* Align MTU of child with master */
>> +       ret = dev_set_mtu(child_netdev, dev->mtu);
>> +       if (ret) {
>> +               netdev_err(dev,
>> +                          "unable to change mtu of %s to %u register failed\n",
>> +                          child_netdev->name, dev->mtu);
>> +               goto err_set_mtu;
>> +       }
> If any of the above calls returns non-zero, the current code steps
> back to undo what's being done on that spefic slave previously. For
> instance, if the netdev_rx_handler_register returns non-zero because
> of busy rx handler, this register_child function would give up
> enslaving the VF and leave the upper virtio_bypass interface behind
> once it returns.

virtio_bypass interface is the upper master netdev and it is always present when
BACKUP feature is enabled.
If there is failure with enslaving VF, there will be 2 netdevs,
master virtio_bypass and slave virtio_net and the VM can be migrated.


> I am not sure if it's a good idea to leave the
> virtio_bypass around if running into failure: the guest is not
> migratable as the VF doesn't have a backup path,

Are you talking about a failure when registering backup netdev?  This should not
happen, but i guess we can improve error handling in such scenario.


> And perhaps the worse
> part is that, it now has two interfaces with identical MAC address but
> one of them is invalid (user cannot use the virtio interface as it has
> a dampened datapath). IMHO the virtio_bypass and its lower netdev
> should be destroyed at all when it fails to bind the VF, and
> technically, there should be some way to propogate the failure status
> to the hypervisor/backend, indicating that the VM is not migratable
> because of guest software errors (maybe by clearing out the backup
> feature from the guest virtio driver so host can see/learn it).

In BACKUP mode, user can only use the upper virtio_bypass netdev and that will
always be there. Any failure to enslave VF netdev is not fatal, but i will see
if we can improve the error handling of failure to enslave backup netdev.
Also, i don't think the BACKUP feature bit is negotiable with the host.

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-14  0:28       ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-14  0:28 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On 3/12/2018 3:44 PM, Siwei Liu wrote:
> Apologies, still some comments going. Please see inline.
>
> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> ---
>>   drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index bcd13fe906ca..f2860d86c952 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -30,6 +30,8 @@
>>   #include <linux/cpu.h>
>>   #include <linux/average.h>
>>   #include <linux/filter.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/pci.h>
>>   #include <net/route.h>
>>   #include <net/xdp.h>
>>
>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>          u32 speed;
>>
>>          unsigned long guest_offloads;
>> +
>> +       /* upper netdev created when BACKUP feature enabled */
>> +       struct net_device *bypass_netdev;
>>   };
>>
>>   struct padded_vnet_hdr {
>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>>          }
>>   }
>>
>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> +                                     size_t len)
>> +{
>> +       struct virtnet_info *vi = netdev_priv(dev);
>> +       int ret;
>> +
>> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> +               return -EOPNOTSUPP;
>> +
>> +       ret = snprintf(buf, len, "_bkup");
>> +       if (ret >= len)
>> +               return -EOPNOTSUPP;
>> +
>> +       return 0;
>> +}
>> +
>>   static const struct net_device_ops virtnet_netdev = {
>>          .ndo_open            = virtnet_open,
>>          .ndo_stop            = virtnet_close,
>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
>>          .ndo_xdp_xmit           = virtnet_xdp_xmit,
>>          .ndo_xdp_flush          = virtnet_xdp_flush,
>>          .ndo_features_check     = passthru_features_check,
>> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>   };
>>
>>   static void virtnet_config_changed_work(struct work_struct *work)
>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
>>          return 0;
>>   }
>>
>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>> + * is created that acts as a master device and tracks the state of the
>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>> + * 'backup' netdev and a passthru device with the same MAC is registered
>> + * as 'active' netdev.
>> + */
>> +
>> +/* bypass state maintained when BACKUP feature is enabled */
>> +struct virtnet_bypass_info {
>> +       /* passthru netdev with same MAC */
>> +       struct net_device __rcu *active_netdev;
>> +
>> +       /* virtio_net netdev */
>> +       struct net_device __rcu *backup_netdev;
>> +
>> +       /* active netdev stats */
>> +       struct rtnl_link_stats64 active_stats;
>> +
>> +       /* backup netdev stats */
>> +       struct rtnl_link_stats64 backup_stats;
>> +
>> +       /* aggregated stats */
>> +       struct rtnl_link_stats64 bypass_stats;
>> +
>> +       /* spinlock while updating stats */
>> +       spinlock_t stats_lock;
>> +};
>> +
>> +static void virtnet_bypass_child_open(struct net_device *dev,
>> +                                     struct net_device *child_netdev)
>> +{
>> +       int err = dev_open(child_netdev);
>> +
>> +       if (err)
>> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
>> +                           child_netdev->name, err);
>> +}
> Why ignoring the error?? i.e. virtnet_bypass_child_open should have
> return value. I don't believe the caller is in a safe context to
> guarantee the dev_open always succeeds.

OK.  Will handle this in the next revision.


>
>> +
>> +static int virtnet_bypass_open(struct net_device *dev)
>> +{
>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +       struct net_device *child_netdev;
>> +
>> +       netif_carrier_off(dev);
>> +       netif_tx_wake_all_queues(dev);
>> +
>> +       child_netdev = rtnl_dereference(vbi->active_netdev);
>> +       if (child_netdev)
>> +               virtnet_bypass_child_open(dev, child_netdev);
> Handle the error?
>
>> +
>> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
>> +       if (child_netdev)
>> +               virtnet_bypass_child_open(dev, child_netdev);
> Handle the error and unwind?

Sure.

<snip>

> +
> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
> +{
> +       struct virtnet_bypass_info *vbi;
> +       struct net_device *dev;
> +       bool backup;
> +       int ret;
> +
> +       if (child_netdev->addr_len != ETH_ALEN)
> +               return NOTIFY_DONE;
> +
> +       /* We will use the MAC address to locate the virtnet_bypass netdev
> +        * to associate with the child netdev. If we don't find a matching
> +        * bypass netdev, move on.
> +        */
> +       dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
> +                                      child_netdev->perm_addr);
> +       if (!dev)
> +               return NOTIFY_DONE;
> +
> +       vbi = netdev_priv(dev);
> +       backup = (child_netdev->dev.parent == dev->dev.parent);
> +       if (backup ? rtnl_dereference(vbi->backup_netdev) :
> +                       rtnl_dereference(vbi->active_netdev)) {
> +               netdev_info(dev,
> +                           "%s attempting to join bypass dev when %s already present\n",
> +                           child_netdev->name, backup ? "backup" : "active");
> +               return NOTIFY_DONE;
> +       }
> +
> +       /* Avoid non pci devices as active netdev */
> +       if (!backup && (!child_netdev->dev.parent ||
> +                       !dev_is_pci(child_netdev->dev.parent)))
> +               return NOTIFY_DONE;
> +
> There's a problem here in terms of error (particularly on the active
> slave, e.g. VF), see below:
>
>> +       ret = netdev_rx_handler_register(child_netdev,
>> +                                        virtnet_bypass_handle_frame, dev);
>> +       if (ret != 0) {
>> +               netdev_err(child_netdev,
>> +                          "can not register bypass receive handler (err = %d)\n",
>> +                          ret);
>> +               goto rx_handler_failed;
>> +       }
>> +
>> +       ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>> +       if (ret != 0) {
>> +               netdev_err(child_netdev,
>> +                          "can not set master device %s (err = %d)\n",
>> +                          dev->name, ret);
>> +               goto upper_link_failed;
>> +       }
>> +
>> +       child_netdev->flags |= IFF_SLAVE;
>> +
>> +       if (netif_running(dev)) {
>> +               ret = dev_open(child_netdev);
>> +               if (ret && (ret != -EBUSY)) {
>> +                       netdev_err(dev, "Opening child %s failed ret:%d\n",
>> +                                  child_netdev->name, ret);
>> +                       goto err_interface_up;
>> +               }
>> +       }
>> +
>> +       /* Align MTU of child with master */
>> +       ret = dev_set_mtu(child_netdev, dev->mtu);
>> +       if (ret) {
>> +               netdev_err(dev,
>> +                          "unable to change mtu of %s to %u register failed\n",
>> +                          child_netdev->name, dev->mtu);
>> +               goto err_set_mtu;
>> +       }
> If any of the above calls returns non-zero, the current code steps
> back to undo what's being done on that spefic slave previously. For
> instance, if the netdev_rx_handler_register returns non-zero because
> of busy rx handler, this register_child function would give up
> enslaving the VF and leave the upper virtio_bypass interface behind
> once it returns.

virtio_bypass interface is the upper master netdev and it is always present when
BACKUP feature is enabled.
If there is failure with enslaving VF, there will be 2 netdevs,
master virtio_bypass and slave virtio_net and the VM can be migrated.


> I am not sure if it's a good idea to leave the
> virtio_bypass around if running into failure: the guest is not
> migratable as the VF doesn't have a backup path,

Are you talking about a failure when registering backup netdev?  This should not
happen, but i guess we can improve error handling in such scenario.


> And perhaps the worse
> part is that, it now has two interfaces with identical MAC address but
> one of them is invalid (user cannot use the virtio interface as it has
> a dampened datapath). IMHO the virtio_bypass and its lower netdev
> should be destroyed at all when it fails to bind the VF, and
> technically, there should be some way to propogate the failure status
> to the hypervisor/backend, indicating that the VM is not migratable
> because of guest software errors (maybe by clearing out the backup
> feature from the guest virtio driver so host can see/learn it).

In BACKUP mode, user can only use the upper virtio_bypass netdev and that will
always be there. Any failure to enslave VF netdev is not fatal, but i will see
if we can improve the error handling of failure to enslave backup netdev.
Also, i don't think the BACKUP feature bit is negotiable with the host.

Thanks
Sridhar



---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-12 21:08       ` Jiri Pirko
@ 2018-03-14  0:36           ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-14  0:36 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici

On 3/12/2018 2:08 PM, Jiri Pirko wrote:
> Mon, Mar 12, 2018 at 09:58:06PM CET, sridhar.samudrala@intel.com wrote:
>>
>> On 3/12/2018 1:12 PM, Jiri Pirko wrote:
>>> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>>> netdev is present with the same MAC address. It allows live migration
>>>> of a VM with a direct attached VF without the need to setup a bond/team
>>>> between a VF and virtio net device in the guest.
>>>>
>>>> The hypervisor needs to enable only one datapath at any time so that
>>>> packets don't get looped back to the VM over the other datapath. When a VF
>>>> is plugged, the virtio datapath link state can be marked as down. The
>>>> hypervisor needs to unplug the VF device from the guest on the source host
>>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>>> virtio before starting the migration. After the migration is completed,
>>>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>>>> to the guest to switch over to VF datapath.
>>>>
>>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>>> created that acts as a master device and tracks the state of the 2 lower
>>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>>> passthru device with the same MAC is registered as 'active' netdev.
>>>>
>>>> This patch is based on the discussion initiated by Jesse on this thread.
>>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>> [...]
>>>
>>>
>>>> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>>>> +{
>>>> +	struct virtnet_bypass_info *vbi;
>>>> +	struct net_device *dev;
>>>> +	bool backup;
>>>> +	int ret;
>>>> +
>>>> +	if (child_netdev->addr_len != ETH_ALEN)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	/* We will use the MAC address to locate the virtnet_bypass netdev
>>>> +	 * to associate with the child netdev. If we don't find a matching
>>>> +	 * bypass netdev, move on.
>>>> +	 */
>>>> +	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>>>> +				       child_netdev->perm_addr);
>>>> +	if (!dev)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	vbi = netdev_priv(dev);
>>>> +	backup = (child_netdev->dev.parent == dev->dev.parent);
>>>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>> +			rtnl_dereference(vbi->active_netdev)) {
>>>> +		netdev_info(dev,
>>>> +			    "%s attempting to join bypass dev when %s already present\n",
>>>> +			    child_netdev->name, backup ? "backup" : "active");
>>>> +		return NOTIFY_DONE;
>>>> +	}
>>>> +
>>>> +	/* Avoid non pci devices as active netdev */
>>>> +	if (!backup && (!child_netdev->dev.parent ||
>>>> +			!dev_is_pci(child_netdev->dev.parent)))
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	ret = netdev_rx_handler_register(child_netdev,
>>>> +					 virtnet_bypass_handle_frame, dev);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(child_netdev,
>>>> +			   "can not register bypass receive handler (err = %d)\n",
>>>> +			   ret);
>>>> +		goto rx_handler_failed;
>>>> +	}
>>>> +
>>>> +	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(child_netdev,
>>>> +			   "can not set master device %s (err = %d)\n",
>>>> +			   dev->name, ret);
>>>> +		goto upper_link_failed;
>>>> +	}
>>>> +
>>>> +	child_netdev->flags |= IFF_SLAVE;
>>>> +
>>>> +	if (netif_running(dev)) {
>>>> +		ret = dev_open(child_netdev);
>>>> +		if (ret && (ret != -EBUSY)) {
>>>> +			netdev_err(dev, "Opening child %s failed ret:%d\n",
>>>> +				   child_netdev->name, ret);
>>>> +			goto err_interface_up;
>>>> +		}
>>>> +	}
>>> Much of this function is copy of netvsc_vf_join, should be shared with
>>> netvsc.
>> Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified
>> to handle registration of 2 child netdevs(backup virtio and active vf).  In case
>> of netvsc, they only register VF. There is no backup netdev.
>> I think trying to make this code shareable with netvsc will introduce additional
>> checks and make it convoluted.
> No problem.

Not clear what you meant here?
Want to confirm that you are agreeing that it is OK to not share this function
with netvsc.


>
>
>>
>>>
>>>> +
>>>> +	/* Align MTU of child with master */
>>>> +	ret = dev_set_mtu(child_netdev, dev->mtu);
>>>> +	if (ret) {
>>>> +		netdev_err(dev,
>>>> +			   "unable to change mtu of %s to %u register failed\n",
>>>> +			   child_netdev->name, dev->mtu);
>>>> +		goto err_set_mtu;
>>>> +	}
>>>> +
>>>> +	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>>>> +
>>>> +	netdev_info(dev, "registering %s\n", child_netdev->name);
>>>> +
>>>> +	dev_hold(child_netdev);
>>>> +	if (backup) {
>>>> +		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>>>> +		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>>>> +	} else {
>>>> +		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>>>> +		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>>>> +		dev->min_mtu = child_netdev->min_mtu;
>>>> +		dev->max_mtu = child_netdev->max_mtu;
>>>> +	}
>>>> +
>>>> +	return NOTIFY_OK;
>>>> +
>>>> +err_set_mtu:
>>>> +	dev_close(child_netdev);
>>>> +err_interface_up:
>>>> +	netdev_upper_dev_unlink(child_netdev, dev);
>>>> +	child_netdev->flags &= ~IFF_SLAVE;
>>>> +upper_link_failed:
>>>> +	netdev_rx_handler_unregister(child_netdev);
>>>> +rx_handler_failed:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>>>> +{
>>>> +	struct virtnet_bypass_info *vbi;
>>>> +	struct net_device *dev, *backup;
>>>> +
>>>> +	dev = get_virtnet_bypass_byref(child_netdev);
>>>> +	if (!dev)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	vbi = netdev_priv(dev);
>>>> +
>>>> +	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>>>> +
>>>> +	netdev_rx_handler_unregister(child_netdev);
>>>> +	netdev_upper_dev_unlink(child_netdev, dev);
>>>> +	child_netdev->flags &= ~IFF_SLAVE;
>>>> +
>>>> +	if (child_netdev->dev.parent == dev->dev.parent) {
>>>> +		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>>>> +	} else {
>>>> +		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>>>> +		backup = rtnl_dereference(vbi->backup_netdev);
>>>> +		if (backup) {
>>>> +			dev->min_mtu = backup->min_mtu;
>>>> +			dev->max_mtu = backup->max_mtu;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	dev_put(child_netdev);
>>>> +
>>>> +	return NOTIFY_OK;
>>>> +}
>>>> +
>>>> +static int virtnet_bypass_update_link(struct net_device *child_netdev)
>>>> +{
>>>> +	struct net_device *dev, *active, *backup;
>>>> +	struct virtnet_bypass_info *vbi;
>>>> +
>>>> +	dev = get_virtnet_bypass_byref(child_netdev);
>>>> +	if (!dev || !netif_running(dev))
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	vbi = netdev_priv(dev);
>>>> +
>>>> +	active = rtnl_dereference(vbi->active_netdev);
>>>> +	backup = rtnl_dereference(vbi->backup_netdev);
>>>> +
>>>> +	if ((active && virtnet_bypass_xmit_ready(active)) ||
>>>> +	    (backup && virtnet_bypass_xmit_ready(backup))) {
>>>> +		netif_carrier_on(dev);
>>>> +		netif_tx_wake_all_queues(dev);
>>>> +	} else {
>>>> +		netif_carrier_off(dev);
>>>> +		netif_tx_stop_all_queues(dev);
>>>> +	}
>>>> +
>>>> +	return NOTIFY_OK;
>>>> +}
>>>> +
>>>> +static int virtnet_bypass_event(struct notifier_block *this,
>>>> +				unsigned long event, void *ptr)
>>>> +{
>>>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>>>> +
>>>> +	/* Skip our own events */
>>>> +	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	/* Avoid non-Ethernet type devices */
>>>> +	if (event_dev->type != ARPHRD_ETHER)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	/* Avoid Vlan dev with same MAC registering as child dev */
>>>> +	if (is_vlan_dev(event_dev))
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	/* Avoid Bonding master dev with same MAC registering as child dev */
>>>> +	if ((event_dev->priv_flags & IFF_BONDING) &&
>>>> +	    (event_dev->flags & IFF_MASTER))
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	switch (event) {
>>>> +	case NETDEV_REGISTER:
>>>> +		return virtnet_bypass_register_child(event_dev);
>>>> +	case NETDEV_UNREGISTER:
>>>> +		return virtnet_bypass_unregister_child(event_dev);
>>>> +	case NETDEV_UP:
>>>> +	case NETDEV_DOWN:
>>>> +	case NETDEV_CHANGE:
>>>> +		return virtnet_bypass_update_link(event_dev);
>>>> +	default:
>>>> +		return NOTIFY_DONE;
>>>> +	}
>>>> +}
>>> For example this function is 1:1 copy of netvsc, even with comments
>>> and bugs (like IFF_BODING check).
>>>
>>> This is also something that should be shared with netvsc.
>> The problem is with the calls that are made from this function.
>> Sharing would have been possible if both virtio and netvsc went with
>> same netdev model.
> ops? You can still share.

OK. looks like you want to see atleast some code shared between netvsc and virtio_net
even when they are using 2 different netdev models.
Will try to add a new file under net/core and move some functions that can be shared
between netvsc and virtio_net in BACKUP mode.


>
>
>>>
>>>> +
>>>> +static struct notifier_block virtnet_bypass_notifier = {
>>>> +	.notifier_call = virtnet_bypass_event,
>>>> +};
>>>> +
>>>> +static int virtnet_bypass_create(struct virtnet_info *vi)
>>>> +{
>>>> +	struct net_device *backup_netdev = vi->dev;
>>>> +	struct device *dev = &vi->vdev->dev;
>>>> +	struct net_device *bypass_netdev;
>>>> +	int res;
>>>> +
>>>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>>>> +	 * that most devices being bonded won't have too many queues.
>>>> +	 */
>>>> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>>>> +					  16);
>>>> +	if (!bypass_netdev) {
>>>> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>>>> +		return -ENOMEM;
>>>> +	}
>>>> +
>>>> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>>>> +	SET_NETDEV_DEV(bypass_netdev, dev);
>>>> +
>>>> +	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>>>> +	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>>>> +
>>>> +	/* Initialize the device options */
>>>> +	bypass_netdev->flags |= IFF_MASTER;
>>>> +	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
>>> No clue why you set IFF_BONDING here...
>> This is to indicate to the userspace that this is a master bond device.
> This is not a master bond device! If you say this, it makes me really
> worried....
>
Will remove IFF_BONDING in the next revision.


>
>> May be it is sufficient to just set IFF_MASTER.
>>
>>>
>>>
>>>
>>>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-14  0:36           ` Samudrala, Sridhar
  0 siblings, 0 replies; 102+ messages in thread
From: Samudrala, Sridhar @ 2018-03-14  0:36 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici

On 3/12/2018 2:08 PM, Jiri Pirko wrote:
> Mon, Mar 12, 2018 at 09:58:06PM CET, sridhar.samudrala@intel.com wrote:
>>
>> On 3/12/2018 1:12 PM, Jiri Pirko wrote:
>>> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>>> netdev is present with the same MAC address. It allows live migration
>>>> of a VM with a direct attached VF without the need to setup a bond/team
>>>> between a VF and virtio net device in the guest.
>>>>
>>>> The hypervisor needs to enable only one datapath at any time so that
>>>> packets don't get looped back to the VM over the other datapath. When a VF
>>>> is plugged, the virtio datapath link state can be marked as down. The
>>>> hypervisor needs to unplug the VF device from the guest on the source host
>>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>>> virtio before starting the migration. After the migration is completed,
>>>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>>>> to the guest to switch over to VF datapath.
>>>>
>>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>>> created that acts as a master device and tracks the state of the 2 lower
>>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>>> passthru device with the same MAC is registered as 'active' netdev.
>>>>
>>>> This patch is based on the discussion initiated by Jesse on this thread.
>>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>> [...]
>>>
>>>
>>>> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>>>> +{
>>>> +	struct virtnet_bypass_info *vbi;
>>>> +	struct net_device *dev;
>>>> +	bool backup;
>>>> +	int ret;
>>>> +
>>>> +	if (child_netdev->addr_len != ETH_ALEN)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	/* We will use the MAC address to locate the virtnet_bypass netdev
>>>> +	 * to associate with the child netdev. If we don't find a matching
>>>> +	 * bypass netdev, move on.
>>>> +	 */
>>>> +	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>>>> +				       child_netdev->perm_addr);
>>>> +	if (!dev)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	vbi = netdev_priv(dev);
>>>> +	backup = (child_netdev->dev.parent == dev->dev.parent);
>>>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>> +			rtnl_dereference(vbi->active_netdev)) {
>>>> +		netdev_info(dev,
>>>> +			    "%s attempting to join bypass dev when %s already present\n",
>>>> +			    child_netdev->name, backup ? "backup" : "active");
>>>> +		return NOTIFY_DONE;
>>>> +	}
>>>> +
>>>> +	/* Avoid non pci devices as active netdev */
>>>> +	if (!backup && (!child_netdev->dev.parent ||
>>>> +			!dev_is_pci(child_netdev->dev.parent)))
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	ret = netdev_rx_handler_register(child_netdev,
>>>> +					 virtnet_bypass_handle_frame, dev);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(child_netdev,
>>>> +			   "can not register bypass receive handler (err = %d)\n",
>>>> +			   ret);
>>>> +		goto rx_handler_failed;
>>>> +	}
>>>> +
>>>> +	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(child_netdev,
>>>> +			   "can not set master device %s (err = %d)\n",
>>>> +			   dev->name, ret);
>>>> +		goto upper_link_failed;
>>>> +	}
>>>> +
>>>> +	child_netdev->flags |= IFF_SLAVE;
>>>> +
>>>> +	if (netif_running(dev)) {
>>>> +		ret = dev_open(child_netdev);
>>>> +		if (ret && (ret != -EBUSY)) {
>>>> +			netdev_err(dev, "Opening child %s failed ret:%d\n",
>>>> +				   child_netdev->name, ret);
>>>> +			goto err_interface_up;
>>>> +		}
>>>> +	}
>>> Much of this function is copy of netvsc_vf_join, should be shared with
>>> netvsc.
>> Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified
>> to handle registration of 2 child netdevs(backup virtio and active vf).  In case
>> of netvsc, they only register VF. There is no backup netdev.
>> I think trying to make this code shareable with netvsc will introduce additional
>> checks and make it convoluted.
> No problem.

Not clear what you meant here?
Want to confirm that you are agreeing that it is OK to not share this function
with netvsc.


>
>
>>
>>>
>>>> +
>>>> +	/* Align MTU of child with master */
>>>> +	ret = dev_set_mtu(child_netdev, dev->mtu);
>>>> +	if (ret) {
>>>> +		netdev_err(dev,
>>>> +			   "unable to change mtu of %s to %u register failed\n",
>>>> +			   child_netdev->name, dev->mtu);
>>>> +		goto err_set_mtu;
>>>> +	}
>>>> +
>>>> +	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>>>> +
>>>> +	netdev_info(dev, "registering %s\n", child_netdev->name);
>>>> +
>>>> +	dev_hold(child_netdev);
>>>> +	if (backup) {
>>>> +		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>>>> +		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>>>> +	} else {
>>>> +		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>>>> +		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>>>> +		dev->min_mtu = child_netdev->min_mtu;
>>>> +		dev->max_mtu = child_netdev->max_mtu;
>>>> +	}
>>>> +
>>>> +	return NOTIFY_OK;
>>>> +
>>>> +err_set_mtu:
>>>> +	dev_close(child_netdev);
>>>> +err_interface_up:
>>>> +	netdev_upper_dev_unlink(child_netdev, dev);
>>>> +	child_netdev->flags &= ~IFF_SLAVE;
>>>> +upper_link_failed:
>>>> +	netdev_rx_handler_unregister(child_netdev);
>>>> +rx_handler_failed:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>>>> +{
>>>> +	struct virtnet_bypass_info *vbi;
>>>> +	struct net_device *dev, *backup;
>>>> +
>>>> +	dev = get_virtnet_bypass_byref(child_netdev);
>>>> +	if (!dev)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	vbi = netdev_priv(dev);
>>>> +
>>>> +	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>>>> +
>>>> +	netdev_rx_handler_unregister(child_netdev);
>>>> +	netdev_upper_dev_unlink(child_netdev, dev);
>>>> +	child_netdev->flags &= ~IFF_SLAVE;
>>>> +
>>>> +	if (child_netdev->dev.parent == dev->dev.parent) {
>>>> +		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>>>> +	} else {
>>>> +		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>>>> +		backup = rtnl_dereference(vbi->backup_netdev);
>>>> +		if (backup) {
>>>> +			dev->min_mtu = backup->min_mtu;
>>>> +			dev->max_mtu = backup->max_mtu;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	dev_put(child_netdev);
>>>> +
>>>> +	return NOTIFY_OK;
>>>> +}
>>>> +
>>>> +static int virtnet_bypass_update_link(struct net_device *child_netdev)
>>>> +{
>>>> +	struct net_device *dev, *active, *backup;
>>>> +	struct virtnet_bypass_info *vbi;
>>>> +
>>>> +	dev = get_virtnet_bypass_byref(child_netdev);
>>>> +	if (!dev || !netif_running(dev))
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	vbi = netdev_priv(dev);
>>>> +
>>>> +	active = rtnl_dereference(vbi->active_netdev);
>>>> +	backup = rtnl_dereference(vbi->backup_netdev);
>>>> +
>>>> +	if ((active && virtnet_bypass_xmit_ready(active)) ||
>>>> +	    (backup && virtnet_bypass_xmit_ready(backup))) {
>>>> +		netif_carrier_on(dev);
>>>> +		netif_tx_wake_all_queues(dev);
>>>> +	} else {
>>>> +		netif_carrier_off(dev);
>>>> +		netif_tx_stop_all_queues(dev);
>>>> +	}
>>>> +
>>>> +	return NOTIFY_OK;
>>>> +}
>>>> +
>>>> +static int virtnet_bypass_event(struct notifier_block *this,
>>>> +				unsigned long event, void *ptr)
>>>> +{
>>>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>>>> +
>>>> +	/* Skip our own events */
>>>> +	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	/* Avoid non-Ethernet type devices */
>>>> +	if (event_dev->type != ARPHRD_ETHER)
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	/* Avoid Vlan dev with same MAC registering as child dev */
>>>> +	if (is_vlan_dev(event_dev))
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	/* Avoid Bonding master dev with same MAC registering as child dev */
>>>> +	if ((event_dev->priv_flags & IFF_BONDING) &&
>>>> +	    (event_dev->flags & IFF_MASTER))
>>>> +		return NOTIFY_DONE;
>>>> +
>>>> +	switch (event) {
>>>> +	case NETDEV_REGISTER:
>>>> +		return virtnet_bypass_register_child(event_dev);
>>>> +	case NETDEV_UNREGISTER:
>>>> +		return virtnet_bypass_unregister_child(event_dev);
>>>> +	case NETDEV_UP:
>>>> +	case NETDEV_DOWN:
>>>> +	case NETDEV_CHANGE:
>>>> +		return virtnet_bypass_update_link(event_dev);
>>>> +	default:
>>>> +		return NOTIFY_DONE;
>>>> +	}
>>>> +}
>>> For example this function is 1:1 copy of netvsc, even with comments
>>> and bugs (like IFF_BODING check).
>>>
>>> This is also something that should be shared with netvsc.
>> The problem is with the calls that are made from this function.
>> Sharing would have been possible if both virtio and netvsc went with
>> same netdev model.
> ops? You can still share.

OK. looks like you want to see atleast some code shared between netvsc and virtio_net
even when they are using 2 different netdev models.
Will try to add a new file under net/core and move some functions that can be shared
between netvsc and virtio_net in BACKUP mode.


>
>
>>>
>>>> +
>>>> +static struct notifier_block virtnet_bypass_notifier = {
>>>> +	.notifier_call = virtnet_bypass_event,
>>>> +};
>>>> +
>>>> +static int virtnet_bypass_create(struct virtnet_info *vi)
>>>> +{
>>>> +	struct net_device *backup_netdev = vi->dev;
>>>> +	struct device *dev = &vi->vdev->dev;
>>>> +	struct net_device *bypass_netdev;
>>>> +	int res;
>>>> +
>>>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>>>> +	 * that most devices being bonded won't have too many queues.
>>>> +	 */
>>>> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>>>> +					  16);
>>>> +	if (!bypass_netdev) {
>>>> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>>>> +		return -ENOMEM;
>>>> +	}
>>>> +
>>>> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>>>> +	SET_NETDEV_DEV(bypass_netdev, dev);
>>>> +
>>>> +	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>>>> +	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>>>> +
>>>> +	/* Initialize the device options */
>>>> +	bypass_netdev->flags |= IFF_MASTER;
>>>> +	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
>>> No clue why you set IFF_BONDING here...
>> This is to indicate to the userspace that this is a master bond device.
> This is not a master bond device! If you say this, it makes me really
> worried....
>
Will remove IFF_BONDING in the next revision.


>
>> May be it is sufficient to just set IFF_MASTER.
>>
>>>
>>>
>>>
>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-14  0:28       ` [virtio-dev] " Samudrala, Sridhar
@ 2018-03-14  0:44         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-14  0:44 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Siwei Liu, Stephen Hemminger, David Miller, Netdev, Jiri Pirko,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jakub Kicinski

On Tue, Mar 13, 2018 at 05:28:07PM -0700, Samudrala, Sridhar wrote:
> > I am not sure if it's a good idea to leave the
> > virtio_bypass around if running into failure: the guest is not
> > migratable as the VF doesn't have a backup path,
> 
> Are you talking about a failure when registering backup netdev?  This should not
> happen, but i guess we can improve error handling in such scenario.

A nice way to do this would be to clear the backup feature bit.

> 
> > And perhaps the worse
> > part is that, it now has two interfaces with identical MAC address but
> > one of them is invalid (user cannot use the virtio interface as it has
> > a dampened datapath). IMHO the virtio_bypass and its lower netdev
> > should be destroyed at all when it fails to bind the VF, and
> > technically, there should be some way to propogate the failure status
> > to the hypervisor/backend, indicating that the VM is not migratable
> > because of guest software errors (maybe by clearing out the backup
> > feature from the guest virtio driver so host can see/learn it).
> 
> In BACKUP mode, user can only use the upper virtio_bypass netdev and that will
> always be there. Any failure to enslave VF netdev is not fatal, but i will see
> if we can improve the error handling of failure to enslave backup netdev.
> Also, i don't think the BACKUP feature bit is negotiable with the host.
> 
> Thanks
> Sridhar

All bits are negotiable.  It's up to the host whether to support
a device with this bit clear, or to fail negotiation.

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-14  0:44         ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2018-03-14  0:44 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Siwei Liu, Stephen Hemminger, David Miller, Netdev, Jiri Pirko,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jakub Kicinski

On Tue, Mar 13, 2018 at 05:28:07PM -0700, Samudrala, Sridhar wrote:
> > I am not sure if it's a good idea to leave the
> > virtio_bypass around if running into failure: the guest is not
> > migratable as the VF doesn't have a backup path,
> 
> Are you talking about a failure when registering backup netdev?  This should not
> happen, but i guess we can improve error handling in such scenario.

A nice way to do this would be to clear the backup feature bit.

> 
> > And perhaps the worse
> > part is that, it now has two interfaces with identical MAC address but
> > one of them is invalid (user cannot use the virtio interface as it has
> > a dampened datapath). IMHO the virtio_bypass and its lower netdev
> > should be destroyed at all when it fails to bind the VF, and
> > technically, there should be some way to propogate the failure status
> > to the hypervisor/backend, indicating that the VM is not migratable
> > because of guest software errors (maybe by clearing out the backup
> > feature from the guest virtio driver so host can see/learn it).
> 
> In BACKUP mode, user can only use the upper virtio_bypass netdev and that will
> always be there. Any failure to enslave VF netdev is not fatal, but i will see
> if we can improve the error handling of failure to enslave backup netdev.
> Also, i don't think the BACKUP feature bit is negotiable with the host.
> 
> Thanks
> Sridhar

All bits are negotiable.  It's up to the host whether to support
a device with this bit clear, or to fail negotiation.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-14  0:36           ` [virtio-dev] " Samudrala, Sridhar
  (?)
@ 2018-03-14  0:54           ` Stephen Hemminger
  -1 siblings, 0 replies; 102+ messages in thread
From: Stephen Hemminger @ 2018-03-14  0:54 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jiri Pirko, mst, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici

On Tue, 13 Mar 2018 17:36:23 -0700
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

> OK. looks like you want to see atleast some code shared between netvsc and virtio_net
> even when they are using 2 different netdev models.
> Will try to add a new file under net/core and move some functions that can be shared
> between netvsc and virtio_net in BACKUP mode.

Make it work first (on both) then workout how to share.
That method worked quite well for tunnels and other netdev like objects.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-14  0:28       ` [virtio-dev] " Samudrala, Sridhar
@ 2018-03-14  4:50         ` Siwei Liu
  -1 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-14  4:50 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Tue, Mar 13, 2018 at 5:28 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 3/12/2018 3:44 PM, Siwei Liu wrote:
>>
>> Apologies, still some comments going. Please see inline.
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudrala@intel.com> wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF
>>> is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host
>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>> virtio before starting the migration. After the migration is completed,
>>> the destination hypervisor sets the MAC filter on the VF and plugs it
>>> back
>>> to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and
>>> a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>> ---
>>>   drivers/net/virtio_net.c | 683
>>> ++++++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>>   #include <linux/cpu.h>
>>>   #include <linux/average.h>
>>>   #include <linux/filter.h>
>>> +#include <linux/netdevice.h>
>>> +#include <linux/pci.h>
>>>   #include <net/route.h>
>>>   #include <net/xdp.h>
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>>          u32 speed;
>>>
>>>          unsigned long guest_offloads;
>>> +
>>> +       /* upper netdev created when BACKUP feature enabled */
>>> +       struct net_device *bypass_netdev;
>>>   };
>>>
>>>   struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev,
>>> struct netdev_bpf *xdp)
>>>          }
>>>   }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> +                                     size_t len)
>>> +{
>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>> +       int ret;
>>> +
>>> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       ret = snprintf(buf, len, "_bkup");
>>> +       if (ret >= len)
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       return 0;
>>> +}
>>> +
>>>   static const struct net_device_ops virtnet_netdev = {
>>>          .ndo_open            = virtnet_open,
>>>          .ndo_stop            = virtnet_close,
>>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev =
>>> {
>>>          .ndo_xdp_xmit           = virtnet_xdp_xmit,
>>>          .ndo_xdp_flush          = virtnet_xdp_flush,
>>>          .ndo_features_check     = passthru_features_check,
>>> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>>   };
>>>
>>>   static void virtnet_config_changed_work(struct work_struct *work)
>>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device
>>> *vdev)
>>>          return 0;
>>>   }
>>>
>>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>>> + * is created that acts as a master device and tracks the state of the
>>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>>> + * 'backup' netdev and a passthru device with the same MAC is registered
>>> + * as 'active' netdev.
>>> + */
>>> +
>>> +/* bypass state maintained when BACKUP feature is enabled */
>>> +struct virtnet_bypass_info {
>>> +       /* passthru netdev with same MAC */
>>> +       struct net_device __rcu *active_netdev;
>>> +
>>> +       /* virtio_net netdev */
>>> +       struct net_device __rcu *backup_netdev;
>>> +
>>> +       /* active netdev stats */
>>> +       struct rtnl_link_stats64 active_stats;
>>> +
>>> +       /* backup netdev stats */
>>> +       struct rtnl_link_stats64 backup_stats;
>>> +
>>> +       /* aggregated stats */
>>> +       struct rtnl_link_stats64 bypass_stats;
>>> +
>>> +       /* spinlock while updating stats */
>>> +       spinlock_t stats_lock;
>>> +};
>>> +
>>> +static void virtnet_bypass_child_open(struct net_device *dev,
>>> +                                     struct net_device *child_netdev)
>>> +{
>>> +       int err = dev_open(child_netdev);
>>> +
>>> +       if (err)
>>> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
>>> +                           child_netdev->name, err);
>>> +}
>>
>> Why ignoring the error?? i.e. virtnet_bypass_child_open should have
>> return value. I don't believe the caller is in a safe context to
>> guarantee the dev_open always succeeds.
>
>
> OK.  Will handle this in the next revision.
>
>
>>
>>> +
>>> +static int virtnet_bypass_open(struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +
>>> +       netif_carrier_off(dev);
>>> +       netif_tx_wake_all_queues(dev);
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->active_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>
>> Handle the error?
>>
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>
>> Handle the error and unwind?
>
>
> Sure.
>
> <snip>
>
>
>> +
>> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>> +{
>> +       struct virtnet_bypass_info *vbi;
>> +       struct net_device *dev;
>> +       bool backup;
>> +       int ret;
>> +
>> +       if (child_netdev->addr_len != ETH_ALEN)
>> +               return NOTIFY_DONE;
>> +
>> +       /* We will use the MAC address to locate the virtnet_bypass netdev
>> +        * to associate with the child netdev. If we don't find a matching
>> +        * bypass netdev, move on.
>> +        */
>> +       dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>> +                                      child_netdev->perm_addr);
>> +       if (!dev)
>> +               return NOTIFY_DONE;
>> +
>> +       vbi = netdev_priv(dev);
>> +       backup = (child_netdev->dev.parent == dev->dev.parent);
>> +       if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> +                       rtnl_dereference(vbi->active_netdev)) {
>> +               netdev_info(dev,
>> +                           "%s attempting to join bypass dev when %s
>> already present\n",
>> +                           child_netdev->name, backup ? "backup" :
>> "active");
>> +               return NOTIFY_DONE;
>> +       }
>> +
>> +       /* Avoid non pci devices as active netdev */
>> +       if (!backup && (!child_netdev->dev.parent ||
>> +                       !dev_is_pci(child_netdev->dev.parent)))
>> +               return NOTIFY_DONE;
>> +
>> There's a problem here in terms of error (particularly on the active
>> slave, e.g. VF), see below:
>>
>>> +       ret = netdev_rx_handler_register(child_netdev,
>>> +                                        virtnet_bypass_handle_frame,
>>> dev);
>>> +       if (ret != 0) {
>>> +               netdev_err(child_netdev,
>>> +                          "can not register bypass receive handler (err
>>> = %d)\n",
>>> +                          ret);
>>> +               goto rx_handler_failed;
>>> +       }
>>> +
>>> +       ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>>> +       if (ret != 0) {
>>> +               netdev_err(child_netdev,
>>> +                          "can not set master device %s (err = %d)\n",
>>> +                          dev->name, ret);
>>> +               goto upper_link_failed;
>>> +       }
>>> +
>>> +       child_netdev->flags |= IFF_SLAVE;
>>> +
>>> +       if (netif_running(dev)) {
>>> +               ret = dev_open(child_netdev);
>>> +               if (ret && (ret != -EBUSY)) {
>>> +                       netdev_err(dev, "Opening child %s failed
>>> ret:%d\n",
>>> +                                  child_netdev->name, ret);
>>> +                       goto err_interface_up;
>>> +               }
>>> +       }
>>> +
>>> +       /* Align MTU of child with master */
>>> +       ret = dev_set_mtu(child_netdev, dev->mtu);
>>> +       if (ret) {
>>> +               netdev_err(dev,
>>> +                          "unable to change mtu of %s to %u register
>>> failed\n",
>>> +                          child_netdev->name, dev->mtu);
>>> +               goto err_set_mtu;
>>> +       }
>>
>> If any of the above calls returns non-zero, the current code steps
>> back to undo what's being done on that spefic slave previously. For
>> instance, if the netdev_rx_handler_register returns non-zero because
>> of busy rx handler, this register_child function would give up
>> enslaving the VF and leave the upper virtio_bypass interface behind
>> once it returns.
>
>
> virtio_bypass interface is the upper master netdev and it is always present
> when
> BACKUP feature is enabled.
> If there is failure with enslaving VF, there will be 2 netdevs,
> master virtio_bypass and slave virtio_net and the VM can be migrated.

Failure in enslaving the VF does not mean the netdev would disappear.
That means there will be 3 seperate netdevs left behind, a master
virtio_bypass with slave virtio_net, plus a standalone VF. In this
case the VF are still useable as the datapath is still there, while
the datapath for virtio_* is made standby. Which interface you expect
user to continue to use in case of failure?

If it's the VF interface you choose, there should be some detection
for that condition in virtio_net, hide the virtio interfaces from user
by unregistering those two virtio netdevs (master virtio_bypass with
slave virtio_net), presumably.

If not so user has to use virtio_bypass, how do you prevent users from
using the VF, or consequently, indicate/propogate the failure to
backend and switch the host datapath to virtio?

I don't see either of the above being handled on your patch. Unless I
miss something obvious, I don't think this patch is close to achieving
the goal in terms of completeness.

>
>
>> I am not sure if it's a good idea to leave the
>> virtio_bypass around if running into failure: the guest is not
>> migratable as the VF doesn't have a backup path,
>
>
> Are you talking about a failure when registering backup netdev?  This should

No, see above. I was mainly concerned with the failure in enslaving VF
(active netdev).

If users continue to use raw VF interface within the VM, the guest
cannot be migrated for sure.

> not
> happen, but i guess we can improve error handling in such scenario.
>
>
>> And perhaps the worse
>> part is that, it now has two interfaces with identical MAC address but
>> one of them is invalid (user cannot use the virtio interface as it has
>> a dampened datapath). IMHO the virtio_bypass and its lower netdev
>> should be destroyed at all when it fails to bind the VF, and
>> technically, there should be some way to propogate the failure status
>> to the hypervisor/backend, indicating that the VM is not migratable
>> because of guest software errors (maybe by clearing out the backup
>> feature from the guest virtio driver so host can see/learn it).
>
>
> In BACKUP mode, user can only use the upper virtio_bypass netdev and that
> will
> always be there. Any failure to enslave VF netdev is not fatal, but i will
> see
> if we can improve the error handling of failure to enslave backup netdev.
> Also, i don't think the BACKUP feature bit is negotiable with the host.

I don't believe that is the case. In any event, you should come up
with whatever means to indicate to the hypervisor that some VF are
still left unbound with a virtio, so QEMU knows the VM cannot live
migrate when host user attempts to do so. Or, alternatively, propoate
the failure to host, plug out the VF and switch the host datapath to
virtio.  Please work out a complete solution to address the error
case.

Regards,
-Siwei

>
> Thanks
> Sridhar
>
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
@ 2018-03-14  4:50         ` Siwei Liu
  0 siblings, 0 replies; 102+ messages in thread
From: Siwei Liu @ 2018-03-14  4:50 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski

On Tue, Mar 13, 2018 at 5:28 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 3/12/2018 3:44 PM, Siwei Liu wrote:
>>
>> Apologies, still some comments going. Please see inline.
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudrala@intel.com> wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF
>>> is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host
>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>> virtio before starting the migration. After the migration is completed,
>>> the destination hypervisor sets the MAC filter on the VF and plugs it
>>> back
>>> to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and
>>> a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>> ---
>>>   drivers/net/virtio_net.c | 683
>>> ++++++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>>   #include <linux/cpu.h>
>>>   #include <linux/average.h>
>>>   #include <linux/filter.h>
>>> +#include <linux/netdevice.h>
>>> +#include <linux/pci.h>
>>>   #include <net/route.h>
>>>   #include <net/xdp.h>
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>>          u32 speed;
>>>
>>>          unsigned long guest_offloads;
>>> +
>>> +       /* upper netdev created when BACKUP feature enabled */
>>> +       struct net_device *bypass_netdev;
>>>   };
>>>
>>>   struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev,
>>> struct netdev_bpf *xdp)
>>>          }
>>>   }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> +                                     size_t len)
>>> +{
>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>> +       int ret;
>>> +
>>> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       ret = snprintf(buf, len, "_bkup");
>>> +       if (ret >= len)
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       return 0;
>>> +}
>>> +
>>>   static const struct net_device_ops virtnet_netdev = {
>>>          .ndo_open            = virtnet_open,
>>>          .ndo_stop            = virtnet_close,
>>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev =
>>> {
>>>          .ndo_xdp_xmit           = virtnet_xdp_xmit,
>>>          .ndo_xdp_flush          = virtnet_xdp_flush,
>>>          .ndo_features_check     = passthru_features_check,
>>> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>>   };
>>>
>>>   static void virtnet_config_changed_work(struct work_struct *work)
>>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device
>>> *vdev)
>>>          return 0;
>>>   }
>>>
>>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>>> + * is created that acts as a master device and tracks the state of the
>>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>>> + * 'backup' netdev and a passthru device with the same MAC is registered
>>> + * as 'active' netdev.
>>> + */
>>> +
>>> +/* bypass state maintained when BACKUP feature is enabled */
>>> +struct virtnet_bypass_info {
>>> +       /* passthru netdev with same MAC */
>>> +       struct net_device __rcu *active_netdev;
>>> +
>>> +       /* virtio_net netdev */
>>> +       struct net_device __rcu *backup_netdev;
>>> +
>>> +       /* active netdev stats */
>>> +       struct rtnl_link_stats64 active_stats;
>>> +
>>> +       /* backup netdev stats */
>>> +       struct rtnl_link_stats64 backup_stats;
>>> +
>>> +       /* aggregated stats */
>>> +       struct rtnl_link_stats64 bypass_stats;
>>> +
>>> +       /* spinlock while updating stats */
>>> +       spinlock_t stats_lock;
>>> +};
>>> +
>>> +static void virtnet_bypass_child_open(struct net_device *dev,
>>> +                                     struct net_device *child_netdev)
>>> +{
>>> +       int err = dev_open(child_netdev);
>>> +
>>> +       if (err)
>>> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
>>> +                           child_netdev->name, err);
>>> +}
>>
>> Why ignoring the error?? i.e. virtnet_bypass_child_open should have
>> return value. I don't believe the caller is in a safe context to
>> guarantee the dev_open always succeeds.
>
>
> OK.  Will handle this in the next revision.
>
>
>>
>>> +
>>> +static int virtnet_bypass_open(struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +
>>> +       netif_carrier_off(dev);
>>> +       netif_tx_wake_all_queues(dev);
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->active_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>
>> Handle the error?
>>
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>
>> Handle the error and unwind?
>
>
> Sure.
>
> <snip>
>
>
>> +
>> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>> +{
>> +       struct virtnet_bypass_info *vbi;
>> +       struct net_device *dev;
>> +       bool backup;
>> +       int ret;
>> +
>> +       if (child_netdev->addr_len != ETH_ALEN)
>> +               return NOTIFY_DONE;
>> +
>> +       /* We will use the MAC address to locate the virtnet_bypass netdev
>> +        * to associate with the child netdev. If we don't find a matching
>> +        * bypass netdev, move on.
>> +        */
>> +       dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>> +                                      child_netdev->perm_addr);
>> +       if (!dev)
>> +               return NOTIFY_DONE;
>> +
>> +       vbi = netdev_priv(dev);
>> +       backup = (child_netdev->dev.parent == dev->dev.parent);
>> +       if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> +                       rtnl_dereference(vbi->active_netdev)) {
>> +               netdev_info(dev,
>> +                           "%s attempting to join bypass dev when %s
>> already present\n",
>> +                           child_netdev->name, backup ? "backup" :
>> "active");
>> +               return NOTIFY_DONE;
>> +       }
>> +
>> +       /* Avoid non pci devices as active netdev */
>> +       if (!backup && (!child_netdev->dev.parent ||
>> +                       !dev_is_pci(child_netdev->dev.parent)))
>> +               return NOTIFY_DONE;
>> +
>> There's a problem here in terms of error (particularly on the active
>> slave, e.g. VF), see below:
>>
>>> +       ret = netdev_rx_handler_register(child_netdev,
>>> +                                        virtnet_bypass_handle_frame,
>>> dev);
>>> +       if (ret != 0) {
>>> +               netdev_err(child_netdev,
>>> +                          "can not register bypass receive handler (err
>>> = %d)\n",
>>> +                          ret);
>>> +               goto rx_handler_failed;
>>> +       }
>>> +
>>> +       ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>>> +       if (ret != 0) {
>>> +               netdev_err(child_netdev,
>>> +                          "can not set master device %s (err = %d)\n",
>>> +                          dev->name, ret);
>>> +               goto upper_link_failed;
>>> +       }
>>> +
>>> +       child_netdev->flags |= IFF_SLAVE;
>>> +
>>> +       if (netif_running(dev)) {
>>> +               ret = dev_open(child_netdev);
>>> +               if (ret && (ret != -EBUSY)) {
>>> +                       netdev_err(dev, "Opening child %s failed
>>> ret:%d\n",
>>> +                                  child_netdev->name, ret);
>>> +                       goto err_interface_up;
>>> +               }
>>> +       }
>>> +
>>> +       /* Align MTU of child with master */
>>> +       ret = dev_set_mtu(child_netdev, dev->mtu);
>>> +       if (ret) {
>>> +               netdev_err(dev,
>>> +                          "unable to change mtu of %s to %u register
>>> failed\n",
>>> +                          child_netdev->name, dev->mtu);
>>> +               goto err_set_mtu;
>>> +       }
>>
>> If any of the above calls returns non-zero, the current code steps
>> back to undo what's being done on that spefic slave previously. For
>> instance, if the netdev_rx_handler_register returns non-zero because
>> of busy rx handler, this register_child function would give up
>> enslaving the VF and leave the upper virtio_bypass interface behind
>> once it returns.
>
>
> virtio_bypass interface is the upper master netdev and it is always present
> when
> BACKUP feature is enabled.
> If there is failure with enslaving VF, there will be 2 netdevs,
> master virtio_bypass and slave virtio_net and the VM can be migrated.

Failure in enslaving the VF does not mean the netdev would disappear.
That means there will be 3 seperate netdevs left behind, a master
virtio_bypass with slave virtio_net, plus a standalone VF. In this
case the VF are still useable as the datapath is still there, while
the datapath for virtio_* is made standby. Which interface you expect
user to continue to use in case of failure?

If it's the VF interface you choose, there should be some detection
for that condition in virtio_net, hide the virtio interfaces from user
by unregistering those two virtio netdevs (master virtio_bypass with
slave virtio_net), presumably.

If not so user has to use virtio_bypass, how do you prevent users from
using the VF, or consequently, indicate/propogate the failure to
backend and switch the host datapath to virtio?

I don't see either of the above being handled on your patch. Unless I
miss something obvious, I don't think this patch is close to achieving
the goal in terms of completeness.

>
>
>> I am not sure if it's a good idea to leave the
>> virtio_bypass around if running into failure: the guest is not
>> migratable as the VF doesn't have a backup path,
>
>
> Are you talking about a failure when registering backup netdev?  This should

No, see above. I was mainly concerned with the failure in enslaving VF
(active netdev).

If users continue to use raw VF interface within the VM, the guest
cannot be migrated for sure.

> not
> happen, but i guess we can improve error handling in such scenario.
>
>
>> And perhaps the worse
>> part is that, it now has two interfaces with identical MAC address but
>> one of them is invalid (user cannot use the virtio interface as it has
>> a dampened datapath). IMHO the virtio_bypass and its lower netdev
>> should be destroyed at all when it fails to bind the VF, and
>> technically, there should be some way to propogate the failure status
>> to the hypervisor/backend, indicating that the VM is not migratable
>> because of guest software errors (maybe by clearing out the backup
>> feature from the guest virtio driver so host can see/learn it).
>
>
> In BACKUP mode, user can only use the upper virtio_bypass netdev and that
> will
> always be there. Any failure to enslave VF netdev is not fatal, but i will
> see
> if we can improve the error handling of failure to enslave backup netdev.
> Also, i don't think the BACKUP feature bit is negotiable with the host.

I don't believe that is the case. In any event, you should come up
with whatever means to indicate to the hypervisor that some VF are
still left unbound with a virtio, so QEMU knows the VM cannot live
migrate when host user attempts to do so. Or, alternatively, propoate
the failure to host, plug out the VF and switch the host datapath to
virtio.  Please work out a complete solution to address the error
case.

Regards,
-Siwei

>
> Thanks
> Sridhar
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
  2018-03-14  0:36           ` [virtio-dev] " Samudrala, Sridhar
  (?)
  (?)
@ 2018-03-14 15:45           ` Jiri Pirko
  -1 siblings, 0 replies; 102+ messages in thread
From: Jiri Pirko @ 2018-03-14 15:45 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici

Wed, Mar 14, 2018 at 01:36:23AM CET, sridhar.samudrala@intel.com wrote:
>On 3/12/2018 2:08 PM, Jiri Pirko wrote:
>> Mon, Mar 12, 2018 at 09:58:06PM CET, sridhar.samudrala@intel.com wrote:
>> > 
>> > On 3/12/2018 1:12 PM, Jiri Pirko wrote:
>> > > Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>> > > > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > > > netdev is present with the same MAC address. It allows live migration
>> > > > of a VM with a direct attached VF without the need to setup a bond/team
>> > > > between a VF and virtio net device in the guest.
>> > > > 
>> > > > The hypervisor needs to enable only one datapath at any time so that
>> > > > packets don't get looped back to the VM over the other datapath. When a VF
>> > > > is plugged, the virtio datapath link state can be marked as down. The
>> > > > hypervisor needs to unplug the VF device from the guest on the source host
>> > > > and reset the MAC filter of the VF to initiate failover of datapath to
>> > > > virtio before starting the migration. After the migration is completed,
>> > > > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> > > > to the guest to switch over to VF datapath.
>> > > > 
>> > > > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> > > > created that acts as a master device and tracks the state of the 2 lower
>> > > > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> > > > passthru device with the same MAC is registered as 'active' netdev.
>> > > > 
>> > > > This patch is based on the discussion initiated by Jesse on this thread.
>> > > > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> > > > 
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> > > > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> > > [...]
>> > > 
>> > > 
>> > > > +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>> > > > +{
>> > > > +	struct virtnet_bypass_info *vbi;
>> > > > +	struct net_device *dev;
>> > > > +	bool backup;
>> > > > +	int ret;
>> > > > +
>> > > > +	if (child_netdev->addr_len != ETH_ALEN)
>> > > > +		return NOTIFY_DONE;
>> > > > +
>> > > > +	/* We will use the MAC address to locate the virtnet_bypass netdev
>> > > > +	 * to associate with the child netdev. If we don't find a matching
>> > > > +	 * bypass netdev, move on.
>> > > > +	 */
>> > > > +	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>> > > > +				       child_netdev->perm_addr);
>> > > > +	if (!dev)
>> > > > +		return NOTIFY_DONE;
>> > > > +
>> > > > +	vbi = netdev_priv(dev);
>> > > > +	backup = (child_netdev->dev.parent == dev->dev.parent);
>> > > > +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > +			rtnl_dereference(vbi->active_netdev)) {
>> > > > +		netdev_info(dev,
>> > > > +			    "%s attempting to join bypass dev when %s already present\n",
>> > > > +			    child_netdev->name, backup ? "backup" : "active");
>> > > > +		return NOTIFY_DONE;
>> > > > +	}
>> > > > +
>> > > > +	/* Avoid non pci devices as active netdev */
>> > > > +	if (!backup && (!child_netdev->dev.parent ||
>> > > > +			!dev_is_pci(child_netdev->dev.parent)))
>> > > > +		return NOTIFY_DONE;
>> > > > +
>> > > > +	ret = netdev_rx_handler_register(child_netdev,
>> > > > +					 virtnet_bypass_handle_frame, dev);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(child_netdev,
>> > > > +			   "can not register bypass receive handler (err = %d)\n",
>> > > > +			   ret);
>> > > > +		goto rx_handler_failed;
>> > > > +	}
>> > > > +
>> > > > +	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(child_netdev,
>> > > > +			   "can not set master device %s (err = %d)\n",
>> > > > +			   dev->name, ret);
>> > > > +		goto upper_link_failed;
>> > > > +	}
>> > > > +
>> > > > +	child_netdev->flags |= IFF_SLAVE;
>> > > > +
>> > > > +	if (netif_running(dev)) {
>> > > > +		ret = dev_open(child_netdev);
>> > > > +		if (ret && (ret != -EBUSY)) {
>> > > > +			netdev_err(dev, "Opening child %s failed ret:%d\n",
>> > > > +				   child_netdev->name, ret);
>> > > > +			goto err_interface_up;
>> > > > +		}
>> > > > +	}
>> > > Much of this function is copy of netvsc_vf_join, should be shared with
>> > > netvsc.
>> > Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified
>> > to handle registration of 2 child netdevs(backup virtio and active vf).ï¿½ In case
>> > of netvsc, they only register VF. There is no backup netdev.
>> > I think trying to make this code shareable with netvsc will introduce additional
>> > checks and make it convoluted.
>> No problem.
>
>Not clear what you meant here?
>Want to confirm that you are agreeing that it is OK to not share this function
>with netvsc.

Please share as much as possible. And this is possible. netdev_upper_dev_link
IFF_SLAVE flag set netdev_rx_handler_register, checks etc.


>
>
>> 
>> 
>> > 
>> > > 
>> > > > +
>> > > > +	/* Align MTU of child with master */
>> > > > +	ret = dev_set_mtu(child_netdev, dev->mtu);
>> > > > +	if (ret) {
>> > > > +		netdev_err(dev,
>> > > > +			   "unable to change mtu of %s to %u register failed\n",
>> > > > +			   child_netdev->name, dev->mtu);
>> > > > +		goto err_set_mtu;
>> > > > +	}
>> > > > +
>> > > > +	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>> > > > +
>> > > > +	netdev_info(dev, "registering %s\n", child_netdev->name);
>> > > > +
>> > > > +	dev_hold(child_netdev);
>> > > > +	if (backup) {
>> > > > +		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>> > > > +		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>> > > > +	} else {
>> > > > +		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>> > > > +		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>> > > > +		dev->min_mtu = child_netdev->min_mtu;
>> > > > +		dev->max_mtu = child_netdev->max_mtu;
>> > > > +	}
>> > > > +
>> > > > +	return NOTIFY_OK;
>> > > > +
>> > > > +err_set_mtu:
>> > > > +	dev_close(child_netdev);
>> > > > +err_interface_up:
>> > > > +	netdev_upper_dev_unlink(child_netdev, dev);
>> > > > +	child_netdev->flags &= ~IFF_SLAVE;
>> > > > +upper_link_failed:
>> > > > +	netdev_rx_handler_unregister(child_netdev);
>> > > > +rx_handler_failed:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>> > > > +{
>> > > > +	struct virtnet_bypass_info *vbi;
>> > > > +	struct net_device *dev, *backup;
>> > > > +
>> > > > +	dev = get_virtnet_bypass_byref(child_netdev);
>> > > > +	if (!dev)
>> > > > +		return NOTIFY_DONE;
>> > > > +
>> > > > +	vbi = netdev_priv(dev);
>> > > > +
>> > > > +	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>> > > > +
>> > > > +	netdev_rx_handler_unregister(child_netdev);
>> > > > +	netdev_upper_dev_unlink(child_netdev, dev);
>> > > > +	child_netdev->flags &= ~IFF_SLAVE;
>> > > > +
>> > > > +	if (child_netdev->dev.parent == dev->dev.parent) {
>> > > > +		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>> > > > +	} else {
>> > > > +		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>> > > > +		backup = rtnl_dereference(vbi->backup_netdev);
>> > > > +		if (backup) {
>> > > > +			dev->min_mtu = backup->min_mtu;
>> > > > +			dev->max_mtu = backup->max_mtu;
>> > > > +		}
>> > > > +	}
>> > > > +
>> > > > +	dev_put(child_netdev);
>> > > > +
>> > > > +	return NOTIFY_OK;
>> > > > +}
>> > > > +
>> > > > +static int virtnet_bypass_update_link(struct net_device *child_netdev)
>> > > > +{
>> > > > +	struct net_device *dev, *active, *backup;
>> > > > +	struct virtnet_bypass_info *vbi;
>> > > > +
>> > > > +	dev = get_virtnet_bypass_byref(child_netdev);
>> > > > +	if (!dev || !netif_running(dev))
>> > > > +		return NOTIFY_DONE;
>> > > > +
>> > > > +	vbi = netdev_priv(dev);
>> > > > +
>> > > > +	active = rtnl_dereference(vbi->active_netdev);
>> > > > +	backup = rtnl_dereference(vbi->backup_netdev);
>> > > > +
>> > > > +	if ((active && virtnet_bypass_xmit_ready(active)) ||
>> > > > +	    (backup && virtnet_bypass_xmit_ready(backup))) {
>> > > > +		netif_carrier_on(dev);
>> > > > +		netif_tx_wake_all_queues(dev);
>> > > > +	} else {
>> > > > +		netif_carrier_off(dev);
>> > > > +		netif_tx_stop_all_queues(dev);
>> > > > +	}
>> > > > +
>> > > > +	return NOTIFY_OK;
>> > > > +}
>> > > > +
>> > > > +static int virtnet_bypass_event(struct notifier_block *this,
>> > > > +				unsigned long event, void *ptr)
>> > > > +{
>> > > > +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> > > > +
>> > > > +	/* Skip our own events */
>> > > > +	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>> > > > +		return NOTIFY_DONE;
>> > > > +
>> > > > +	/* Avoid non-Ethernet type devices */
>> > > > +	if (event_dev->type != ARPHRD_ETHER)
>> > > > +		return NOTIFY_DONE;
>> > > > +
>> > > > +	/* Avoid Vlan dev with same MAC registering as child dev */
>> > > > +	if (is_vlan_dev(event_dev))
>> > > > +		return NOTIFY_DONE;
>> > > > +
>> > > > +	/* Avoid Bonding master dev with same MAC registering as child dev */
>> > > > +	if ((event_dev->priv_flags & IFF_BONDING) &&
>> > > > +	    (event_dev->flags & IFF_MASTER))
>> > > > +		return NOTIFY_DONE;
>> > > > +
>> > > > +	switch (event) {
>> > > > +	case NETDEV_REGISTER:
>> > > > +		return virtnet_bypass_register_child(event_dev);
>> > > > +	case NETDEV_UNREGISTER:
>> > > > +		return virtnet_bypass_unregister_child(event_dev);
>> > > > +	case NETDEV_UP:
>> > > > +	case NETDEV_DOWN:
>> > > > +	case NETDEV_CHANGE:
>> > > > +		return virtnet_bypass_update_link(event_dev);
>> > > > +	default:
>> > > > +		return NOTIFY_DONE;
>> > > > +	}
>> > > > +}
>> > > For example this function is 1:1 copy of netvsc, even with comments
>> > > and bugs (like IFF_BODING check).
>> > > 
>> > > This is also something that should be shared with netvsc.
>> > The problem is with the calls that are made from this function.
>> > Sharing would have been possible if both virtio and netvsc went with
>> > same netdev model.
>> ops? You can still share.
>
>OK. looks like you want to see atleast some code shared between netvsc and virtio_net
>even when they are using 2 different netdev models.

Yes please.

>Will try to add a new file under net/core and move some functions that can be shared
>between netvsc and virtio_net in BACKUP mode.
>
>
>> 
>> 
>> > > 
>> > > > +
>> > > > +static struct notifier_block virtnet_bypass_notifier = {
>> > > > +	.notifier_call = virtnet_bypass_event,
>> > > > +};
>> > > > +
>> > > > +static int virtnet_bypass_create(struct virtnet_info *vi)
>> > > > +{
>> > > > +	struct net_device *backup_netdev = vi->dev;
>> > > > +	struct device *dev = &vi->vdev->dev;
>> > > > +	struct net_device *bypass_netdev;
>> > > > +	int res;
>> > > > +
>> > > > +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> > > > +	 * that most devices being bonded won't have too many queues.
>> > > > +	 */
>> > > > +	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>> > > > +					  16);
>> > > > +	if (!bypass_netdev) {
>> > > > +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> > > > +		return -ENOMEM;
>> > > > +	}
>> > > > +
>> > > > +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> > > > +	SET_NETDEV_DEV(bypass_netdev, dev);
>> > > > +
>> > > > +	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>> > > > +	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>> > > > +
>> > > > +	/* Initialize the device options */
>> > > > +	bypass_netdev->flags |= IFF_MASTER;
>> > > > +	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
>> > > No clue why you set IFF_BONDING here...
>> > This is to indicate to the userspace that this is a master bond device.
>> This is not a master bond device! If you say this, it makes me really
>> worried....
>> 
>Will remove IFF_BONDING in the next revision.

Also please remove IFF_MASTER setting. Thanks


>
>
>> 
>> > May be it is sufficient to just set IFF_MASTER.
>> > 
>> > > 
>> > > 
>> > > 
>> > > 
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread, other threads:[~2018-03-14 15:45 UTC | newest]

Thread overview: 102+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-01 20:08 [PATCH v4 0/2] Enable virtio_net to act as a backup for a passthru device Sridhar Samudrala
2018-03-01 20:08 ` [virtio-dev] " Sridhar Samudrala
2018-03-01 20:08 ` [PATCH v4 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit Sridhar Samudrala
2018-03-01 20:08   ` [virtio-dev] " Sridhar Samudrala
2018-03-01 20:08 ` [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available Sridhar Samudrala
2018-03-01 20:08   ` [virtio-dev] " Sridhar Samudrala
2018-03-02  8:36   ` Jiri Pirko
2018-03-02 15:26     ` Alexander Duyck
2018-03-02 15:26       ` [virtio-dev] " Alexander Duyck
2018-03-02 16:20       ` Jiri Pirko
2018-03-02 16:37         ` Samudrala, Sridhar
2018-03-02 16:37           ` [virtio-dev] " Samudrala, Sridhar
2018-03-02 17:06           ` Alexander Duyck
2018-03-02 17:06             ` [virtio-dev] " Alexander Duyck
2018-03-02 19:42         ` Michael S. Tsirkin
2018-03-02 19:42           ` [virtio-dev] " Michael S. Tsirkin
2018-03-02 20:49           ` Siwei Liu
2018-03-02 20:49             ` [virtio-dev] " Siwei Liu
2018-03-03 11:31           ` Jiri Pirko
2018-03-03 18:04             ` Alexander Duyck
2018-03-03 18:04               ` [virtio-dev] " Alexander Duyck
2018-03-03 21:25               ` Jiri Pirko
2018-03-04  0:26                 ` Alexander Duyck
2018-03-04  0:26                   ` [virtio-dev] " Alexander Duyck
2018-03-04  7:13                   ` Jiri Pirko
2018-03-04 18:24                     ` Alexander Duyck
2018-03-04 18:24                       ` [virtio-dev] " Alexander Duyck
2018-03-04 18:50                       ` Jiri Pirko
2018-03-04 21:54                         ` Samudrala, Sridhar
2018-03-04 21:54                           ` [virtio-dev] " Samudrala, Sridhar
2018-03-04 21:58                         ` Alexander Duyck
2018-03-04 21:58                           ` [virtio-dev] " Alexander Duyck
2018-03-05  9:21                           ` Jiri Pirko
2018-03-05 16:11                             ` Stephen Hemminger
2018-03-05 22:30                               ` Jiri Pirko
2018-03-05 22:47                                 ` Alexander Duyck
2018-03-05 22:47                                   ` [virtio-dev] " Alexander Duyck
2018-03-06  3:15                                   ` Stephen Hemminger
2018-03-06 19:08                                     ` Alexander Duyck
2018-03-06 19:08                                       ` [virtio-dev] " Alexander Duyck
2018-03-06 22:59                                       ` Jiri Pirko
2018-03-06 23:27                                         ` Alexander Duyck
2018-03-06 23:27                                           ` [virtio-dev] " Alexander Duyck
2018-03-07  2:38                                           ` Michael S. Tsirkin
2018-03-07  2:38                                             ` [virtio-dev] " Michael S. Tsirkin
2018-03-07 17:50                                             ` Alexander Duyck
2018-03-07 17:50                                               ` [virtio-dev] " Alexander Duyck
2018-03-07 18:06                                               ` Stephen Hemminger
2018-03-07 18:55                                                 ` Alexander Duyck
2018-03-07 18:55                                                   ` [virtio-dev] " Alexander Duyck
2018-03-07 20:11                                                 ` Michael S. Tsirkin
2018-03-07 20:11                                                   ` [virtio-dev] " Michael S. Tsirkin
2018-03-12 18:47                                                   ` Samudrala, Sridhar
2018-03-12 18:47                                                     ` [virtio-dev] " Samudrala, Sridhar
2018-03-02 19:41       ` Michael S. Tsirkin
2018-03-02 19:41         ` [virtio-dev] " Michael S. Tsirkin
2018-03-02 19:52         ` Samudrala, Sridhar
2018-03-02 19:52           ` [virtio-dev] " Samudrala, Sridhar
2018-03-02 20:10           ` Michael S. Tsirkin
2018-03-02 20:10             ` [virtio-dev] " Michael S. Tsirkin
2018-03-02 20:44             ` Siwei Liu
2018-03-02 20:44               ` [virtio-dev] " Siwei Liu
2018-03-02 20:56               ` Samudrala, Sridhar
2018-03-02 20:56                 ` [virtio-dev] " Samudrala, Sridhar
2018-03-02 21:33                 ` Michael S. Tsirkin
2018-03-02 21:33                   ` [virtio-dev] " Michael S. Tsirkin
2018-03-02 21:31               ` Michael S. Tsirkin
2018-03-02 21:31                 ` [virtio-dev] " Michael S. Tsirkin
2018-03-02 22:26                 ` Siwei Liu
2018-03-02 22:26                   ` [virtio-dev] " Siwei Liu
2018-03-04  4:00                   ` Michael S. Tsirkin
2018-03-04  4:00                     ` [virtio-dev] " Michael S. Tsirkin
2018-03-02 21:11   ` Siwei Liu
2018-03-02 21:11     ` [virtio-dev] " Siwei Liu
2018-03-02 21:36     ` Michael S. Tsirkin
2018-03-02 21:36       ` [virtio-dev] " Michael S. Tsirkin
2018-03-02 23:56       ` Siwei Liu
2018-03-02 23:56         ` [virtio-dev] " Siwei Liu
2018-03-04  4:04         ` Michael S. Tsirkin
2018-03-04  4:04           ` [virtio-dev] " Michael S. Tsirkin
2018-03-12 21:53           ` Siwei Liu
2018-03-12 21:53             ` [virtio-dev] " Siwei Liu
2018-03-02 23:12     ` Samudrala, Sridhar
2018-03-02 23:12       ` [virtio-dev] " Samudrala, Sridhar
2018-03-03  0:09       ` Siwei Liu
2018-03-03  0:09         ` [virtio-dev] " Siwei Liu
2018-03-12 20:12   ` Jiri Pirko
2018-03-12 20:58     ` Samudrala, Sridhar
2018-03-12 20:58       ` [virtio-dev] " Samudrala, Sridhar
2018-03-12 21:08       ` Jiri Pirko
2018-03-14  0:36         ` Samudrala, Sridhar
2018-03-14  0:36           ` [virtio-dev] " Samudrala, Sridhar
2018-03-14  0:54           ` Stephen Hemminger
2018-03-14 15:45           ` Jiri Pirko
2018-03-12 22:44   ` Siwei Liu
2018-03-12 22:44     ` [virtio-dev] " Siwei Liu
2018-03-14  0:28     ` Samudrala, Sridhar
2018-03-14  0:28       ` [virtio-dev] " Samudrala, Sridhar
2018-03-14  0:44       ` Michael S. Tsirkin
2018-03-14  0:44         ` [virtio-dev] " Michael S. Tsirkin
2018-03-14  4:50       ` Siwei Liu
2018-03-14  4:50         ` [virtio-dev] " Siwei Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.