* [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Sridhar Samudrala @ 2018-02-16 18:11 UTC
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh

Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by the hypervisor to indicate that the virtio_net interface should
act as a backup for another device with the same MAC address.

Patch 2 is in response to the community request for a 3 netdev
solution.  However, it creates some issues we'll get into in a moment.
It extends virtio_net to use an alternate datapath when one is available
and registered. When the BACKUP feature is enabled, the virtio_net
driver creates an additional 'bypass' netdev that acts as a master
device and controls 2 slave devices.  The original virtio_net netdev is
registered as the 'backup' netdev and a passthru/VF device with the same
MAC gets registered as the 'active' netdev. Both the 'bypass' and
'backup' netdevs are associated with the same 'pci' device.  The user
accesses the network interface via the 'bypass' netdev, which picks the
'active' netdev as the default for transmits whenever it is available
with link up and running.
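
At a high level, the 'bypass' netdev prefers the VF for transmits and
falls back to virtio when the VF is not usable. Condensed from
virtnet_bypass_start_xmit() in patch 2 (queue mapping and the drop path
are omitted here):

	/* Try xmit via 'active' (VF) netdev, then 'backup' (virtio) */
	xmit_dev = rcu_dereference_bh(vbi->active_netdev);
	if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
		xmit_dev = rcu_dereference_bh(vbi->backup_netdev);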

We noticed a couple of issues with this approach during testing.
- As both the 'bypass' and 'backup' netdevs are associated with the same
  virtio pci device, udev tries to rename both of them with the same
  name and the second rename fails. This would be OK as long as the
  first netdev to be renamed is the 'bypass' netdev, but the order in
  which udev gets to rename the 2 netdevs is not reliable.
- When the 'active' netdev is unplugged OR not present on a destination
  system after live migration, the user will see 2 virtio_net netdevs.

Patch 3 refactors much of what patch 2 adds; this was done on purpose
to show the solution we recommend within a single patch set. If we
submit a final version of this, we would combine patches 2 and 3.
This patch removes the creation of an additional netdev. Instead, it
uses a new virtnet_bypass_info struct, added to the original 'backup'
netdev, to track the 'bypass' state, and introduces an additional set of
ndo and ethtool ops that are used when the BACKUP feature is enabled.
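
A rough sketch of the 2 netdev approach follows; the exact hook point
shown here is illustrative rather than lifted from patch 3, though the
ops struct names match the ones introduced for the bypass device:

	/* sketch: with 2 netdevs, the original virtio_net netdev itself
	 * switches to the bypass set of ops when BACKUP is negotiated
	 */
	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
		dev->netdev_ops = &virtnet_bypass_netdev_ops;
		dev->ethtool_ops = &virtnet_bypass_ethtool_ops;
	}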

One difference between the 3 netdev model and the 2 netdev model is that
the 'bypass' netdev is created with a 'noqueue' qdisc and is marked
NETIF_F_LLTX. This avoids going through an additional qdisc and
acquiring an extra qdisc/tx lock during transmits.
If we can replace the qdisc of the virtio netdev dynamically, it should
be possible to enable these optimizations with the 2 netdev model as
well when the BACKUP feature is enabled.
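
Concretely, patch 2 sets this up when creating the bypass netdev
(excerpted from virtnet_bypass_create()):

	/* no qdisc on the bypass netdev... */
	bypass_netdev->priv_flags |= IFF_NO_QUEUE;
	/* ...and don't acquire its netif_tx_lock when transmitting */
	bypass_netdev->features |= NETIF_F_LLTX;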

As this patch series initially focuses on use cases where the hypervisor
fully controls the VM networking and the guest is not expected to
directly configure any hardware settings, it doesn't expose all the
ndo/ethtool ops that virtio_net supports at this time. To support
additional use cases, it should be possible to enable more ops later by
caching the state in the virtio netdev and replaying it when the
'active' netdev gets registered.
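
As a rough illustration of the cache-and-replay idea (everything below
is hypothetical and not part of this series):

	/* hypothetical: remember guest-requested settings and replay
	 * them onto the VF when it registers as the 'active' netdev
	 */
	struct virtnet_cached_state {
		int mtu;	/* last value seen via ndo_change_mtu */
	};

	static void virtnet_replay_state(struct net_device *active,
					 const struct virtnet_cached_state *st)
	{
		dev_set_mtu(active, st->mtu);
	}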
 
The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a
VF is plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF
device from the guest on the source host and reset the MAC filter of the
VF to initiate failover of the datapath to virtio before starting the
migration. After the migration is completed, the destination hypervisor
sets the MAC filter on the VF and plugs it back into the guest to switch
over to the VF datapath.

This patch series is based on the discussion initiated by Jesse on this thread:
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Sridhar Samudrala (3):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  virtio_net: Extend virtio to use VF datapath when available
  virtio_net: Enable alternate datapath without creating an additional
    netdev

 drivers/net/virtio_net.c        | 564 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_net.h |   3 +
 2 files changed, 563 insertions(+), 4 deletions(-)

-- 
2.14.3

* [RFC PATCH v3 1/3] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Sridhar Samudrala @ 2018-02-16 18:11 UTC
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh

This feature bit can be used by the hypervisor to indicate that the
virtio_net device should act as a backup for another device with the
same MAC address.

VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.
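
For context, the driver consumes the bit via the standard feature check;
patch 2 uses exactly this to gate creation of the bypass netdev:

	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
		if (virtnet_bypass_create(vi) != 0)
			goto free_vqs;
	}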

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 626c27352ae2..bcd13fe906ca 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2920,7 +2920,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP	  62	/* Act as backup for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3

* [RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available
From: Sridhar Samudrala @ 2018-02-16 18:11 UTC
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration of
a VM with a directly attached VF without the need to set up a bond/team
between the VF and the virtio_net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of the datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
into the guest to switch over to the VF datapath.

When the BACKUP feature is enabled, an additional netdev (the bypass
netdev) is created that acts as a master device and tracks the state of
the 2 lower netdevs. The original virtio_net netdev is marked as the
'backup' netdev and a passthru device with the same MAC is registered as
the 'active' netdev.
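
On the receive side, the redirection is done with a netdev rx_handler:
frames arriving on either child netdev are re-parented to the bypass
netdev before entering the stack (condensed from the patch below):

	static rx_handler_result_t
	virtnet_bypass_handle_frame(struct sk_buff **pskb)
	{
		struct sk_buff *skb = *pskb;

		/* deliver the frame as if it arrived on the bypass netdev */
		skb->dev = rcu_dereference(skb->dev->rx_handler_data);
		return RX_HANDLER_ANOTHER;
	}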

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> 
---
 drivers/net/virtio_net.c | 639 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 638 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bcd13fe906ca..14679806c1b1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,6 +30,7 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
 #include <net/route.h>
 #include <net/xdp.h>
 
@@ -147,6 +148,27 @@ struct receive_queue {
 	struct xdp_rxq_info xdp_rxq;
 };
 
+/* bypass state maintained when BACKUP feature is enabled */
+struct virtnet_bypass_info {
+	/* passthru netdev with same MAC */
+	struct net_device __rcu *active_netdev;
+
+	/* virtio_net netdev */
+	struct net_device __rcu *backup_netdev;
+
+	/* active netdev stats */
+	struct rtnl_link_stats64 active_stats;
+
+	/* backup netdev stats */
+	struct rtnl_link_stats64 backup_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 bypass_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
 struct virtnet_info {
 	struct virtio_device *vdev;
 	struct virtqueue *cvq;
@@ -206,6 +228,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* upper netdev created when BACKUP feature enabled */
+	struct net_device *bypass_netdev;
 };
 
 struct padded_vnet_hdr {
@@ -2255,6 +2280,11 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_features_check	= passthru_features_check,
 };
 
+static bool virtnet_bypass_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
 static void virtnet_config_changed_work(struct work_struct *work)
 {
 	struct virtnet_info *vi =
@@ -2647,6 +2677,601 @@ static int virtnet_validate(struct virtio_device *vdev)
 	return 0;
 }
 
+static void
+virtnet_bypass_child_open(struct net_device *dev,
+			  struct net_device *child_netdev)
+{
+	int err = dev_open(child_netdev);
+
+	if (err)
+		netdev_warn(dev, "unable to open slave: %s: %d\n",
+			    child_netdev->name, err);
+}
+
+static int virtnet_bypass_open(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	netif_carrier_off(dev);
+	netif_tx_wake_all_queues(dev);
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (child_netdev)
+		virtnet_bypass_child_open(dev, child_netdev);
+
+	child_netdev = rtnl_dereference(vbi->backup_netdev);
+	if (child_netdev)
+		virtnet_bypass_child_open(dev, child_netdev);
+
+	return 0;
+}
+
+static int virtnet_bypass_close(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	netif_tx_disable(dev);
+
+	child_netdev = rtnl_dereference(vi->active_netdev);
+	if (child_netdev)
+		dev_close(child_netdev);
+
+	child_netdev = rtnl_dereference(vi->backup_netdev);
+	if (child_netdev)
+		dev_close(child_netdev);
+
+	return 0;
+}
+
+static netdev_tx_t
+virtnet_bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+static netdev_tx_t
+virtnet_bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via active netdev followed by backup netdev */
+	xmit_dev = rcu_dereference_bh(vbi->active_netdev);
+	if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
+		if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
+			return virtnet_bypass_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+
+static u16
+virtnet_bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
+			    void *accel_priv, select_queue_fallback_t fallback)
+{
+	/* This helper function exists to help dev_pick_tx get the correct
+	 * destination queue.  Using a helper function skips a call to
+	 * skb_tx_hash and will put the skbs in the queue we expect on their
+	 * way down to the child netdev.
+	 */
+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+
+/* Fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * some drivers can provide 32bit values only.
+ */
+static void
+virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
+			  const struct rtnl_link_stats64 *_new,
+			  const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+static void
+virtnet_bypass_get_stats(struct net_device *dev,
+			 struct rtnl_link_stats64 *stats)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *child_netdev;
+
+	spin_lock(&vbi->stats_lock);
+	memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		new = dev_get_stats(child_netdev, &temp);
+		virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
+		memcpy(&vbi->active_stats, new, sizeof(*new));
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		new = dev_get_stats(child_netdev, &temp);
+		virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
+		memcpy(&vbi->backup_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
+	spin_unlock(&vbi->stats_lock);
+}
+
+static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+	int ret = 0;
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		ret = dev_set_mtu(child_netdev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		ret = dev_set_mtu(child_netdev, new_mtu);
+		if (ret)
+			netdev_err(child_netdev,
+				   "Unexpected failure to set mtu to %d\n",
+				   new_mtu);
+	}
+
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static const struct net_device_ops virtnet_bypass_netdev_ops = {
+	.ndo_open		= virtnet_bypass_open,
+	.ndo_stop		= virtnet_bypass_close,
+	.ndo_start_xmit		= virtnet_bypass_start_xmit,
+	.ndo_select_queue	= virtnet_bypass_select_queue,
+	.ndo_get_stats64	= virtnet_bypass_get_stats,
+	.ndo_change_mtu		= virtnet_bypass_change_mtu,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+static int
+virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
+					  struct ethtool_link_ksettings *cmd)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+		child_netdev = rtnl_dereference(vbi->backup_netdev);
+		if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(child_netdev, cmd);
+}
+
+#define BYPASS_DRV_NAME "virtnet_bypass"
+#define BYPASS_DRV_VERSION "0.1"
+
+static void
+virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
+				   struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
+	.get_drvinfo            = virtnet_bypass_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = virtnet_bypass_ethtool_get_link_ksettings,
+};
+
+static struct net_device *
+get_virtnet_bypass_bymac(struct net *net, const u8 *mac)
+{
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(net, dev) {
+		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+			continue;       /* not a virtnet_bypass device */
+
+		if (ether_addr_equal(mac, dev->perm_addr))
+			return dev;
+	}
+
+	return NULL;
+}
+
+static struct net_device *
+get_virtnet_bypass_byref(struct net_device *child_netdev)
+{
+	struct net *net = dev_net(child_netdev);
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(net, dev) {
+		struct virtnet_bypass_info *vbi;
+
+		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+			continue;       /* not a virtnet_bypass device */
+
+		vbi = netdev_priv(dev);
+
+		if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
+		    (rtnl_dereference(vbi->backup_netdev) == child_netdev))
+			return dev;	/* a match */
+	}
+
+	return NULL;
+}
+
+/* Called when child dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+	skb->dev = ndev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+static int virtnet_bypass_register_child(struct net_device *child_netdev)
+{
+	struct virtnet_bypass_info *vbi;
+	struct net_device *dev;
+	bool backup;
+	int ret;
+
+	if (child_netdev->addr_len != ETH_ALEN)
+		return NOTIFY_DONE;
+
+	/* We will use the MAC address to locate the virtnet_bypass netdev
+	 * to associate with the child netdev. If we don't find a matching
+	 * bypass netdev, move on.
+	 */
+	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
+				       child_netdev->perm_addr);
+	if (!dev)
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+	backup = (child_netdev->dev.parent == dev->dev.parent);
+	if (backup ? rtnl_dereference(vbi->backup_netdev) :
+			rtnl_dereference(vbi->active_netdev)) {
+		netdev_info(dev,
+		  "%s attempting to join bypass dev when %s already present\n",
+			child_netdev->name,
+			backup ? "backup" : "active");
+		return NOTIFY_DONE;
+	}
+
+	ret = netdev_rx_handler_register(child_netdev,
+					 virtnet_bypass_handle_frame, dev);
+	if (ret != 0) {
+		netdev_err(child_netdev,
+		     "can not register bypass receive handler (err = %d)\n",
+			   ret);
+		goto rx_handler_failed;
+	}
+
+	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
+	if (ret != 0) {
+		netdev_err(child_netdev,
+			   "can not set master device %s (err = %d)\n",
+			   dev->name, ret);
+		goto upper_link_failed;
+	}
+
+	child_netdev->flags |= IFF_SLAVE;
+
+	if (netif_running(dev)) {
+		ret = dev_open(child_netdev);
+		if (ret && (ret != -EBUSY)) {
+			netdev_err(dev, "Opening child %s failed ret:%d\n",
+				   child_netdev->name, ret);
+			goto err_interface_up;
+		}
+	}
+
+	/* Align MTU of child with master */
+	ret = dev_set_mtu(child_netdev, dev->mtu);
+	if (ret) {
+		netdev_err(dev,
+			   "unable to change mtu of %s to %u, aborting registration\n",
+			   child_netdev->name, dev->mtu);
+		goto err_set_mtu;
+	}
+
+	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
+
+	netdev_info(dev, "registering %s\n", child_netdev->name);
+
+	dev_hold(child_netdev);
+	if (backup) {
+		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
+		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
+	} else {
+		rcu_assign_pointer(vbi->active_netdev, child_netdev);
+		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
+		dev->min_mtu = child_netdev->min_mtu;
+		dev->max_mtu = child_netdev->max_mtu;
+	}
+
+	return NOTIFY_OK;
+
+err_set_mtu:
+	dev_close(child_netdev);
+err_interface_up:
+	netdev_upper_dev_unlink(child_netdev, dev);
+	child_netdev->flags &= ~IFF_SLAVE;
+upper_link_failed:
+	netdev_rx_handler_unregister(child_netdev);
+rx_handler_failed:
+	return NOTIFY_DONE;
+}
+
+static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
+{
+	struct virtnet_bypass_info *vbi;
+	struct net_device *dev, *backup;
+
+	dev = get_virtnet_bypass_byref(child_netdev);
+	if (!dev)
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+
+	netdev_info(dev, "unregistering %s\n", child_netdev->name);
+
+	netdev_rx_handler_unregister(child_netdev);
+	netdev_upper_dev_unlink(child_netdev, dev);
+	child_netdev->flags &= ~IFF_SLAVE;
+
+	if (child_netdev->dev.parent == dev->dev.parent) {
+		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
+	} else {
+		RCU_INIT_POINTER(vbi->active_netdev, NULL);
+		backup = rtnl_dereference(vbi->backup_netdev);
+		if (backup) {
+			dev->min_mtu = backup->min_mtu;
+			dev->max_mtu = backup->max_mtu;
+		}
+	}
+
+	dev_put(child_netdev);
+
+	return NOTIFY_OK;
+}
+
+static int virtnet_bypass_update_link(struct net_device *child_netdev)
+{
+	struct net_device *dev, *active, *backup;
+	struct virtnet_bypass_info *vbi;
+
+	dev = get_virtnet_bypass_byref(child_netdev);
+	if (!dev || !netif_running(dev))
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+
+	active = rtnl_dereference(vbi->active_netdev);
+	backup = rtnl_dereference(vbi->backup_netdev);
+
+	if ((active && virtnet_bypass_xmit_ready(active)) ||
+	    (backup && virtnet_bypass_xmit_ready(backup))) {
+		netif_carrier_on(dev);
+		netif_tx_wake_all_queues(dev);
+	} else {
+		netif_carrier_off(dev);
+		netif_tx_stop_all_queues(dev);
+	}
+
+	return NOTIFY_OK;
+}
+
+static int
+virtnet_bypass_event(struct notifier_block *this, unsigned long event,
+		     void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip our own events */
+	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
+		return NOTIFY_DONE;
+
+	/* Avoid non-Ethernet type devices */
+	if (event_dev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	/* Avoid Vlan dev with same MAC registering as child dev */
+	if (is_vlan_dev(event_dev))
+		return NOTIFY_DONE;
+
+	/* Avoid Bonding master dev with same MAC registering as child dev */
+	if ((event_dev->priv_flags & IFF_BONDING) &&
+	    (event_dev->flags & IFF_MASTER))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return virtnet_bypass_register_child(event_dev);
+	case NETDEV_UNREGISTER:
+		return virtnet_bypass_unregister_child(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return virtnet_bypass_update_link(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block virtnet_bypass_notifier = {
+	.notifier_call = virtnet_bypass_event,
+};
+
+static int virtnet_bypass_create(struct virtnet_info *vi)
+{
+	struct net_device *backup_netdev = vi->dev;
+	struct device *dev = &vi->vdev->dev;
+	struct net_device *bypass_netdev;
+	int res;
+
+	/* We need at least 2 queues; for now we go with 16, assuming
+	 * that most devices being bonded won't have too many queues.
+	 */
+	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
+					  16);
+	if (!bypass_netdev) {
+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
+		return -ENOMEM;
+	}
+
+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
+	SET_NETDEV_DEV(bypass_netdev, dev);
+
+	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
+	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
+
+	/* Initialize the device options */
+	bypass_netdev->flags |= IFF_MASTER;
+	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
+				     IFF_NO_QUEUE;
+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
+	bypass_netdev->features |= NETIF_F_LLTX;
+
+	/* Don't allow bypass devices to change network namespaces. */
+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
+
+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	bypass_netdev->features |= bypass_netdev->hw_features;
+
+	/* For now treat bypass netdev as VLAN challenged since we
+	 * cannot assume VLAN functionality with a VF
+	 */
+	bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
+
+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
+	       bypass_netdev->addr_len);
+
+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
+
+	res = register_netdev(bypass_netdev);
+	if (res < 0) {
+		dev_err(dev, "Unable to register bypass_netdev!\n");
+		free_netdev(bypass_netdev);
+		return res;
+	}
+
+	netif_carrier_off(bypass_netdev);
+
+	vi->bypass_netdev = bypass_netdev;
+
+	/* Change the name template of the backup interface to 'vbkup%d'.
+	 * We may need to revisit naming later, but this gets it out
+	 * of the way for now.
+	 */
+	strcpy(backup_netdev->name, "vbkup%d");
+
+	return 0;
+}
+
+static void virtnet_bypass_destroy(struct virtnet_info *vi)
+{
+	struct net_device *bypass_netdev = vi->bypass_netdev;
+	struct virtnet_bypass_info *vbi;
+	struct net_device *child_netdev;
+
+	/* no device found, nothing to free */
+	if (!bypass_netdev)
+		return;
+
+	vbi = netdev_priv(bypass_netdev);
+
+	netif_device_detach(bypass_netdev);
+
+	rtnl_lock();
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (child_netdev)
+		virtnet_bypass_unregister_child(child_netdev);
+
+	child_netdev = rtnl_dereference(vbi->backup_netdev);
+	if (child_netdev)
+		virtnet_bypass_unregister_child(child_netdev);
+
+	unregister_netdevice(bypass_netdev);
+
+	rtnl_unlock();
+
+	free_netdev(bypass_netdev);
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err = -ENOMEM;
@@ -2797,10 +3422,15 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
+		if (virtnet_bypass_create(vi) != 0)
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_bypass;
 	}
 
 	virtio_device_ready(vdev);
@@ -2837,6 +3467,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_bypass:
+	virtnet_bypass_destroy(vi);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2871,6 +3503,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	virtnet_bypass_destroy(vi);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
@@ -2968,6 +3602,8 @@ static __init int virtio_net_driver_init(void)
         ret = register_virtio_driver(&virtio_net_driver);
 	if (ret)
 		goto err_virtio;
+
+	register_netdevice_notifier(&virtnet_bypass_notifier);
 	return 0;
 err_virtio:
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
@@ -2980,6 +3616,7 @@ module_init(virtio_net_driver_init);
 
 static __exit void virtio_net_driver_exit(void)
 {
+	unregister_netdevice_notifier(&virtnet_bypass_notifier);
 	unregister_virtio_driver(&virtio_net_driver);
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
 	cpuhp_remove_multi_state(virtionet_online);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available
  2018-02-16 18:11 ` [virtio-dev] " Sridhar Samudrala
                   ` (2 preceding siblings ...)
  (?)
@ 2018-02-16 18:11 ` Sridhar Samudrala
  -1 siblings, 0 replies; 121+ messages in thread
From: Sridhar Samudrala @ 2018-02-16 18:11 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

When BACKUP feature is enabled, an additional netdev(bypass netdev) is
created that acts as a master device and tracks the state of the 2 lower
netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
passthru device with the same MAC is registered as 'active' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> 
---
 drivers/net/virtio_net.c | 639 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 638 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bcd13fe906ca..14679806c1b1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,6 +30,7 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
 #include <net/route.h>
 #include <net/xdp.h>
 
@@ -147,6 +148,27 @@ struct receive_queue {
 	struct xdp_rxq_info xdp_rxq;
 };
 
+/* bypass state maintained when BACKUP feature is enabled */
+struct virtnet_bypass_info {
+	/* passthru netdev with same MAC */
+	struct net_device __rcu *active_netdev;
+
+	/* virtio_net netdev */
+	struct net_device __rcu *backup_netdev;
+
+	/* active netdev stats */
+	struct rtnl_link_stats64 active_stats;
+
+	/* backup netdev stats */
+	struct rtnl_link_stats64 backup_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 bypass_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
 struct virtnet_info {
 	struct virtio_device *vdev;
 	struct virtqueue *cvq;
@@ -206,6 +228,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* upper netdev created when BACKUP feature enabled */
+	struct net_device *bypass_netdev;
 };
 
 struct padded_vnet_hdr {
@@ -2255,6 +2280,11 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_features_check	= passthru_features_check,
 };
 
+static bool virtnet_bypass_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
 static void virtnet_config_changed_work(struct work_struct *work)
 {
 	struct virtnet_info *vi =
@@ -2647,6 +2677,601 @@ static int virtnet_validate(struct virtio_device *vdev)
 	return 0;
 }
 
+static void
+virtnet_bypass_child_open(struct net_device *dev,
+			  struct net_device *child_netdev)
+{
+	int err = dev_open(child_netdev);
+
+	if (err)
+		netdev_warn(dev, "unable to open slave: %s: %d\n",
+			    child_netdev->name, err);
+}
+
+static int virtnet_bypass_open(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	netif_carrier_off(dev);
+	netif_tx_wake_all_queues(dev);
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (child_netdev)
+		virtnet_bypass_child_open(dev, child_netdev);
+
+	child_netdev = rtnl_dereference(vbi->backup_netdev);
+	if (child_netdev)
+		virtnet_bypass_child_open(dev, child_netdev);
+
+	return 0;
+}
+
+static int virtnet_bypass_close(struct net_device *dev)
+{
+	struct virtnet_bypass_info *vi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	netif_tx_disable(dev);
+
+	child_netdev = rtnl_dereference(vi->active_netdev);
+	if (child_netdev)
+		dev_close(child_netdev);
+
+	child_netdev = rtnl_dereference(vi->backup_netdev);
+	if (child_netdev)
+		dev_close(child_netdev);
+
+	return 0;
+}
+
+static netdev_tx_t
+virtnet_bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+static netdev_tx_t
+virtnet_bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via active netdev followed by backup netdev */
+	xmit_dev = rcu_dereference_bh(vbi->active_netdev);
+	if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
+		if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
+			return virtnet_bypass_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+
+static u16
+virtnet_bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
+			    void *accel_priv, select_queue_fallback_t fallback)
+{
+	/* This helper function exists to help dev_pick_tx get the correct
+	 * destination queue.  Using a helper function skips a call to
+	 * skb_tx_hash and will put the skbs in the queue we expect on their
+	 * way down to the bonding driver.
+	 */
+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void
+virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
+			  const struct rtnl_link_stats64 *_new,
+			  const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+static void
+virtnet_bypass_get_stats(struct net_device *dev,
+			 struct rtnl_link_stats64 *stats)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *child_netdev;
+
+	spin_lock(&vbi->stats_lock);
+	memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		new = dev_get_stats(child_netdev, &temp);
+		virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
+		memcpy(&vbi->active_stats, new, sizeof(*new));
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		new = dev_get_stats(child_netdev, &temp);
+		virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
+		memcpy(&vbi->backup_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
+	spin_unlock(&vbi->stats_lock);
+}
+
+static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+	int ret = 0;
+
+	child_netdev = rcu_dereference(vbi->active_netdev);
+	if (child_netdev) {
+		ret = dev_set_mtu(child_netdev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	child_netdev = rcu_dereference(vbi->backup_netdev);
+	if (child_netdev) {
+		ret = dev_set_mtu(child_netdev, new_mtu);
+		if (ret)
+			netdev_err(child_netdev,
+				   "Unexpected failure to set mtu to %d\n",
+				   new_mtu);
+	}
+
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static const struct net_device_ops virtnet_bypass_netdev_ops = {
+	.ndo_open		= virtnet_bypass_open,
+	.ndo_stop		= virtnet_bypass_close,
+	.ndo_start_xmit		= virtnet_bypass_start_xmit,
+	.ndo_select_queue	= virtnet_bypass_select_queue,
+	.ndo_get_stats64	= virtnet_bypass_get_stats,
+	.ndo_change_mtu		= virtnet_bypass_change_mtu,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+static int
+virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
+					  struct ethtool_link_ksettings *cmd)
+{
+	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct net_device *child_netdev;
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+		child_netdev = rtnl_dereference(vbi->backup_netdev);
+		if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(child_netdev, cmd);
+}
+
+#define BYPASS_DRV_NAME "virtnet_bypass"
+#define BYPASS_DRV_VERSION "0.1"
+
+static void
+virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
+				   struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
+	.get_drvinfo            = virtnet_bypass_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = virtnet_bypass_ethtool_get_link_ksettings,
+};
+
+static struct net_device *
+get_virtnet_bypass_bymac(struct net *net, const u8 *mac)
+{
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(net, dev) {
+		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+			continue;       /* not a virtnet_bypass device */
+
+		if (ether_addr_equal(mac, dev->perm_addr))
+			return dev;
+	}
+
+	return NULL;
+}
+
+static struct net_device *
+get_virtnet_bypass_byref(struct net_device *child_netdev)
+{
+	struct net *net = dev_net(child_netdev);
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(net, dev) {
+		struct virtnet_bypass_info *vbi;
+
+		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+			continue;       /* not a virtnet_bypass device */
+
+		vbi = netdev_priv(dev);
+
+		if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
+		    (rtnl_dereference(vbi->backup_netdev) == child_netdev))
+			return dev;	/* a match */
+	}
+
+	return NULL;
+}
+
+/* Called when child dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+	skb->dev = ndev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+static int virtnet_bypass_register_child(struct net_device *child_netdev)
+{
+	struct virtnet_bypass_info *vbi;
+	struct net_device *dev;
+	bool backup;
+	int ret;
+
+	if (child_netdev->addr_len != ETH_ALEN)
+		return NOTIFY_DONE;
+
+	/* We will use the MAC address to locate the virtnet_bypass netdev
+	 * to associate with the child netdev. If we don't find a matching
+	 * bypass netdev, move on.
+	 */
+	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
+				       child_netdev->perm_addr);
+	if (!dev)
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+	backup = (child_netdev->dev.parent == dev->dev.parent);
+	if (backup ? rtnl_dereference(vbi->backup_netdev) :
+			rtnl_dereference(vbi->active_netdev)) {
+		netdev_info(dev,
+		  "%s attempting to join bypass dev when %s already present\n",
+			child_netdev->name,
+			backup ? "backup" : "active");
+		return NOTIFY_DONE;
+	}
+
+	ret = netdev_rx_handler_register(child_netdev,
+					 virtnet_bypass_handle_frame, dev);
+	if (ret != 0) {
+		netdev_err(child_netdev,
+		     "can not register bypass receive handler (err = %d)\n",
+			   ret);
+		goto rx_handler_failed;
+	}
+
+	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
+	if (ret != 0) {
+		netdev_err(child_netdev,
+			   "can not set master device %s (err = %d)\n",
+			   dev->name, ret);
+		goto upper_link_failed;
+	}
+
+	child_netdev->flags |= IFF_SLAVE;
+
+	if (netif_running(dev)) {
+		ret = dev_open(child_netdev);
+		if (ret && (ret != -EBUSY)) {
+			netdev_err(dev, "Opening child %s failed ret:%d\n",
+				   child_netdev->name, ret);
+			goto err_interface_up;
+		}
+	}
+
+	/* Align MTU of child with master */
+	ret = dev_set_mtu(child_netdev, dev->mtu);
+	if (ret) {
+		netdev_err(dev,
+			   "unable to change mtu of %s to %u register failed\n",
+			   child_netdev->name, dev->mtu);
+		goto err_set_mtu;
+	}
+
+	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
+
+	netdev_info(dev, "registering %s\n", child_netdev->name);
+
+	dev_hold(child_netdev);
+	if (backup) {
+		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
+		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
+	} else {
+		rcu_assign_pointer(vbi->active_netdev, child_netdev);
+		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
+		dev->min_mtu = child_netdev->min_mtu;
+		dev->max_mtu = child_netdev->max_mtu;
+	}
+
+	return NOTIFY_OK;
+
+err_set_mtu:
+	dev_close(child_netdev);
+err_interface_up:
+	netdev_upper_dev_unlink(child_netdev, dev);
+	child_netdev->flags &= ~IFF_SLAVE;
+upper_link_failed:
+	netdev_rx_handler_unregister(child_netdev);
+rx_handler_failed:
+	return NOTIFY_DONE;
+}
+
+static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
+{
+	struct virtnet_bypass_info *vbi;
+	struct net_device *dev, *backup;
+
+	dev = get_virtnet_bypass_byref(child_netdev);
+	if (!dev)
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+
+	netdev_info(dev, "unregistering %s\n", child_netdev->name);
+
+	netdev_rx_handler_unregister(child_netdev);
+	netdev_upper_dev_unlink(child_netdev, dev);
+	child_netdev->flags &= ~IFF_SLAVE;
+
+	if (child_netdev->dev.parent == dev->dev.parent) {
+		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
+	} else {
+		RCU_INIT_POINTER(vbi->active_netdev, NULL);
+		backup = rtnl_dereference(vbi->backup_netdev);
+		if (backup) {
+			dev->min_mtu = backup->min_mtu;
+			dev->max_mtu = backup->max_mtu;
+		}
+	}
+
+	dev_put(child_netdev);
+
+	return NOTIFY_OK;
+}
+
+static int virtnet_bypass_update_link(struct net_device *child_netdev)
+{
+	struct net_device *dev, *active, *backup;
+	struct virtnet_bypass_info *vbi;
+
+	dev = get_virtnet_bypass_byref(child_netdev);
+	if (!dev || !netif_running(dev))
+		return NOTIFY_DONE;
+
+	vbi = netdev_priv(dev);
+
+	active = rtnl_dereference(vbi->active_netdev);
+	backup = rtnl_dereference(vbi->backup_netdev);
+
+	if ((active && virtnet_bypass_xmit_ready(active)) ||
+	    (backup && virtnet_bypass_xmit_ready(backup))) {
+		netif_carrier_on(dev);
+		netif_tx_wake_all_queues(dev);
+	} else {
+		netif_carrier_off(dev);
+		netif_tx_stop_all_queues(dev);
+	}
+
+	return NOTIFY_OK;
+}
+
+static int
+virtnet_bypass_event(struct notifier_block *this, unsigned long event,
+		     void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip our own events */
+	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
+		return NOTIFY_DONE;
+
+	/* Avoid non-Ethernet type devices */
+	if (event_dev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	/* Avoid Vlan dev with same MAC registering as child dev */
+	if (is_vlan_dev(event_dev))
+		return NOTIFY_DONE;
+
+	/* Avoid Bonding master dev with same MAC registering as child dev */
+	if ((event_dev->priv_flags & IFF_BONDING) &&
+	    (event_dev->flags & IFF_MASTER))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return virtnet_bypass_register_child(event_dev);
+	case NETDEV_UNREGISTER:
+		return virtnet_bypass_unregister_child(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return virtnet_bypass_update_link(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block virtnet_bypass_notifier = {
+	.notifier_call = virtnet_bypass_event,
+};
+
+static int virtnet_bypass_create(struct virtnet_info *vi)
+{
+	struct net_device *backup_netdev = vi->dev;
+	struct device *dev = &vi->vdev->dev;
+	struct net_device *bypass_netdev;
+	int res;
+
+	/* Alloc at least 2 queues, for now we are going with 16 assuming
+	 * that most devices being bonded won't have too many queues.
+	 */
+	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
+					  16);
+	if (!bypass_netdev) {
+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
+		return -ENOMEM;
+	}
+
+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
+	SET_NETDEV_DEV(bypass_netdev, dev);
+
+	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
+	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
+
+	/* Initialize the device options */
+	bypass_netdev->flags |= IFF_MASTER;
+	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
+				     IFF_NO_QUEUE;
+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
+	bypass_netdev->features |= NETIF_F_LLTX;
+
+	/* Don't allow bypass devices to change network namespaces. */
+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
+
+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	bypass_netdev->features |= bypass_netdev->hw_features;
+
+	/* For now treat bypass netdev as VLAN challenged since we
+	 * cannot assume VLAN functionality with a VF
+	 */
+	bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
+
+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
+	       bypass_netdev->addr_len);
+
+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
+
+	res = register_netdev(bypass_netdev);
+	if (res < 0) {
+		dev_err(dev, "Unable to register bypass_netdev!\n");
+		free_netdev(bypass_netdev);
+		return res;
+	}
+
+	netif_carrier_off(bypass_netdev);
+
+	vi->bypass_netdev = bypass_netdev;
+
+	/* Change the name of the backup interface to vbkup0
+	 * we may need to revisit naming later but this gets it out
+	 * of the way for now.
+	 */
+	strcpy(backup_netdev->name, "vbkup%d");
+
+	return 0;
+}
+
+static void virtnet_bypass_destroy(struct virtnet_info *vi)
+{
+	struct net_device *bypass_netdev = vi->bypass_netdev;
+	struct virtnet_bypass_info *vbi;
+	struct net_device *child_netdev;
+
+	/* no device found, nothing to free */
+	if (!bypass_netdev)
+		return;
+
+	vbi = netdev_priv(bypass_netdev);
+
+	netif_device_detach(bypass_netdev);
+
+	rtnl_lock();
+
+	child_netdev = rtnl_dereference(vbi->active_netdev);
+	if (child_netdev)
+		virtnet_bypass_unregister_child(child_netdev);
+
+	child_netdev = rtnl_dereference(vbi->backup_netdev);
+	if (child_netdev)
+		virtnet_bypass_unregister_child(child_netdev);
+
+	unregister_netdevice(bypass_netdev);
+
+	rtnl_unlock();
+
+	free_netdev(bypass_netdev);
+}
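
The teardown above relies on a standard locking rule: unregister_netdev()
acquires RTNL internally, so inside an rtnl_lock() section the lock-free
variant unregister_netdevice() must be used to avoid a deadlock. In sketch
form (a generic illustration, not code from this patch):

	rtnl_lock();
	unregister_netdevice(dev);	/* caller already holds RTNL */
	rtnl_unlock();

	unregister_netdev(other_dev);	/* takes and releases RTNL itself */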
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err = -ENOMEM;
@@ -2797,10 +3422,15 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
+		if (virtnet_bypass_create(vi) != 0)
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_bypass;
 	}
 
 	virtio_device_ready(vdev);
@@ -2837,6 +3467,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_bypass:
+	virtnet_bypass_destroy(vi);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2871,6 +3503,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	virtnet_bypass_destroy(vi);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
@@ -2968,6 +3602,8 @@ static __init int virtio_net_driver_init(void)
         ret = register_virtio_driver(&virtio_net_driver);
 	if (ret)
 		goto err_virtio;
+
+	register_netdevice_notifier(&virtnet_bypass_notifier);
 	return 0;
 err_virtio:
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
@@ -2980,6 +3616,7 @@ module_init(virtio_net_driver_init);
 
 static __exit void virtio_net_driver_exit(void)
 {
+	unregister_netdevice_notifier(&virtnet_bypass_notifier);
 	unregister_virtio_driver(&virtio_net_driver);
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
 	cpuhp_remove_multi_state(virtionet_online);
-- 
2.14.3



* [RFC PATCH v3 3/3] virtio_net: Enable alternate datapath without creating an additional netdev
  2018-02-16 18:11 ` [virtio-dev] " Sridhar Samudrala
@ 2018-02-16 18:11   ` Sridhar Samudrala
  0 siblings, 0 replies; 121+ messages in thread
From: Sridhar Samudrala @ 2018-02-16 18:11 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh

This patch addresses the issues seen with the 3 netdev model by
avoiding the creation of an additional netdev. Instead, the bypass state
information is tracked in the original netdev, and a different set of
ndo ops and ethtool ops is used when the BACKUP feature is enabled.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com> 
---
 drivers/net/virtio_net.c | 283 +++++++++++++++++------------------------------
 1 file changed, 101 insertions(+), 182 deletions(-)
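
In outline, the 2 netdev model keeps the one registered virtio netdev and
swaps its ops tables when the feature bit is negotiated, instead of
allocating a separate master device. A condensed sketch of the probe-time
wiring, using names from this patch but simplified, with error handling
omitted:

	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
		vi->vbi = kzalloc(sizeof(*vi->vbi), GFP_KERNEL);  /* bypass state */
		dev->netdev_ops  = &virtnet_bypass_netdev_ops;    /* wrap virtnet ops */
		dev->ethtool_ops = &virtnet_bypass_ethtool_ops;
	}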

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 14679806c1b1..c85b2949f151 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -154,7 +154,7 @@ struct virtnet_bypass_info {
 	struct net_device __rcu *active_netdev;
 
 	/* virtio_net netdev */
-	struct net_device __rcu *backup_netdev;
+	struct net_device *backup_netdev;
 
 	/* active netdev stats */
 	struct rtnl_link_stats64 active_stats;
@@ -229,8 +229,8 @@ struct virtnet_info {
 
 	unsigned long guest_offloads;
 
-	/* upper netdev created when BACKUP feature enabled */
-	struct net_device *bypass_netdev;
+	/* bypass state maintained when BACKUP feature is enabled */
+	struct virtnet_bypass_info *vbi;
 };
 
 struct padded_vnet_hdr {
@@ -2285,6 +2285,22 @@ static bool virtnet_bypass_xmit_ready(struct net_device *dev)
 	return netif_running(dev) && netif_carrier_ok(dev);
 }
 
+static bool virtnet_bypass_active_ready(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
+	struct net_device *active;
+
+	if (!vbi)
+		return false;
+
+	active = rcu_dereference(vbi->active_netdev);
+	if (!active || !virtnet_bypass_xmit_ready(active))
+		return false;
+
+	return true;
+}
+
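
Note that the rcu_dereference() in virtnet_bypass_active_ready() is only
legal inside an RCU read-side critical section (or under RTNL, via
rtnl_dereference()). A sketch of how a caller outside those contexts would
need to wrap it; this is an assumption about usage, not code from the patch:

	rcu_read_lock();
	up = virtnet_bypass_active_ready(dev);
	rcu_read_unlock();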
 static void virtnet_config_changed_work(struct work_struct *work)
 {
 	struct virtnet_info *vi =
@@ -2312,7 +2328,7 @@ static void virtnet_config_changed_work(struct work_struct *work)
 		virtnet_update_settings(vi);
 		netif_carrier_on(vi->dev);
 		netif_tx_wake_all_queues(vi->dev);
-	} else {
+	} else if (!virtnet_bypass_active_ready(vi->dev)) {
 		netif_carrier_off(vi->dev);
 		netif_tx_stop_all_queues(vi->dev);
 	}
@@ -2501,7 +2517,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 
 	if (vi->has_cvq) {
 		vi->cvq = vqs[total_vqs - 1];
-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN) &&
+		    !virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
 			vi->dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 	}
 
@@ -2690,62 +2707,54 @@ virtnet_bypass_child_open(struct net_device *dev,
 
 static int virtnet_bypass_open(struct net_device *dev)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
-
-	netif_carrier_off(dev);
-	netif_tx_wake_all_queues(dev);
+	int err;
 
 	child_netdev = rtnl_dereference(vbi->active_netdev);
 	if (child_netdev)
 		virtnet_bypass_child_open(dev, child_netdev);
 
-	child_netdev = rtnl_dereference(vbi->backup_netdev);
-	if (child_netdev)
-		virtnet_bypass_child_open(dev, child_netdev);
+	err = virtnet_open(dev);
+	if (err < 0) {
+		if (child_netdev)
+			dev_close(child_netdev);
+		return err;
+	}
 
 	return 0;
 }
 
 static int virtnet_bypass_close(struct net_device *dev)
 {
-	struct virtnet_bypass_info *vi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
 
-	netif_tx_disable(dev);
+	virtnet_close(dev);
 
-	child_netdev = rtnl_dereference(vi->active_netdev);
-	if (child_netdev)
-		dev_close(child_netdev);
+	if (!vbi)
+		goto done;
 
-	child_netdev = rtnl_dereference(vi->backup_netdev);
+	child_netdev = rtnl_dereference(vbi->active_netdev);
 	if (child_netdev)
 		dev_close(child_netdev);
 
+done:
 	return 0;
 }
 
-static netdev_tx_t
-virtnet_bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
-{
-	atomic_long_inc(&dev->tx_dropped);
-	dev_kfree_skb_any(skb);
-	return NETDEV_TX_OK;
-}
-
 static netdev_tx_t
 virtnet_bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *xmit_dev;
 
 	/* Try xmit via active netdev followed by backup netdev */
 	xmit_dev = rcu_dereference_bh(vbi->active_netdev);
-	if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
-		xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
-		if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
-			return virtnet_bypass_drop_xmit(skb, dev);
-	}
+	if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
+		return start_xmit(skb, dev);
 
 	skb->dev = xmit_dev;
 	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
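
The transmit policy thus reduces to "use the VF when it is up and running,
otherwise fall back to the virtio datapath". Distilled into a standalone
fragment (a simplified sketch, not literal patch code; the vbi lookup and
queue-mapping handling from the hunk above are assumed):

	vf = rcu_dereference_bh(vbi->active_netdev);
	if (vf && netif_running(vf) && netif_carrier_ok(vf)) {
		skb->dev = vf;			/* steer to the VF datapath */
		return dev_queue_xmit(skb);
	}
	return start_xmit(skb, dev);		/* virtio fallback */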
@@ -2810,7 +2819,8 @@ static void
 virtnet_bypass_get_stats(struct net_device *dev,
 			 struct rtnl_link_stats64 *stats)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	const struct rtnl_link_stats64 *new;
 	struct rtnl_link_stats64 temp;
 	struct net_device *child_netdev;
@@ -2827,12 +2837,10 @@ virtnet_bypass_get_stats(struct net_device *dev,
 		memcpy(&vbi->active_stats, new, sizeof(*new));
 	}
 
-	child_netdev = rcu_dereference(vbi->backup_netdev);
-	if (child_netdev) {
-		new = dev_get_stats(child_netdev, &temp);
-		virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
-		memcpy(&vbi->backup_stats, new, sizeof(*new));
-	}
+	memset(&temp, 0, sizeof(temp));
+	virtnet_stats(vbi->backup_netdev, &temp);
+	virtnet_bypass_fold_stats(stats, &temp, &vbi->backup_stats);
+	memcpy(&vbi->backup_stats, &temp, sizeof(temp));
 
 	rcu_read_unlock();
 
@@ -2842,7 +2850,8 @@ virtnet_bypass_get_stats(struct net_device *dev,
 
 static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
 	int ret = 0;
 
@@ -2853,15 +2862,6 @@ static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
 			return ret;
 	}
 
-	child_netdev = rcu_dereference(vbi->backup_netdev);
-	if (child_netdev) {
-		ret = dev_set_mtu(child_netdev, new_mtu);
-		if (ret)
-			netdev_err(child_netdev,
-				   "Unexpected failure to set mtu to %d\n",
-				   new_mtu);
-	}
-
 	dev->mtu = new_mtu;
 	return 0;
 }
@@ -2881,20 +2881,13 @@ static int
 virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
 					  struct ethtool_link_ksettings *cmd)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
 
 	child_netdev = rtnl_dereference(vbi->active_netdev);
-	if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
-		child_netdev = rtnl_dereference(vbi->backup_netdev);
-		if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
-			cmd->base.duplex = DUPLEX_UNKNOWN;
-			cmd->base.port = PORT_OTHER;
-			cmd->base.speed = SPEED_UNKNOWN;
-
-			return 0;
-		}
-	}
+	if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev))
+		return virtnet_get_link_ksettings(dev, cmd);
 
 	return __ethtool_get_link_ksettings(child_netdev, cmd);
 }
@@ -2944,14 +2937,15 @@ get_virtnet_bypass_byref(struct net_device *child_netdev)
 
 	for_each_netdev(net, dev) {
 		struct virtnet_bypass_info *vbi;
+		struct virtnet_info *vi;
 
 		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
 			continue;       /* not a virtnet_bypass device */
 
-		vbi = netdev_priv(dev);
+		vi = netdev_priv(dev);
+		vbi = vi->vbi;
 
-		if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
-		    (rtnl_dereference(vbi->backup_netdev) == child_netdev))
+		if (rtnl_dereference(vbi->active_netdev) == child_netdev)
 			return dev;	/* a match */
 	}
 
@@ -2974,9 +2968,9 @@ static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
 
 static int virtnet_bypass_register_child(struct net_device *child_netdev)
 {
+	struct net_device *dev, *active;
 	struct virtnet_bypass_info *vbi;
-	struct net_device *dev;
-	bool backup;
+	struct virtnet_info *vi;
 	int ret;
 
 	if (child_netdev->addr_len != ETH_ALEN)
@@ -2991,14 +2985,14 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 	if (!dev)
 		return NOTIFY_DONE;
 
-	vbi = netdev_priv(dev);
-	backup = (child_netdev->dev.parent == dev->dev.parent);
-	if (backup ? rtnl_dereference(vbi->backup_netdev) :
-			rtnl_dereference(vbi->active_netdev)) {
+	vi = netdev_priv(dev);
+	vbi = vi->vbi;
+
+	active = rtnl_dereference(vbi->active_netdev);
+	if (active) {
 		netdev_info(dev,
 		  "%s attempting to join bypass dev when %s already present\n",
-			child_netdev->name,
-			backup ? "backup" : "active");
+		  child_netdev->name, active->name);
 		return NOTIFY_DONE;
 	}
 
@@ -3030,7 +3024,7 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 		}
 	}
 
-	/* Align MTU of child with master */
+	/* Align MTU of child with virtio */
 	ret = dev_set_mtu(child_netdev, dev->mtu);
 	if (ret) {
 		netdev_err(dev,
@@ -3044,15 +3038,10 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 	netdev_info(dev, "registering %s\n", child_netdev->name);
 
 	dev_hold(child_netdev);
-	if (backup) {
-		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
-		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
-	} else {
-		rcu_assign_pointer(vbi->active_netdev, child_netdev);
-		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
-		dev->min_mtu = child_netdev->min_mtu;
-		dev->max_mtu = child_netdev->max_mtu;
-	}
+	rcu_assign_pointer(vbi->active_netdev, child_netdev);
+	dev_get_stats(vbi->active_netdev, &vbi->active_stats);
+	dev->min_mtu = child_netdev->min_mtu;
+	dev->max_mtu = child_netdev->max_mtu;
 
 	return NOTIFY_OK;
 
@@ -3070,13 +3059,15 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
 {
 	struct virtnet_bypass_info *vbi;
-	struct net_device *dev, *backup;
+	struct virtnet_info *vi;
+	struct net_device *dev;
 
 	dev = get_virtnet_bypass_byref(child_netdev);
 	if (!dev)
 		return NOTIFY_DONE;
 
-	vbi = netdev_priv(dev);
+	vi = netdev_priv(dev);
+	vbi = vi->vbi;
 
 	netdev_info(dev, "unregistering %s\n", child_netdev->name);
 
@@ -3084,41 +3075,35 @@ static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
 	netdev_upper_dev_unlink(child_netdev, dev);
 	child_netdev->flags &= ~IFF_SLAVE;
 
-	if (child_netdev->dev.parent == dev->dev.parent) {
-		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
-	} else {
-		RCU_INIT_POINTER(vbi->active_netdev, NULL);
-		backup = rtnl_dereference(vbi->backup_netdev);
-		if (backup) {
-			dev->min_mtu = backup->min_mtu;
-			dev->max_mtu = backup->max_mtu;
-		}
-	}
+	RCU_INIT_POINTER(vbi->active_netdev, NULL);
+	dev->min_mtu = MIN_MTU;
+	dev->max_mtu = MAX_MTU;
 
 	dev_put(child_netdev);
 
+	if (!(vi->status & VIRTIO_NET_S_LINK_UP)) {
+		netif_carrier_off(dev);
+		netif_tx_stop_all_queues(dev);
+	}
+
 	return NOTIFY_OK;
 }
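
The dev_hold()/dev_put() pairing around the active_netdev pointer follows
the usual RCU publish/retract idiom: pin the child before publishing the
pointer, retract the pointer before dropping the reference. A generic sketch
of the idiom (some users additionally wait for a grace period before the
final put; that step is shown as an option here, not as patch behavior):

	dev_hold(child);			/* pin before publish */
	rcu_assign_pointer(state->child, child);

	/* ...later, on teardown... */
	RCU_INIT_POINTER(state->child, NULL);	/* retract the pointer */
	synchronize_net();			/* optional: wait out readers */
	dev_put(child);				/* now safe to drop the ref */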
 
 static int virtnet_bypass_update_link(struct net_device *child_netdev)
 {
-	struct net_device *dev, *active, *backup;
-	struct virtnet_bypass_info *vbi;
+	struct virtnet_info *vi;
+	struct net_device *dev;
 
 	dev = get_virtnet_bypass_byref(child_netdev);
-	if (!dev || !netif_running(dev))
+	if (!dev)
 		return NOTIFY_DONE;
 
-	vbi = netdev_priv(dev);
-
-	active = rtnl_dereference(vbi->active_netdev);
-	backup = rtnl_dereference(vbi->backup_netdev);
+	vi = netdev_priv(dev);
 
-	if ((active && virtnet_bypass_xmit_ready(active)) ||
-	    (backup && virtnet_bypass_xmit_ready(backup))) {
+	if (virtnet_bypass_xmit_ready(child_netdev)) {
 		netif_carrier_on(dev);
 		netif_tx_wake_all_queues(dev);
-	} else {
+	} else if (!(vi->status & VIRTIO_NET_S_LINK_UP)) {
 		netif_carrier_off(dev);
 		netif_tx_stop_all_queues(dev);
 	}
@@ -3169,107 +3154,41 @@ static struct notifier_block virtnet_bypass_notifier = {
 
 static int virtnet_bypass_create(struct virtnet_info *vi)
 {
-	struct net_device *backup_netdev = vi->dev;
-	struct device *dev = &vi->vdev->dev;
-	struct net_device *bypass_netdev;
-	int res;
+	struct net_device *dev = vi->dev;
+	struct virtnet_bypass_info *vbi;
 
-	/* Alloc at least 2 queues, for now we are going with 16 assuming
-	 * that most devices being bonded won't have too many queues.
-	 */
-	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
-					  16);
-	if (!bypass_netdev) {
-		dev_err(dev, "Unable to allocate bypass_netdev!\n");
+	vbi = kzalloc(sizeof(*vbi), GFP_KERNEL);
+	if (!vbi)
 		return -ENOMEM;
-	}
-
-	dev_net_set(bypass_netdev, dev_net(backup_netdev));
-	SET_NETDEV_DEV(bypass_netdev, dev);
-
-	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
-	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
-
-	/* Initialize the device options */
-	bypass_netdev->flags |= IFF_MASTER;
-	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
-				     IFF_NO_QUEUE;
-	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
-				       IFF_TX_SKB_SHARING);
-
-	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
-	bypass_netdev->features |= NETIF_F_LLTX;
-
-	/* Don't allow bypass devices to change network namespaces. */
-	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
-
-	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
-				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
-				     NETIF_F_HIGHDMA | NETIF_F_LRO;
-
-	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
-	bypass_netdev->features |= bypass_netdev->hw_features;
-
-	/* For now treat bypass netdev as VLAN challenged since we
-	 * cannot assume VLAN functionality with a VF
-	 */
-	bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
-
-	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
-	       bypass_netdev->addr_len);
 
-	bypass_netdev->min_mtu = backup_netdev->min_mtu;
-	bypass_netdev->max_mtu = backup_netdev->max_mtu;
+	dev->netdev_ops = &virtnet_bypass_netdev_ops;
+	dev->ethtool_ops = &virtnet_bypass_ethtool_ops;
 
-	res = register_netdev(bypass_netdev);
-	if (res < 0) {
-		dev_err(dev, "Unable to register bypass_netdev!\n");
-		free_netdev(bypass_netdev);
-		return res;
-	}
-
-	netif_carrier_off(bypass_netdev);
-
-	vi->bypass_netdev = bypass_netdev;
-
-	/* Change the name of the backup interface to vbkup0
-	 * we may need to revisit naming later but this gets it out
-	 * of the way for now.
-	 */
-	strcpy(backup_netdev->name, "vbkup%d");
+	vbi->backup_netdev = dev;
+	virtnet_stats(vbi->backup_netdev, &vbi->backup_stats);
+	vi->vbi = vbi;
 
 	return 0;
 }
 
 static void virtnet_bypass_destroy(struct virtnet_info *vi)
 {
-	struct net_device *bypass_netdev = vi->bypass_netdev;
-	struct virtnet_bypass_info *vbi;
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
 
-	/* no device found, nothing to free */
-	if (!bypass_netdev)
+	if (!vbi)
 		return;
 
-	vbi = netdev_priv(bypass_netdev);
-
-	netif_device_detach(bypass_netdev);
-
 	rtnl_lock();
 
 	child_netdev = rtnl_dereference(vbi->active_netdev);
 	if (child_netdev)
 		virtnet_bypass_unregister_child(child_netdev);
 
-	child_netdev = rtnl_dereference(vbi->backup_netdev);
-	if (child_netdev)
-		virtnet_bypass_unregister_child(child_netdev);
-
-	unregister_netdevice(bypass_netdev);
-
 	rtnl_unlock();
 
-	free_netdev(bypass_netdev);
+	kfree(vbi);
+	vi->vbi = NULL;
 }
 
 static int virtnet_probe(struct virtio_device *vdev)
-- 
2.14.3



* [virtio-dev] [RFC PATCH v3 3/3] virtio_net: Enable alternate datapath without creating an additional netdev
@ 2018-02-16 18:11   ` Sridhar Samudrala
  0 siblings, 0 replies; 121+ messages in thread
From: Sridhar Samudrala @ 2018-02-16 18:11 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh

This patch addresses the issues seen with the 3 netdev model by
avoiding the creation of an additional netdev. Instead, the bypass state
information is tracked in the original netdev, and a different set of
ndo ops and ethtool ops is used when the BACKUP feature is enabled.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com> 
---
 drivers/net/virtio_net.c | 283 +++++++++++++++++------------------------------
 1 file changed, 101 insertions(+), 182 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 14679806c1b1..c85b2949f151 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -154,7 +154,7 @@ struct virtnet_bypass_info {
 	struct net_device __rcu *active_netdev;
 
 	/* virtio_net netdev */
-	struct net_device __rcu *backup_netdev;
+	struct net_device *backup_netdev;
 
 	/* active netdev stats */
 	struct rtnl_link_stats64 active_stats;
@@ -229,8 +229,8 @@ struct virtnet_info {
 
 	unsigned long guest_offloads;
 
-	/* upper netdev created when BACKUP feature enabled */
-	struct net_device *bypass_netdev;
+	/* bypass state maintained when BACKUP feature is enabled */
+	struct virtnet_bypass_info *vbi;
 };
 
 struct padded_vnet_hdr {
@@ -2285,6 +2285,22 @@ static bool virtnet_bypass_xmit_ready(struct net_device *dev)
 	return netif_running(dev) && netif_carrier_ok(dev);
 }
 
+static bool virtnet_bypass_active_ready(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
+	struct net_device *active;
+
+	if (!vbi)
+		return false;
+
+	active = rcu_dereference(vbi->active_netdev);
+	if (!active || !virtnet_bypass_xmit_ready(active))
+		return false;
+
+	return true;
+}
+
 static void virtnet_config_changed_work(struct work_struct *work)
 {
 	struct virtnet_info *vi =
@@ -2312,7 +2328,7 @@ static void virtnet_config_changed_work(struct work_struct *work)
 		virtnet_update_settings(vi);
 		netif_carrier_on(vi->dev);
 		netif_tx_wake_all_queues(vi->dev);
-	} else {
+	} else if (!virtnet_bypass_active_ready(vi->dev)) {
 		netif_carrier_off(vi->dev);
 		netif_tx_stop_all_queues(vi->dev);
 	}
@@ -2501,7 +2517,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 
 	if (vi->has_cvq) {
 		vi->cvq = vqs[total_vqs - 1];
-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN) &&
+		    !virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
 			vi->dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 	}
 
@@ -2690,62 +2707,54 @@ virtnet_bypass_child_open(struct net_device *dev,
 
 static int virtnet_bypass_open(struct net_device *dev)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
-
-	netif_carrier_off(dev);
-	netif_tx_wake_all_queues(dev);
+	int err;
 
 	child_netdev = rtnl_dereference(vbi->active_netdev);
 	if (child_netdev)
 		virtnet_bypass_child_open(dev, child_netdev);
 
-	child_netdev = rtnl_dereference(vbi->backup_netdev);
-	if (child_netdev)
-		virtnet_bypass_child_open(dev, child_netdev);
+	err = virtnet_open(dev);
+	if (err < 0) {
+		dev_close(child_netdev);
+		return err;
+	}
 
 	return 0;
 }
 
 static int virtnet_bypass_close(struct net_device *dev)
 {
-	struct virtnet_bypass_info *vi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
 
-	netif_tx_disable(dev);
+	virtnet_close(dev);
 
-	child_netdev = rtnl_dereference(vi->active_netdev);
-	if (child_netdev)
-		dev_close(child_netdev);
+	if (!vbi)
+		goto done;
 
-	child_netdev = rtnl_dereference(vi->backup_netdev);
+	child_netdev = rtnl_dereference(vbi->active_netdev);
 	if (child_netdev)
 		dev_close(child_netdev);
 
+done:
 	return 0;
 }
 
-static netdev_tx_t
-virtnet_bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
-{
-	atomic_long_inc(&dev->tx_dropped);
-	dev_kfree_skb_any(skb);
-	return NETDEV_TX_OK;
-}
-
 static netdev_tx_t
 virtnet_bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *xmit_dev;
 
 	/* Try xmit via active netdev followed by backup netdev */
 	xmit_dev = rcu_dereference_bh(vbi->active_netdev);
-	if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
-		xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
-		if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
-			return virtnet_bypass_drop_xmit(skb, dev);
-	}
+	if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
+		return start_xmit(skb, dev);
 
 	skb->dev = xmit_dev;
 	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
@@ -2810,7 +2819,8 @@ static void
 virtnet_bypass_get_stats(struct net_device *dev,
 			 struct rtnl_link_stats64 *stats)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	const struct rtnl_link_stats64 *new;
 	struct rtnl_link_stats64 temp;
 	struct net_device *child_netdev;
@@ -2827,12 +2837,10 @@ virtnet_bypass_get_stats(struct net_device *dev,
 		memcpy(&vbi->active_stats, new, sizeof(*new));
 	}
 
-	child_netdev = rcu_dereference(vbi->backup_netdev);
-	if (child_netdev) {
-		new = dev_get_stats(child_netdev, &temp);
-		virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
-		memcpy(&vbi->backup_stats, new, sizeof(*new));
-	}
+	memset(&temp, 0, sizeof(temp));
+	virtnet_stats(vbi->backup_netdev, &temp);
+	virtnet_bypass_fold_stats(stats, &temp, &vbi->backup_stats);
+	memcpy(&vbi->backup_stats, &temp, sizeof(temp));
 
 	rcu_read_unlock();
 
@@ -2842,7 +2850,8 @@ virtnet_bypass_get_stats(struct net_device *dev,
 
 static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
 	int ret = 0;
 
@@ -2853,15 +2862,6 @@ static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
 			return ret;
 	}
 
-	child_netdev = rcu_dereference(vbi->backup_netdev);
-	if (child_netdev) {
-		ret = dev_set_mtu(child_netdev, new_mtu);
-		if (ret)
-			netdev_err(child_netdev,
-				   "Unexpected failure to set mtu to %d\n",
-				   new_mtu);
-	}
-
 	dev->mtu = new_mtu;
 	return 0;
 }
@@ -2881,20 +2881,13 @@ static int
 virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
 					  struct ethtool_link_ksettings *cmd)
 {
-	struct virtnet_bypass_info *vbi = netdev_priv(dev);
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
 
 	child_netdev = rtnl_dereference(vbi->active_netdev);
-	if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
-		child_netdev = rtnl_dereference(vbi->backup_netdev);
-		if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
-			cmd->base.duplex = DUPLEX_UNKNOWN;
-			cmd->base.port = PORT_OTHER;
-			cmd->base.speed = SPEED_UNKNOWN;
-
-			return 0;
-		}
-	}
+	if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev))
+		return virtnet_get_link_ksettings(dev, cmd);
 
 	return __ethtool_get_link_ksettings(child_netdev, cmd);
 }
@@ -2944,14 +2937,15 @@ get_virtnet_bypass_byref(struct net_device *child_netdev)
 
 	for_each_netdev(net, dev) {
 		struct virtnet_bypass_info *vbi;
+		struct virtnet_info *vi;
 
 		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
 			continue;       /* not a virtnet_bypass device */
 
-		vbi = netdev_priv(dev);
+		vi = netdev_priv(dev);
+		vbi = vi->vbi;
 
-		if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
-		    (rtnl_dereference(vbi->backup_netdev) == child_netdev))
+		if (rtnl_dereference(vbi->active_netdev) == child_netdev)
 			return dev;	/* a match */
 	}
 
@@ -2974,9 +2968,9 @@ static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
 
 static int virtnet_bypass_register_child(struct net_device *child_netdev)
 {
+	struct net_device *dev, *active;
 	struct virtnet_bypass_info *vbi;
-	struct net_device *dev;
-	bool backup;
+	struct virtnet_info *vi;
 	int ret;
 
 	if (child_netdev->addr_len != ETH_ALEN)
@@ -2991,14 +2985,14 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 	if (!dev)
 		return NOTIFY_DONE;
 
-	vbi = netdev_priv(dev);
-	backup = (child_netdev->dev.parent == dev->dev.parent);
-	if (backup ? rtnl_dereference(vbi->backup_netdev) :
-			rtnl_dereference(vbi->active_netdev)) {
+	vi = netdev_priv(dev);
+	vbi = vi->vbi;
+
+	active = rtnl_dereference(vbi->active_netdev);
+	if (active) {
 		netdev_info(dev,
 		  "%s attempting to join bypass dev when %s already present\n",
-			child_netdev->name,
-			backup ? "backup" : "active");
+		  child_netdev->name, active->name);
 		return NOTIFY_DONE;
 	}
 
@@ -3030,7 +3024,7 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 		}
 	}
 
-	/* Align MTU of child with master */
+	/* Align MTU of child with virtio */
 	ret = dev_set_mtu(child_netdev, dev->mtu);
 	if (ret) {
 		netdev_err(dev,
@@ -3044,15 +3038,10 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 	netdev_info(dev, "registering %s\n", child_netdev->name);
 
 	dev_hold(child_netdev);
-	if (backup) {
-		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
-		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
-	} else {
-		rcu_assign_pointer(vbi->active_netdev, child_netdev);
-		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
-		dev->min_mtu = child_netdev->min_mtu;
-		dev->max_mtu = child_netdev->max_mtu;
-	}
+	rcu_assign_pointer(vbi->active_netdev, child_netdev);
+	dev_get_stats(vbi->active_netdev, &vbi->active_stats);
+	dev->min_mtu = child_netdev->min_mtu;
+	dev->max_mtu = child_netdev->max_mtu;
 
 	return NOTIFY_OK;
 
@@ -3070,13 +3059,15 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
 {
 	struct virtnet_bypass_info *vbi;
-	struct net_device *dev, *backup;
+	struct virtnet_info *vi;
+	struct net_device *dev;
 
 	dev = get_virtnet_bypass_byref(child_netdev);
 	if (!dev)
 		return NOTIFY_DONE;
 
-	vbi = netdev_priv(dev);
+	vi = netdev_priv(dev);
+	vbi = vi->vbi;
 
 	netdev_info(dev, "unregistering %s\n", child_netdev->name);
 
@@ -3084,41 +3075,35 @@ static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
 	netdev_upper_dev_unlink(child_netdev, dev);
 	child_netdev->flags &= ~IFF_SLAVE;
 
-	if (child_netdev->dev.parent == dev->dev.parent) {
-		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
-	} else {
-		RCU_INIT_POINTER(vbi->active_netdev, NULL);
-		backup = rtnl_dereference(vbi->backup_netdev);
-		if (backup) {
-			dev->min_mtu = backup->min_mtu;
-			dev->max_mtu = backup->max_mtu;
-		}
-	}
+	RCU_INIT_POINTER(vbi->active_netdev, NULL);
+	dev->min_mtu = MIN_MTU;
+	dev->max_mtu = MAX_MTU;
 
 	dev_put(child_netdev);
 
+	if (!(vi->status & VIRTIO_NET_S_LINK_UP)) {
+		netif_carrier_off(dev);
+		netif_tx_stop_all_queues(dev);
+	}
+
 	return NOTIFY_OK;
 }
 
 static int virtnet_bypass_update_link(struct net_device *child_netdev)
 {
-	struct net_device *dev, *active, *backup;
-	struct virtnet_bypass_info *vbi;
+	struct virtnet_info *vi;
+	struct net_device *dev;
 
 	dev = get_virtnet_bypass_byref(child_netdev);
-	if (!dev || !netif_running(dev))
+	if (!dev)
 		return NOTIFY_DONE;
 
-	vbi = netdev_priv(dev);
-
-	active = rtnl_dereference(vbi->active_netdev);
-	backup = rtnl_dereference(vbi->backup_netdev);
+	vi = netdev_priv(dev);
 
-	if ((active && virtnet_bypass_xmit_ready(active)) ||
-	    (backup && virtnet_bypass_xmit_ready(backup))) {
+	if (virtnet_bypass_xmit_ready(child_netdev)) {
 		netif_carrier_on(dev);
 		netif_tx_wake_all_queues(dev);
-	} else {
+	} else if (!(vi->status & VIRTIO_NET_S_LINK_UP)) {
 		netif_carrier_off(dev);
 		netif_tx_stop_all_queues(dev);
 	}
@@ -3169,107 +3154,41 @@ static struct notifier_block virtnet_bypass_notifier = {
 
 static int virtnet_bypass_create(struct virtnet_info *vi)
 {
-	struct net_device *backup_netdev = vi->dev;
-	struct device *dev = &vi->vdev->dev;
-	struct net_device *bypass_netdev;
-	int res;
+	struct net_device *dev = vi->dev;
+	struct virtnet_bypass_info *vbi;
 
-	/* Alloc at least 2 queues, for now we are going with 16 assuming
-	 * that most devices being bonded won't have too many queues.
-	 */
-	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
-					  16);
-	if (!bypass_netdev) {
-		dev_err(dev, "Unable to allocate bypass_netdev!\n");
+	vbi = kzalloc(sizeof(*vbi), GFP_KERNEL);
+	if (!vbi)
 		return -ENOMEM;
-	}
-
-	dev_net_set(bypass_netdev, dev_net(backup_netdev));
-	SET_NETDEV_DEV(bypass_netdev, dev);
-
-	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
-	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
-
-	/* Initialize the device options */
-	bypass_netdev->flags |= IFF_MASTER;
-	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
-				     IFF_NO_QUEUE;
-	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
-				       IFF_TX_SKB_SHARING);
-
-	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
-	bypass_netdev->features |= NETIF_F_LLTX;
-
-	/* Don't allow bypass devices to change network namespaces. */
-	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
-
-	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
-				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
-				     NETIF_F_HIGHDMA | NETIF_F_LRO;
-
-	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
-	bypass_netdev->features |= bypass_netdev->hw_features;
-
-	/* For now treat bypass netdev as VLAN challenged since we
-	 * cannot assume VLAN functionality with a VF
-	 */
-	bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
-
-	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
-	       bypass_netdev->addr_len);
 
-	bypass_netdev->min_mtu = backup_netdev->min_mtu;
-	bypass_netdev->max_mtu = backup_netdev->max_mtu;
+	dev->netdev_ops = &virtnet_bypass_netdev_ops;
+	dev->ethtool_ops = &virtnet_bypass_ethtool_ops;
 
-	res = register_netdev(bypass_netdev);
-	if (res < 0) {
-		dev_err(dev, "Unable to register bypass_netdev!\n");
-		free_netdev(bypass_netdev);
-		return res;
-	}
-
-	netif_carrier_off(bypass_netdev);
-
-	vi->bypass_netdev = bypass_netdev;
-
-	/* Change the name of the backup interface to vbkup0
-	 * we may need to revisit naming later but this gets it out
-	 * of the way for now.
-	 */
-	strcpy(backup_netdev->name, "vbkup%d");
+	vbi->backup_netdev = dev;
+	virtnet_stats(vbi->backup_netdev, &vbi->backup_stats);
+	vi->vbi = vbi;
 
 	return 0;
 }
 
 static void virtnet_bypass_destroy(struct virtnet_info *vi)
 {
-	struct net_device *bypass_netdev = vi->bypass_netdev;
-	struct virtnet_bypass_info *vbi;
+	struct virtnet_bypass_info *vbi = vi->vbi;
 	struct net_device *child_netdev;
 
-	/* no device found, nothing to free */
-	if (!bypass_netdev)
+	if (!vbi)
 		return;
 
-	vbi = netdev_priv(bypass_netdev);
-
-	netif_device_detach(bypass_netdev);
-
 	rtnl_lock();
 
 	child_netdev = rtnl_dereference(vbi->active_netdev);
 	if (child_netdev)
 		virtnet_bypass_unregister_child(child_netdev);
 
-	child_netdev = rtnl_dereference(vbi->backup_netdev);
-	if (child_netdev)
-		virtnet_bypass_unregister_child(child_netdev);
-
-	unregister_netdevice(bypass_netdev);
-
 	rtnl_unlock();
 
-	free_netdev(bypass_netdev);
+	kfree(vbi);
+	vi->vbi = NULL;
 }
 
 static int virtnet_probe(struct virtio_device *vdev)
-- 
2.14.3


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-16 18:11 ` [virtio-dev] " Sridhar Samudrala
                   ` (7 preceding siblings ...)
  (?)
@ 2018-02-17  2:38 ` Jakub Kicinski
  2018-02-17 17:12     ` [virtio-dev] " Alexander Duyck
  2018-02-17 17:12   ` Alexander Duyck
  -1 siblings, 2 replies; 121+ messages in thread
From: Jakub Kicinski @ 2018-02-17  2:38 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, jasowang, loseweigh

On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
> Patch 2 is in response to the community request for a 3 netdev
> solution.  However, it creates some issues we'll get into in a moment.
> It extends virtio_net to use alternate datapath when available and
> registered. When BACKUP feature is enabled, virtio_net driver creates
> an additional 'bypass' netdev that acts as a master device and controls
> 2 slave devices.  The original virtio_net netdev is registered as
> 'backup' netdev and a passthru/vf device with the same MAC gets
> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
> associated with the same 'pci' device.  The user accesses the network
> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
> as default for transmits when it is available with link up and running.

Thank you for doing this.

> We noticed a couple of issues with this approach during testing.
> - As both 'bypass' and 'backup' netdevs are associated with the same
>   virtio pci device, udev tries to rename both of them with the same name
>   and the 2nd rename will fail. This would be OK as long as the first netdev
>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>   to rename the 2 netdevs is not reliable. 

Out of curiosity - why do you link the master netdev to the virtio
struct device?

FWIW two solutions that immediately come to mind are to export "backup"
as phys_port_name of the backup virtio link and/or assign a name to the
master like you are doing already.  I think team uses team%d and bond
uses bond%d, soft naming of master devices seems quite natural in this
case.

IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
link is quite neat.
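
To make that concrete, here is a minimal sketch of what I mean. This is
purely illustrative, not code from the series; it assumes the driver can
simply test the negotiated BACKUP feature bit:

static int virtnet_get_phys_port_name(struct net_device *dev,
				      char *buf, size_t len)
{
	struct virtnet_info *vi = netdev_priv(dev);
	int ret;

	/* Only BACKUP-capable virtio links report a phys_port_name,
	 * so legacy devices keep today's naming behavior.
	 */
	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
		return -EOPNOTSUPP;

	ret = snprintf(buf, len, "backup");
	if (ret < 0 || ret >= len)
		return -EOPNOTSUPP;

	return 0;
}

Hooked up as .ndo_get_phys_port_name, that should be enough for a new
enough systemd/udev to derive a distinct name for the backup leg instead
of racing to rename two netdevs linked to the same parent device.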

> - When the 'active' netdev is unplugged OR not present on a destination
>   system after live migration, the user will see 2 virtio_net netdevs.

That's necessary and expected, all configuration applies to the master
so master must exist.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available
  2018-02-16 18:11   ` [virtio-dev] " Sridhar Samudrala
  (?)
@ 2018-02-17  3:04   ` Jakub Kicinski
  2018-02-17 17:41     ` Alexander Duyck
  -1 siblings, 1 reply; 121+ messages in thread
From: Jakub Kicinski @ 2018-02-17  3:04 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtualization, jesse.brandeburg,
	alexander.h.duyck, jasowang, loseweigh, Or Gerlitz

On Fri, 16 Feb 2018 10:11:21 -0800, Sridhar Samudrala wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
> 
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
> 
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
> 
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> 

> +static void
> +virtnet_bypass_get_stats(struct net_device *dev,
> +			 struct rtnl_link_stats64 *stats)
> +{
> +	struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +	const struct rtnl_link_stats64 *new;
> +	struct rtnl_link_stats64 temp;
> +	struct net_device *child_netdev;
> +
> +	spin_lock(&vbi->stats_lock);
> +	memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
> +
> +	rcu_read_lock();
> +
> +	child_netdev = rcu_dereference(vbi->active_netdev);
> +	if (child_netdev) {
> +		new = dev_get_stats(child_netdev, &temp);
> +		virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
> +		memcpy(&vbi->active_stats, new, sizeof(*new));
> +	}
> +
> +	child_netdev = rcu_dereference(vbi->backup_netdev);
> +	if (child_netdev) {
> +		new = dev_get_stats(child_netdev, &temp);
> +		virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
> +		memcpy(&vbi->backup_stats, new, sizeof(*new));
> +	}
> +
> +	rcu_read_unlock();
> +
> +	memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
> +	spin_unlock(&vbi->stats_lock);
> +}
> +
> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +	struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +	struct net_device *child_netdev;
> +	int ret = 0;
> +
> +	child_netdev = rcu_dereference(vbi->active_netdev);
> +	if (child_netdev) {
> +		ret = dev_set_mtu(child_netdev, new_mtu);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	child_netdev = rcu_dereference(vbi->backup_netdev);
> +	if (child_netdev) {
> +		ret = dev_set_mtu(child_netdev, new_mtu);
> +		if (ret)
> +			netdev_err(child_netdev,
> +				   "Unexpected failure to set mtu to %d\n",
> +				   new_mtu);

You should probably unwind if set fails on one of the legs.
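
Something along these lines, as a rough sketch only, reusing the names
from the patch (untested):

static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
{
	struct virtnet_bypass_info *vbi = netdev_priv(dev);
	struct net_device *active, *backup;
	int ret;

	/* ndo_change_mtu is called under RTNL */
	active = rtnl_dereference(vbi->active_netdev);
	if (active) {
		ret = dev_set_mtu(active, new_mtu);
		if (ret)
			return ret;
	}

	backup = rtnl_dereference(vbi->backup_netdev);
	if (backup) {
		ret = dev_set_mtu(backup, new_mtu);
		if (ret) {
			/* unwind the active leg so the two legs and
			 * the master never diverge
			 */
			if (active)
				dev_set_mtu(active, dev->mtu);
			return ret;
		}
	}

	dev->mtu = new_mtu;
	return 0;
}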

> +	}
> +
> +	dev->mtu = new_mtu;
> +	return 0;
> +}

nit: stats, mtu, all those mundane things are implemented in team
     already.  If we had this as kernel-internal team mode we wouldn't
     have to reimplement them...  You probably did investigate that
     option, for my edification, would you mind saying what the
     challenges/downsides were?

> +static struct net_device *
> +get_virtnet_bypass_bymac(struct net *net, const u8 *mac)
> +{
> +	struct net_device *dev;
> +
> +	ASSERT_RTNL();
> +
> +	for_each_netdev(net, dev) {
> +		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +			continue;       /* not a virtnet_bypass device */

Is there anything inherently wrong with enslaving another virtio dev
now?  I was expecting something like a hash map to map MAC addr ->
master and then one can check if dev is already enslaved to that master.
Just a random thought, I'm probably missing something...
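
Roughly what I was picturing, purely illustrative (adding and removing
entries when masters are created/destroyed is omitted):

#include <linux/etherdevice.h>
#include <linux/hashtable.h>
#include <linux/jhash.h>

struct bypass_entry {
	struct hlist_node node;
	u8 mac[ETH_ALEN];
	struct net_device *master;
};

/* global table keyed by the master's permanent MAC */
static DEFINE_HASHTABLE(bypass_masters, 8);

static struct net_device *bypass_master_by_mac(const u8 *mac)
{
	struct bypass_entry *e;

	hash_for_each_possible(bypass_masters, e, node,
			       jhash(mac, ETH_ALEN, 0))
		if (ether_addr_equal(e->mac, mac))
			return e->master;

	return NULL;
}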

> +		if (ether_addr_equal(mac, dev->perm_addr))
> +			return dev;
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct net_device *
> +get_virtnet_bypass_byref(struct net_device *child_netdev)
> +{
> +	struct net *net = dev_net(child_netdev);
> +	struct net_device *dev;
> +
> +	ASSERT_RTNL();
> +
> +	for_each_netdev(net, dev) {
> +		struct virtnet_bypass_info *vbi;
> +
> +		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +			continue;       /* not a virtnet_bypass device */
> +
> +		vbi = netdev_priv(dev);
> +
> +		if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
> +		    (rtnl_dereference(vbi->backup_netdev) == child_netdev))

nit: parens not needed

> +			return dev;	/* a match */
> +	}
> +
> +	return NULL;
> +}

> +static int virtnet_bypass_create(struct virtnet_info *vi)
> +{
> +	struct net_device *backup_netdev = vi->dev;
> +	struct device *dev = &vi->vdev->dev;
> +	struct net_device *bypass_netdev;
> +	int res;
> +
> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
> +	 * that most devices being bonded won't have too many queues.
> +	 */
> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
> +					  16);
> +	if (!bypass_netdev) {
> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
> +		return -ENOMEM;
> +	}

Maybe it's just me but referring to master as bypass seems slightly
confusing.  I know you don't like team and bond, but perhaps we can
come up with a better name?  For me bypass device is the other leg,
i.e. the VF, not the master.  Perhaps others disagree.

> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
> +	SET_NETDEV_DEV(bypass_netdev, dev);
> +
> +	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
> +	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;

Thanks!

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-17  2:38 ` Jakub Kicinski
@ 2018-02-17 17:12     ` Alexander Duyck
  2018-02-17 17:12   ` Alexander Duyck
  1 sibling, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-17 17:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Sridhar Samudrala, Michael S. Tsirkin, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jason Wang, Siwei Liu

On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>> Patch 2 is in response to the community request for a 3 netdev
>> solution.  However, it creates some issues we'll get into in a moment.
>> It extends virtio_net to use alternate datapath when available and
>> registered. When BACKUP feature is enabled, virtio_net driver creates
>> an additional 'bypass' netdev that acts as a master device and controls
>> 2 slave devices.  The original virtio_net netdev is registered as
>> 'backup' netdev and a passthru/vf device with the same MAC gets
>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>> associated with the same 'pci' device.  The user accesses the network
>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>> as default for transmits when it is available with link up and running.
>
> Thank you for doing this.
>
>> We noticed a couple of issues with this approach during testing.
>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>   virtio pci device, udev tries to rename both of them with the same name
>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>   to rename the 2 netdevs is not reliable.
>
> Out of curiosity - why do you link the master netdev to the virtio
> struct device?

The basic idea of all this is that we wanted this to work with an
existing VM image that was using virtio. As such we were trying to
make it so that the bypass interface takes the place of the original
virtio and get udev to rename the bypass to what the original
virtio_net was.

> FWIW two solutions that immediately come to mind are to export "backup"
> as phys_port_name of the backup virtio link and/or assign a name to the
> master like you are doing already.  I think team uses team%d and bond
> uses bond%d, soft naming of master devices seems quite natural in this
> case.

I figured I had overlooked something like that. Thanks for pointing
this out. Okay so I think the phys_port_name approach might resolve
the original issue. If I am reading things correctly what we end up
with is the master showing up as "ens1" for example and the backup
showing up as "ens1nbackup". Am I understanding that right?

The problem with the team/bond%d approach is that it creates a new
netdevice and so it would require guest configuration changes.

> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> link is quite neat.

I agree. For non-"backup" virtio_net devices, would it be okay for us to
just return -EOPNOTSUPP? I assume it would be and that way the legacy
behavior could be maintained although the function still exists.

>> - When the 'active' netdev is unplugged OR not present on a destination
>>   system after live migration, the user will see 2 virtio_net netdevs.
>
> That's necessary and expected, all configuration applies to the master
> so master must exist.

With the naming issue resolved this is the only item left outstanding.
This becomes a matter of form vs function.

The main complaint about the "3 netdev" solution is that it is a bit
confusing to have the 2 netdevs present if the VF isn't there. The idea
is that having the extra "master" netdev there if there isn't really a
bond is a bit ugly.

The downside of the "2 netdev" solution is that you have to deal with
an extra layer of locking/queueing to get to the VF, and you lose some
functionality since things like in-driver XDP have to be disabled in
order to maintain the same behavior whether the VF is present or
not. However, it looks more like classic virtio_net when the VF is not
present.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available
  2018-02-17  3:04   ` Jakub Kicinski
@ 2018-02-17 17:41     ` Alexander Duyck
  0 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-17 17:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Duyck, Alexander H, Or Gerlitz, Michael S. Tsirkin,
	Sridhar Samudrala, virtualization, Siwei Liu, Netdev,
	David Miller

On Fri, Feb 16, 2018 at 7:04 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Fri, 16 Feb 2018 10:11:21 -0800, Sridhar Samudrala wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>
>> +static void
>> +virtnet_bypass_get_stats(struct net_device *dev,
>> +                      struct rtnl_link_stats64 *stats)
>> +{
>> +     struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +     const struct rtnl_link_stats64 *new;
>> +     struct rtnl_link_stats64 temp;
>> +     struct net_device *child_netdev;
>> +
>> +     spin_lock(&vbi->stats_lock);
>> +     memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
>> +
>> +     rcu_read_lock();
>> +
>> +     child_netdev = rcu_dereference(vbi->active_netdev);
>> +     if (child_netdev) {
>> +             new = dev_get_stats(child_netdev, &temp);
>> +             virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
>> +             memcpy(&vbi->active_stats, new, sizeof(*new));
>> +     }
>> +
>> +     child_netdev = rcu_dereference(vbi->backup_netdev);
>> +     if (child_netdev) {
>> +             new = dev_get_stats(child_netdev, &temp);
>> +             virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
>> +             memcpy(&vbi->backup_stats, new, sizeof(*new));
>> +     }
>> +
>> +     rcu_read_unlock();
>> +
>> +     memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
>> +     spin_unlock(&vbi->stats_lock);
>> +}
>> +
>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +     struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> +     struct net_device *child_netdev;
>> +     int ret = 0;
>> +
>> +     child_netdev = rcu_dereference(vbi->active_netdev);
>> +     if (child_netdev) {
>> +             ret = dev_set_mtu(child_netdev, new_mtu);
>> +             if (ret)
>> +                     return ret;
>> +     }
>> +
>> +     child_netdev = rcu_dereference(vbi->backup_netdev);
>> +     if (child_netdev) {
>> +             ret = dev_set_mtu(child_netdev, new_mtu);
>> +             if (ret)
>> +                     netdev_err(child_netdev,
>> +                                "Unexpected failure to set mtu to %d\n",
>> +                                new_mtu);
>
> You should probably unwind if set fails on one of the legs.

Actually, if we know that the backup is always going to be a virtio
device, I don't know if we even need to worry about the call failing.
Last I knew, virtio_net doesn't implement ndo_change_mtu, so I don't
think it is an issue. Unless a notifier blows up about it somewhere, I
don't think there is anything that should prevent us from updating the
MTU.

One interesting thing we may want to take a look at would be to tweak
the ordering of things based on whether we are increasing or decreasing
the MTU. In the case of an increase we need to work from the bottom up,
but in the case of a decrease I wonder if we shouldn't be working from
the top down in order to guarantee we don't create an MTU mismatch
somewhere in the path.
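
To make the ordering concrete, something like the sketch below is what
I have in mind. It only handles the active leg and leaves the unwind
question aside, so treat it as an illustration rather than a proposal:

static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
{
	struct virtnet_bypass_info *vbi = netdev_priv(dev);
	struct net_device *child = rtnl_dereference(vbi->active_netdev);
	bool grow = new_mtu > dev->mtu;
	int ret;

	/* Increase: bottom up, so the child can already carry the
	 * larger frames before the master starts emitting them.
	 */
	if (child && grow) {
		ret = dev_set_mtu(child, new_mtu);
		if (ret)
			return ret;
	}

	dev->mtu = new_mtu;

	/* Decrease: top down, the master stops emitting large frames
	 * before the child is shrunk, so no hop in the path ever sees
	 * a frame bigger than its MTU.
	 */
	if (child && !grow) {
		ret = dev_set_mtu(child, new_mtu);
		if (ret)
			return ret;
	}

	return 0;
}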

>> +     }
>> +
>> +     dev->mtu = new_mtu;
>> +     return 0;
>> +}
>
> nit: stats, mtu, all those mundane things are implemented in team
>      already.  If we had this as kernel-internal team mode we wouldn't
>      have to reimplement them...  You probably did investigate that
>      option, for my edification, would you mind saying what the
>      challenges/downsides were?

So I tried working with the bonding driver to get down to what we
needed. The issue is that there are so many controls exposed that
trying to pull them out and produce a simple bond became very
difficult. It was just much easier to start over than to take an
existing interface and pare it down to just what we needed. What may
make more sense in the future is to create a lib or some sort of file
in net/core/ that consolidates the core functionality of these types
of devices, while leaving the user-space interfaces, debugfs, ioctls,
and such out of it and driver specific.

>> +static struct net_device *
>> +get_virtnet_bypass_bymac(struct net *net, const u8 *mac)
>> +{
>> +     struct net_device *dev;
>> +
>> +     ASSERT_RTNL();
>> +
>> +     for_each_netdev(net, dev) {
>> +             if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
>> +                     continue;       /* not a virtnet_bypass device */
>
> Is there anything inherently wrong with enslaving another virtio dev
> now?  I was expecting something like a hash map to map MAC addr ->
> master and then one can check if dev is already enslaved to that master.
> Just a random thought, I'm probably missing something...

This isn't about enslaving; this is just finding the master device.
Basically the virtnet_bypass uses a separate set of netdev ops, so we
just use that instead of maintaining a global or per-net hash.
You could have two virtio devices enslaved by the same virtnet_bypass
and it shouldn't be an issue.

>> +             if (ether_addr_equal(mac, dev->perm_addr))
>> +                     return dev;
>> +     }
>> +
>> +     return NULL;
>> +}
>> +
>> +static struct net_device *
>> +get_virtnet_bypass_byref(struct net_device *child_netdev)
>> +{
>> +     struct net *net = dev_net(child_netdev);
>> +     struct net_device *dev;
>> +
>> +     ASSERT_RTNL();
>> +
>> +     for_each_netdev(net, dev) {
>> +             struct virtnet_bypass_info *vbi;
>> +
>> +             if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
>> +                     continue;       /* not a virtnet_bypass device */
>> +
>> +             vbi = netdev_priv(dev);
>> +
>> +             if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
>> +                 (rtnl_dereference(vbi->backup_netdev) == child_netdev))
>
> nit: parens not needed

Yeah, it is a habit of mine since I do that for readability more than
anything else. Some people can't track the order of operations.

If they need to go they can.

>> +                     return dev;     /* a match */
>> +     }
>> +
>> +     return NULL;
>> +}
>
>> +static int virtnet_bypass_create(struct virtnet_info *vi)
>> +{
>> +     struct net_device *backup_netdev = vi->dev;
>> +     struct device *dev = &vi->vdev->dev;
>> +     struct net_device *bypass_netdev;
>> +     int res;
>> +
>> +     /* Alloc at least 2 queues, for now we are going with 16 assuming
>> +      * that most devices being bonded won't have too many queues.
>> +      */
>> +     bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>> +                                       16);
>> +     if (!bypass_netdev) {
>> +             dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> +             return -ENOMEM;
>> +     }
>
> Maybe it's just me but referring to master as bypass seems slightly
> confusing.  I know you don't like team and bond, but perhaps we can
> come up with a better name?  For me bypass device is the other leg,
> i.e. the VF, not the master.  Perhaps others disagree.

The choice of naming is based on some basic plumbing ideas. You could
almost think of the bypass netdev as more of a bypass valve. Basically
what we are doing is rerouting the traffic at the bypass interface and
sending it to the VF instead of routing it to the virtio_net. That was
why I thought it would be a fitting term. It gets us out of the
"bond", "team", and other such concepts because it isn't really any of
those. This is supposed to be a very simple device that will shunt the
traffic off of the virtio_net and re-route it through the VF. In
addition, this isn't a complete solution by itself, since there will
still be some configuration required on the host to take care of
applying filters on the PF and/or tap interface to make certain we
don't end up with any loops in the traffic and that receives are
directed to the VF.

>> +     dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> +     SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> +     bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>> +     bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>
> Thanks!

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-17 17:12     ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-02-19  6:11     ` Jakub Kicinski
  2018-02-20 16:26         ` [virtio-dev] " Samudrala, Sridhar
  2018-02-20 16:26       ` Samudrala, Sridhar
  -1 siblings, 2 replies; 121+ messages in thread
From: Jakub Kicinski @ 2018-02-19  6:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Sridhar Samudrala, virtualization, Siwei Liu, Netdev,
	David Miller

On Sat, 17 Feb 2018 09:12:01 -0800, Alexander Duyck wrote:
> >> We noticed a couple of issues with this approach during testing.
> >> - As both 'bypass' and 'backup' netdevs are associated with the same
> >>   virtio pci device, udev tries to rename both of them with the same name
> >>   and the 2nd rename will fail. This would be OK as long as the first netdev
> >>   to be renamed is the 'bypass' netdev, but the order in which udev gets
> >>   to rename the 2 netdevs is not reliable.  
> >
> > Out of curiosity - why do you link the master netdev to the virtio
> > struct device?  
> 
> The basic idea of all this is that we wanted this to work with an
> existing VM image that was using virtio. As such we were trying to
> make it so that the bypass interface takes the place of the original
> virtio and get udev to rename the bypass to what the original
> virtio_net was.

That makes sense.  Is it udev/naming that you're most concerned about
here?  I.e. what's the user space app that expects the netdev to be
linked?  This is just out of curiosity, the linking of netdevs to
devices is a bit of a PITA in the switchdev eswitch mode world, with
libvirt expecting only certain devices to be there.  Right now we're
not linking VF reprs, which breaks naming.  I wanted to revisit that.

> > FWIW two solutions that immediately come to mind is to export "backup"
> > as phys_port_name of the backup virtio link and/or assign a name to the
> > master like you are doing already.  I think team uses team%d and bond
> > uses bond%d, soft naming of master devices seems quite natural in this
> > case.  
> 
> I figured I had overlooked something like that. Thanks for pointing
> this out. Okay so I think the phys_port_name approach might resolve
> the original issue. If I am reading things correctly what we end up
> with is the master showing up as "ens1" for example and the backup
> showing up as "ens1nbackup". Am I understanding that right?

Yes, provided systemd is new enough.

> The problem with the team/bond%d approach is that it creates a new
> netdevice and so it would require guest configuration changes.
>
> > IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> > link is quite neat.  
> 
> I agree. For non-"backup" virtio_net devices would it be okay for us to
> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> behavior could be maintained although the function still exists.

That's my understanding too.
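
For reference, a minimal sketch of such an ndo (illustrative only; it
assumes the VIRTIO_NET_F_BACKUP bit from patch 1, and the function name
is hypothetical):

/* Illustrative sketch: report "backup" as the phys_port_name of a
 * virtio link that negotiated the BACKUP feature, so udev can derive a
 * distinct name (e.g. "ens1nbackup") for the lower netdev.  Devices
 * without the feature return -EOPNOTSUPP, preserving legacy naming.
 */
static int virtnet_get_phys_port_name(struct net_device *dev,
                                      char *buf, size_t len)
{
        struct virtnet_info *vi = netdev_priv(dev);
        int ret;

        if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
                return -EOPNOTSUPP;

        ret = snprintf(buf, len, "backup");
        if (ret < 0 || ret >= len)
                return -EOPNOTSUPP;

        return 0;
}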

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-16 18:11 ` [virtio-dev] " Sridhar Samudrala
                   ` (8 preceding siblings ...)
  (?)
@ 2018-02-20 10:42 ` Jiri Pirko
  2018-02-20 16:04     ` [virtio-dev] " Alexander Duyck
  2018-02-20 16:04   ` Alexander Duyck
  -1 siblings, 2 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-20 10:42 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, davem

Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>used by hypervisor to indicate that virtio_net interface should act as
>a backup for another device with the same MAC address.
>
>Patch 2 is in response to the community request for a 3 netdev
>solution.  However, it creates some issues we'll get into in a moment.
>It extends virtio_net to use alternate datapath when available and
>registered. When BACKUP feature is enabled, virtio_net driver creates
>an additional 'bypass' netdev that acts as a master device and controls
>2 slave devices.  The original virtio_net netdev is registered as
>'backup' netdev and a passthru/vf device with the same MAC gets
>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>associated with the same 'pci' device.  The user accesses the network
>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>as default for transmits when it is available with link up and running.

Sorry, but this is ridiculous. You are apparently re-implementing part
of the bonding driver as part of a NIC driver. The bond and team drivers
are mature solutions, well tested, broadly used, with lots of issues
resolved in the past. What you are trying to introduce is a weird shortcut
that already has a couple of issues, as you mentioned, and will certainly
have many more. Also, I'm pretty sure that in the future someone will come
up with ideas like multiple VFs, LACP and similar bonding things.

What is the reason for this abomination? According to:
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
the reason is quite weak.
A user in the VM sees 2 (or more) netdevices, puts them in a bond/team,
and that's it. This works now! If the VM lacks some userspace features,
let's fix them there! For example, MAC changes are something that could
be easily handled in the teamd userspace daemon.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 10:42 ` Jiri Pirko
@ 2018-02-20 16:04     ` Alexander Duyck
  2018-02-20 16:04   ` Alexander Duyck
  1 sibling, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-20 16:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Sridhar Samudrala, Michael S. Tsirkin, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski, Jason Wang, Siwei Liu

On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>used by hypervisor to indicate that virtio_net interface should act as
>>a backup for another device with the same MAC address.
>>
>>Patch 2 is in response to the community request for a 3 netdev
>>solution.  However, it creates some issues we'll get into in a moment.
>>It extends virtio_net to use alternate datapath when available and
>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>an additional 'bypass' netdev that acts as a master device and controls
>>2 slave devices.  The original virtio_net netdev is registered as
>>'backup' netdev and a passthru/vf device with the same MAC gets
>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>associated with the same 'pci' device.  The user accesses the network
>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>as default for transmits when it is available with link up and running.
>
> Sorry, but this is ridiculous. You are apparently re-implementing part
> of the bonding driver as part of a NIC driver. The bond and team drivers
> are mature solutions, well tested, broadly used, with lots of issues
> resolved in the past. What you are trying to introduce is a weird shortcut
> that already has a couple of issues, as you mentioned, and will certainly
> have many more. Also, I'm pretty sure that in the future someone will come
> up with ideas like multiple VFs, LACP and similar bonding things.

The problem with the bond and team drivers is that they are too large and
have too many interfaces available for configuration, so as a result
they can really screw this interface up.

Essentially this is meant to be a bond that is more-or-less managed by
the host, not the guest. We want the host to be able to configure it
and have it automatically kick in on the guest. For now we want to
avoid adding too much complexity as this is meant to be just the first
step. Trying to go in and implement the whole solution right from the
start based on existing drivers is going to be a massive time sink and
will likely never get completed due to the fact that there is always
going to be some other thing that will interfere.

My personal hope is that we can look at doing a virtio-bond sort of
device that will handle all this as well as providing a communication
channel, but that is much further down the road. For now we only have
a single bit so the goal for now is trying to keep this as simple as
possible.

> What is the reason for this abomination? According to:
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> the reason is quite weak.
> A user in the VM sees 2 (or more) netdevices, puts them in a bond/team,
> and that's it. This works now! If the VM lacks some userspace features,
> let's fix them there! For example, MAC changes are something that could
> be easily handled in the teamd userspace daemon.

I think you might have missed the point of this. This is meant to be a
simple interface so the guest should not be able to change the MAC
address, and it shouldn't require any userspace daemon to set up or
tear down. Ideally with this solution the virtio bypass will come up
and be assigned the name of the original virtio, and the "backup"
interface will come up and be assigned the name of the original virtio
with an additional "nbackup" tacked on via the phys_port_name, and
then whenever a VF is added it will automatically be enslaved by the
bypass interface, and it will be removed when the VF is hotplugged
out.
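
The plumbing for that auto-enslavement is roughly a netdevice notifier
along these lines (the helper names mirror the RFC patches, but their
bodies are elided here, so treat this as a sketch rather than the actual
code):

/* Sketch: watch netdev lifecycle events and attach/detach a child
 * whose permanent MAC matches a virtnet_bypass master.  The
 * register/unregister helpers, which do the MAC lookup and the actual
 * enslavement and return NOTIFY_* codes, are assumed from the RFC and
 * not shown here.
 */
static int virtnet_bypass_event(struct notifier_block *this,
                                unsigned long event, void *ptr)
{
        struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);

        /* ignore our own master devices */
        if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
                return NOTIFY_DONE;

        switch (event) {
        case NETDEV_REGISTER:
                return virtnet_bypass_register_child(event_dev);
        case NETDEV_UNREGISTER:
                return virtnet_bypass_unregister_child(event_dev);
        default:
                return NOTIFY_DONE;
        }
}

static struct notifier_block virtnet_bypass_notifier = {
        .notifier_call = virtnet_bypass_event,
};

/* registered once at module init with
 * register_netdevice_notifier(&virtnet_bypass_notifier)
 */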

In my mind the difference between this and bond or team is where the
configuration interface lies. In the case of bond it is in the kernel.
If my understanding is correct team is mostly in user space. With this
the configuration interface is really down in the hypervisor and
requests are communicated up to the guest. I would prefer not to make
virtio_net dependent on the bonding or team drivers, or worse yet a
userspace daemon in the guest. For now I would argue we should keep
this as simple as possible just to support basic live migration. There
have already been discussions of refactoring this after it is in, so
that we can start to combine the functionality here with what is there
in bonding/team, but the differences in configuration interface and
the size of the code bases will make it challenging to outright merge
this into something like that.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-19  6:11     ` Jakub Kicinski
@ 2018-02-20 16:26         ` Samudrala, Sridhar
  2018-02-20 16:26       ` Samudrala, Sridhar
  1 sibling, 0 replies; 121+ messages in thread
From: Samudrala, Sridhar @ 2018-02-20 16:26 UTC (permalink / raw)
  To: Jakub Kicinski, Alexander Duyck
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jason Wang, Siwei Liu

On 2/18/2018 10:11 PM, Jakub Kicinski wrote:
> On Sat, 17 Feb 2018 09:12:01 -0800, Alexander Duyck wrote:
>>>> We noticed a couple of issues with this approach during testing.
>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>    virtio pci device, udev tries to rename both of them with the same name
>>>>    and the 2nd rename will fail. This would be OK as long as the first netdev
>>>>    to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>    to rename the 2 netdevs is not reliable.
>>> Out of curiosity - why do you link the master netdev to the virtio
>>> struct device?
>> The basic idea of all this is that we wanted this to work with an
>> existing VM image that was using virtio. As such we were trying to
>> make it so that the bypass interface takes the place of the original
>> virtio and get udev to rename the bypass to what the original
>> virtio_net was.
> That makes sense.  Is it udev/naming that you're most concerned about
> here?  I.e. what's the user space app that expects the netdev to be
> linked?  This is just out of curiosity, the linking of netdevs to
> devices is a bit of a PITA in the switchdev eswitch mode world, with
> libvirt expecting only certain devices to be there.  Right now we're
> not linking VF reprs, which breaks naming.  I wanted to revisit that.

For the live migration usecase, userspace is only aware of one virtio_net
interface, and it doesn't expect it to be linked with any lower dev. So it
should be fine even if the lower netdev is not present. Only the master
netdev should be assigned the same name, so that userspace configuration
scripts in the VM don't need to change.

>
>>> FWIW two solutions that immediately come to mind is to export "backup"
>>> as phys_port_name of the backup virtio link and/or assign a name to the
>>> master like you are doing already.  I think team uses team%d and bond
>>> uses bond%d, soft naming of master devices seems quite natural in this
>>> case.
>> I figured I had overlooked something like that. Thanks for pointing
>> this out. Okay so I think the phys_port_name approach might resolve
>> the original issue. If I am reading things correctly what we end up
>> with is the master showing up as "ens1" for example and the backup
>> showing up as "ens1nbackup". Am I understanding that right?
> Yes, provided systemd is new enough.

Yes. I did a quick test to confirm that adding ndo_phys_port_name() to the
virtio_net ndo_ops fixes the udev naming issue with 2 virtio netdevs. This
is on Fedora 27.


>
>> The problem with the team/bond%d approach is that it creates a new
>> netdevice and so it would require guest configuration changes.
>>
>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>> link is quite neat.
>> I agree. For non-"backup" virtio_net devices would it be okay for us to
>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>> behavior could be maintained although the function still exists.
> That's my understanding too.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 16:04     ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-02-20 16:29     ` Jiri Pirko
  2018-02-20 17:14         ` [virtio-dev] " Samudrala, Sridhar
                         ` (2 more replies)
  -1 siblings, 3 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-20 16:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Siwei Liu,
	Netdev, David Miller

Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>used by hypervisor to indicate that virtio_net interface should act as
>>>a backup for another device with the same MAC address.
>>>
>>>Patch 2 is in response to the community request for a 3 netdev
>>>solution.  However, it creates some issues we'll get into in a moment.
>>>It extends virtio_net to use alternate datapath when available and
>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>an additional 'bypass' netdev that acts as a master device and controls
>>>2 slave devices.  The original virtio_net netdev is registered as
>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>associated with the same 'pci' device.  The user accesses the network
>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>as default for transmits when it is available with link up and running.
>>
>> Sorry, but this is ridiculous. You are apparently re-implementing part
>> of the bonding driver as part of a NIC driver. The bond and team drivers
>> are mature solutions, well tested, broadly used, with lots of issues
>> resolved in the past. What you are trying to introduce is a weird shortcut
>> that already has a couple of issues, as you mentioned, and will certainly
>> have many more. Also, I'm pretty sure that in the future someone will come
>> up with ideas like multiple VFs, LACP and similar bonding things.
>
>The problem with the bond and team drivers is that they are too large and
>have too many interfaces available for configuration, so as a result
>they can really screw this interface up.

What? Too large in which sense? Why is "too many interfaces" a problem?
Also, team has only one interface to userspace, team-generic-netlink.


>
>Essentially this is meant to be a bond that is more-or-less managed by
>the host, not the guest. We want the host to be able to configure it

How is it managed by the host? In your usecase the guest has 2 netdevs:
virtio_net and a pci vf.
I don't see how the host can do any managing of that, other than the
obvious. But still, the active/backup decision is made in the guest. This
is a simple bond/team usecase. As I said, something needs to be
implemented in userspace in order to handle the re-appearance of the vf
netdev. But that should be fairly easy to do in teamd.


>and have it automatically kick in on the guest. For now we want to
>avoid adding too much complexity as this is meant to be just the first

That's what I fear, "for now"..


>step. Trying to go in and implement the whole solution right from the
>start based on existing drivers is going to be a massive time sink and
>will likely never get completed due to the fact that there is always
>going to be some other thing that will interfere.

"implement the whole solution right from the start based on existing
drivers" - what solution are you talking about? I don't understand this
para.


>
>My personal hope is that we can look at doing a virtio-bond sort of
>device that will handle all this as well as providing a communication
>channel, but that is much further down the road. For now we only have
>a single bit so the goal for now is trying to keep this as simple as
>possible.

Oh. So there really is an intention to re-implement bonding in virtio.
That is plain wrong in my opinion.

Could you please just use bond/team and not reinvent the wheel with
this abomination?


>
>> What is the reason for this abomination? According to:
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> the reason is quite weak.
>> A user in the VM sees 2 (or more) netdevices, puts them in a bond/team,
>> and that's it. This works now! If the VM lacks some userspace features,
>> let's fix them there! For example, MAC changes are something that could
>> be easily handled in the teamd userspace daemon.
>
>I think you might have missed the point of this. This is meant to be a
>simple interface so the guest should not be able to change the MAC
>address, and it shouldn't require any userspace daemon to set up or
>tear down. Ideally with this solution the virtio bypass will come up
>and be assigned the name of the original virtio, and the "backup"
>interface will come up and be assigned the name of the original virtio
>with an additional "nbackup" tacked on via the phys_port_name, and
>then whenever a VF is added it will automatically be enslaved by the
>bypass interface, and it will be removed when the VF is hotplugged
>out.
>
>In my mind the difference between this and bond or team is where the
>configuration interface lies. In the case of bond it is in the kernel.
>If my understanding is correct team is mostly in user space. With this
>the configuration interface is really down in the hypervisor and
>requests are communicated up to the guest. I would prefer not to make
>virtio_net dependent on the bonding or team drivers, or worse yet a
>userspace daemon in the guest. For now I would argue we should keep
>this as simple as possible just to support basic live migration. There
>have already been discussions of refactoring this after it is in, so
>that we can start to combine the functionality here with what is there
>in bonding/team, but the differences in configuration interface and
>the size of the code bases will make it challenging to outright merge
>this into something like that.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 16:29     ` Jiri Pirko
@ 2018-02-20 17:14         ` Samudrala, Sridhar
  2018-02-20 17:14       ` Samudrala, Sridhar
  2018-02-20 17:23         ` [virtio-dev] " Alexander Duyck
  2 siblings, 0 replies; 121+ messages in thread
From: Samudrala, Sridhar @ 2018-02-20 17:14 UTC (permalink / raw)
  To: Jiri Pirko, Alexander Duyck
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski, Jason Wang, Siwei Liu

On 2/20/2018 8:29 AM, Jiri Pirko wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>> On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>> Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>> used by hypervisor to indicate that virtio_net interface should act as
>>>> a backup for another device with the same MAC address.
>>>>
>>>> Patch 2 is in response to the community request for a 3 netdev
>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>> It extends virtio_net to use alternate datapath when available and
>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>> associated with the same 'pci' device.  The user accesses the network
>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>> as default for transmits when it is available with link up and running.
>>> Sorry, but this is ridiculous. You are apparently re-implementing part
>>> of the bonding driver as part of a NIC driver. The bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you are trying to introduce is a weird shortcut
>>> that already has a couple of issues, as you mentioned, and will certainly
>>> have many more. Also, I'm pretty sure that in the future someone will come
>>> up with ideas like multiple VFs, LACP and similar bonding things.
>> The problem with the bond and team drivers is that they are too large and
>> have too many interfaces available for configuration, so as a result
>> they can really screw this interface up.
> What? Too large in which sense? Why is "too many interfaces" a problem?
> Also, team has only one interface to userspace, team-generic-netlink.
>
>
>> Essentially this is meant to be a bond that is more-or-less managed by
>> the host, not the guest. We want the host to be able to configure it
> How is it managed by the host? In your usecase the guest has 2 netdevs:
> virtio_net and a pci vf.
> I don't see how the host can do any managing of that, other than the
> obvious. But still, the active/backup decision is made in the guest. This
> is a simple bond/team usecase. As I said, something needs to be
> implemented in userspace in order to handle the re-appearance of the vf
> netdev. But that should be fairly easy to do in teamd.

The host manages the active/backup decision by
- assigning the same MAC address to both the VF and virtio interfaces
- setting a BACKUP feature bit on virtio that enables virtio to
  transparently take over the VF's datapath
- enabling only one datapath at any time so that packets don't get looped
  back
- during live migration, enabling the virtio datapath, unplugging the VF
  on the source, and replugging the VF on the destination

The VM is not expected to, and doesn't have any control over, setting the
MAC address or bringing the links up/down.

This is the model that is currently supported with the netvsc driver on Azure.
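
On the guest side, the BACKUP feature bit is the only control point; a
minimal sketch of the gating (virtnet_bypass_create() is the RFC's setup
routine from patch 2; this wrapper and its name are illustrative):

/* Illustrative sketch: the bypass machinery is set up only when the
 * hypervisor offers VIRTIO_NET_F_BACKUP; otherwise virtio_net probes
 * exactly as it does today.
 */
static int virtnet_maybe_setup_bypass(struct virtnet_info *vi)
{
        if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
                return 0;       /* legacy behavior, no bypass netdev */

        return virtnet_bypass_create(vi);
}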

>
>
>> and have it automatically kick in on the guest. For now we want to
>> avoid adding too much complexity as this is meant to be just the first
> That's what I fear, "for now"..
>
>
>> step. Trying to go in and implement the whole solution right from the
>> start based on existing drivers is going to be a massive time sink and
>> will likely never get completed due to the fact that there is always
>> going to be some other thing that will interfere.
> "implement the whole solution right from the start based on existing
> drivers" - what solution are you talking about? I don't understand this
> para.
>
>
>> My personal hope is that we can look at doing a virtio-bond sort of
>> device that will handle all this as well as providing a communication
>> channel, but that is much further down the road. For now we only have
>> a single bit so the goal for now is trying to keep this as simple as
>> possible.
> Oh. So there really is an intention to re-implement bonding in virtio.
> That is plain wrong in my opinion.
>
> Could you please just use bond/team and not reinvent the wheel with
> this abomination?



>
>>> What is the reason for this abomination? According to:
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>> the reason is quite weak.
>>> A user in the VM sees 2 (or more) netdevices, puts them in a bond/team,
>>> and that's it. This works now! If the VM lacks some userspace features,
>>> let's fix them there! For example, MAC changes are something that could
>>> be easily handled in the teamd userspace daemon.
>> I think you might have missed the point of this. This is meant to be a
>> simple interface so the guest should not be able to change the MAC
>> address, and it shouldn't require any userspace daemon to set up or
>> tear down. Ideally with this solution the virtio bypass will come up
>> and be assigned the name of the original virtio, and the "backup"
>> interface will come up and be assigned the name of the original virtio
>> with an additional "nbackup" tacked on via the phys_port_name, and
>> then whenever a VF is added it will automatically be enslaved by the
>> bypass interface, and it will be removed when the VF is hotplugged
>> out.
>>
>> In my mind the difference between this and bond or team is where the
>> configuration interface lies. In the case of bond it is in the kernel.
>> If my understanding is correct team is mostly in user space. With this
>> the configuration interface is really down in the hypervisor and
>> requests are communicated up to the guest. I would prefer not to make
>> virtio_net dependent on the bonding or team drivers, or worse yet a
>> userspace daemon in the guest. For now I would argue we should keep
>> this as simple as possible just to support basic live migration. There
>> have already been discussions of refactoring this after it is in, so
>> that we can start to combine the functionality here with what is there
>> in bonding/team, but the differences in configuration interface and
>> the size of the code bases will make it challenging to outright merge
>> this into something like that.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 16:29     ` Jiri Pirko
  2018-02-20 17:14         ` [virtio-dev] " Samudrala, Sridhar
@ 2018-02-20 17:14       ` Samudrala, Sridhar
  2018-02-20 17:23         ` [virtio-dev] " Alexander Duyck
  2 siblings, 0 replies; 121+ messages in thread
From: Samudrala, Sridhar @ 2018-02-20 17:14 UTC (permalink / raw)
  To: Jiri Pirko, Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Netdev, virtualization, Siwei Liu, David Miller

On 2/20/2018 8:29 AM, Jiri Pirko wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>> On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>> Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>> used by hypervisor to indicate that virtio_net interface should act as
>>>> a backup for another device with the same MAC address.
>>>>
>>>> Ppatch 2 is in response to the community request for a 3 netdev
>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>> It extends virtio_net to use alternate datapath when available and
>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>> associated with the same 'pci' device.  The user accesses the network
>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>> as default for transmits when it is available with link up and running.
>>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>>> of bonding driver as a part of NIC driver. Bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you try to introduce is a weird shortcut
>>> that already has couple of issues as you mentioned and will certanly
>>> have many more. Also, I'm pretty sure that in future, someone comes up
>>> with ideas like multiple VFs, LACP and similar bonding things.
>> The problem with the bond and team drivers is they are too large and
>> have too many interfaces available for configuration so as a result
>> they can really screw this interface up.
> What? Too large is which sense? Why "too many interfaces" is a problem?
> Also, team has only one interface to userspace team-generic-netlink.
>
>
>> Essentially this is meant to be a bond that is more-or-less managed by
>> the host, not the guest. We want the host to be able to configure it
> How is it managed by the host? In your usecase the guest has 2 netdevs:
> virtio_net, pci vf.
> I don't see how host can do any managing of that, other than the
> obvious. But still, the active/backup decision is done in guest. This is
> a simple bond/team usecase. As I said, there is something needed to be
> implemented in userspace in order to handle re-appear of vf netdev.
> But that should be fairly easy to do in teamd.

The host manages the active/backup decision by
- assigning the same MAC address to both VF and virtio interfaces
- setting a BACKUP feature bit on virtio that enables virtio to 
transparently take
   over the VFs datapath.
- only enable one datapath at anytime so that packets don't get looped back
- during live migration enable virtio datapth, unplug vf on the source 
and replug
   vf on the destination.

The VM is not expected and doesn't have any control of setting the MAC 
address
or bringing up/down the links.

This is the model that is currently supported with netvsc driver on Azure.

>
>
>> and have it automatically kick in on the guest. For now we want to
>> avoid adding too much complexity as this is meant to be just the first
> That's what I fear, "for now"..
>
>
>> step. Trying to go in and implement the whole solution right from the
>> start based on existing drivers is going to be a massive time sink and
>> will likely never get completed due to the fact that there is always
>> going to be some other thing that will interfere.
> "implement the whole solution right from the start based on existing
> drivers" - what solution are you talking about? I don't understand this
> para.
>
>
>> My personal hope is that we can look at doing a virtio-bond sort of
>> device that will handle all this as well as providing a communication
>> channel, but that is much further down the road. For now we only have
>> a single bit so the goal for now is trying to keep this as simple as
>> possible.
> Oh. So there is really intention to do re-implementation of bonding
> in virtio. That is plain-wrong in my opinion.
>
> Could you just use bond/team, please, and don't reinvent the wheel with
> this abomination?



>
>>> What is the reason for this abomination? According to:
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>> The reason is quite weak.
>>> User in the vm sees 2 (or more) netdevices, he puts them in bond/team
>>> and that's it. This works now! If the vm lacks some userspace features,
>>> let's fix it there! For example the MAC changes is something that could
>>> be easily handled in teamd userspace deamon.
>> I think you might have missed the point of this. This is meant to be a
>> simple interface so the guest should not be able to change the MAC
>> address, and it shouldn't require any userspace daemon to setup or
>> tear down. Ideally with this solution the virtio bypass will come up
>> and be assigned the name of the original virtio, and the "backup"
>> interface will come up and be assigned the name of the original virtio
>> with an additional "nbackup" tacked on via the phys_port_name, and
>> then whenever a VF is added it will automatically be enslaved by the
>> bypass interface, and it will be removed when the VF is hotplugged
>> out.
>>
>> In my mind the difference between this and bond or team is where the
>> configuration interface lies. In the case of bond it is in the kernel.
>> If my understanding is correct team is mostly in user space. With this
>> the configuration interface is really down in the hypervisor and
>> requests are communicated up to the guest. I would prefer not to make
>> virtio_net dependent on the bonding or team drivers, or worse yet a
>> userspace daemon in the guest. For now I would argue we should keep
>> this as simple as possible just to support basic live migration. There
>> has already been discussions of refactoring this after it is in so
>> that we can start to combine the functionality here with what is there
>> in bonding/team, but the differences in configuration interface and
>> the size of the code bases will make it challenging to outright merge
>> this into something like that.

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [virtio-dev] Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
@ 2018-02-20 17:14         ` Samudrala, Sridhar
  0 siblings, 0 replies; 121+ messages in thread
From: Samudrala, Sridhar @ 2018-02-20 17:14 UTC (permalink / raw)
  To: Jiri Pirko, Alexander Duyck
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jakub Kicinski, Jason Wang, Siwei Liu

On 2/20/2018 8:29 AM, Jiri Pirko wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>> On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>> Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>> used by hypervisor to indicate that virtio_net interface should act as
>>>> a backup for another device with the same MAC address.
>>>>
>>>> Ppatch 2 is in response to the community request for a 3 netdev
>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>> It extends virtio_net to use alternate datapath when available and
>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>> associated with the same 'pci' device.  The user accesses the network
>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>> as default for transmits when it is available with link up and running.
>>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>>> of bonding driver as a part of NIC driver. Bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you try to introduce is a weird shortcut
>>> that already has couple of issues as you mentioned and will certanly
>>> have many more. Also, I'm pretty sure that in future, someone comes up
>>> with ideas like multiple VFs, LACP and similar bonding things.
>> The problem with the bond and team drivers is they are too large and
>> have too many interfaces available for configuration so as a result
>> they can really screw this interface up.
> What? Too large is which sense? Why "too many interfaces" is a problem?
> Also, team has only one interface to userspace team-generic-netlink.
>
>
>> Essentially this is meant to be a bond that is more-or-less managed by
>> the host, not the guest. We want the host to be able to configure it
> How is it managed by the host? In your usecase the guest has 2 netdevs:
> virtio_net, pci vf.
> I don't see how host can do any managing of that, other than the
> obvious. But still, the active/backup decision is done in guest. This is
> a simple bond/team usecase. As I said, there is something needed to be
> implemented in userspace in order to handle re-appear of vf netdev.
> But that should be fairly easy to do in teamd.

The host manages the active/backup decision by
- assigning the same MAC address to both VF and virtio interfaces
- setting a BACKUP feature bit on virtio that enables virtio to 
transparently take
   over the VFs datapath.
- only enable one datapath at anytime so that packets don't get looped back
- during live migration enable virtio datapth, unplug vf on the source 
and replug
   vf on the destination.

The VM is not expected and doesn't have any control of setting the MAC 
address
or bringing up/down the links.

This is the model that is currently supported with netvsc driver on Azure.

>
>
>> and have it automatically kick in on the guest. For now we want to
>> avoid adding too much complexity as this is meant to be just the first
> That's what I fear, "for now"..
>
>
>> step. Trying to go in and implement the whole solution right from the
>> start based on existing drivers is going to be a massive time sink and
>> will likely never get completed due to the fact that there is always
>> going to be some other thing that will interfere.
> "implement the whole solution right from the start based on existing
> drivers" - what solution are you talking about? I don't understand this
> para.
>
>
>> My personal hope is that we can look at doing a virtio-bond sort of
>> device that will handle all this as well as providing a communication
>> channel, but that is much further down the road. For now we only have
>> a single bit so the goal for now is trying to keep this as simple as
>> possible.
> Oh. So there is really intention to do re-implementation of bonding
> in virtio. That is plain-wrong in my opinion.
>
> Could you just use bond/team, please, and don't reinvent the wheel with
> this abomination?



>
>>> What is the reason for this abomination? According to:
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>> The reason is quite weak.
>>> User in the vm sees 2 (or more) netdevices, he puts them in bond/team
>>> and that's it. This works now! If the vm lacks some userspace features,
>>> let's fix it there! For example the MAC changes is something that could
>>> be easily handled in teamd userspace deamon.
>> I think you might have missed the point of this. This is meant to be a
>> simple interface so the guest should not be able to change the MAC
>> address, and it shouldn't require any userspace daemon to setup or
>> tear down. Ideally with this solution the virtio bypass will come up
>> and be assigned the name of the original virtio, and the "backup"
>> interface will come up and be assigned the name of the original virtio
>> with an additional "nbackup" tacked on via the phys_port_name, and
>> then whenever a VF is added it will automatically be enslaved by the
>> bypass interface, and it will be removed when the VF is hotplugged
>> out.
>>
>> In my mind the difference between this and bond or team is where the
>> configuration interface lies. In the case of bond it is in the kernel.
>> If my understanding is correct team is mostly in user space. With this
>> the configuration interface is really down in the hypervisor and
>> requests are communicated up to the guest. I would prefer not to make
>> virtio_net dependent on the bonding or team drivers, or worse yet a
>> userspace daemon in the guest. For now I would argue we should keep
>> this as simple as possible just to support basic live migration. There
>> has already been discussions of refactoring this after it is in so
>> that we can start to combine the functionality here with what is there
>> in bonding/team, but the differences in configuration interface and
>> the size of the code bases will make it challenging to outright merge
>> this into something like that.



* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 16:29     ` Jiri Pirko
@ 2018-02-20 17:23         ` Alexander Duyck
  2018-02-20 17:14       ` Samudrala, Sridhar
  2018-02-20 17:23         ` [virtio-dev] " Alexander Duyck
  2 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-20 17:23 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Siwei Liu,
	Netdev, David Miller

On Tue, Feb 20, 2018 at 8:29 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>>used by hypervisor to indicate that virtio_net interface should act as
>>>>a backup for another device with the same MAC address.
>>>>
>>>>Ppatch 2 is in response to the community request for a 3 netdev
>>>>solution.  However, it creates some issues we'll get into in a moment.
>>>>It extends virtio_net to use alternate datapath when available and
>>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>an additional 'bypass' netdev that acts as a master device and controls
>>>>2 slave devices.  The original virtio_net netdev is registered as
>>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>associated with the same 'pci' device.  The user accesses the network
>>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>as default for transmits when it is available with link up and running.
>>>
>>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>>> of bonding driver as a part of NIC driver. Bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you try to introduce is a weird shortcut
>>> that already has couple of issues as you mentioned and will certanly
>>> have many more. Also, I'm pretty sure that in future, someone comes up
>>> with ideas like multiple VFs, LACP and similar bonding things.
>>
>>The problem with the bond and team drivers is they are too large and
>>have too many interfaces available for configuration so as a result
>>they can really screw this interface up.
>
> What? Too large is which sense? Why "too many interfaces" is a problem?
> Also, team has only one interface to userspace team-generic-netlink.

Specifically, I was working with bond. I had mostly overlooked team
since it requires an additional userspace daemon, which breaks our
requirement of no userspace intervention.

I was trying to focus on just doing an active/backup setup. The
problem is that bonding exposes debugfs, sysfs, and procfs interfaces
that we don't need or want. Adding an interface to exclude these would
just bloat the bonding driver, and leaving them in would be confusing
since they would all have to be ignored. In addition, the steps needed
to make the name come out the same as the original virtio interface
would further bloat bonding.

>>
>>Essentially this is meant to be a bond that is more-or-less managed by
>>the host, not the guest. We want the host to be able to configure it
>
> How is it managed by the host? In your usecase the guest has 2 netdevs:
> virtio_net, pci vf.
> I don't see how host can do any managing of that, other than the
> obvious. But still, the active/backup decision is done in guest. This is
> a simple bond/team usecase. As I said, there is something needed to be
> implemented in userspace in order to handle re-appear of vf netdev.
> But that should be fairly easy to do in teamd.
>
>
>>and have it automatically kick in on the guest. For now we want to
>>avoid adding too much complexity as this is meant to be just the first
>
> That's what I fear, "for now"..

I used the expression "for now" because I see this as the first stage
of a multi-stage process.

Step 1 is to get a basic virtio-bypass driver added to virtio so that
it is at least comparable to netvsc in terms of feature set and
enables basic network live migration.

Step 2 is adding some sort of dirty page tracking, preferably via
something like a paravirtual iommu interface. Once we have that we can
defer the eviction of the VF until the very last moment of the live
migration. For now I need to work on testing a modification that allows
mapping the entire guest as pass-through for DMA to the device, and
requires dynamic mapping for any DMA that is bidirectional or from the
device.

Step 3 will be to start looking at advanced configuration. That is
where we drop the implementation in step 1 and instead look at
spawning something that looks more like the team-type interface;
however, instead of working with a user-space daemon we would likely
need to work with some sort of mailbox or message queue coming up from
the hypervisor. Then we can start looking at doing things like passing
up blocks of eBPF code to handle Tx port selection or whatever we
need.

>
>>step. Trying to go in and implement the whole solution right from the
>>start based on existing drivers is going to be a massive time sink and
>>will likely never get completed due to the fact that there is always
>>going to be some other thing that will interfere.
>
> "implement the whole solution right from the start based on existing
> drivers" - what solution are you talking about? I don't understand this
> para.

You started mentioning much more complex configurations such as
multi-VF, LACP, and other such things. I fully own that this cannot
support those. My understanding is that the existing netvsc solution
cannot support anything like that either. The idea for now is to keep
this as simple as possible; among other things, that makes porting
this to other OSes much easier.

>>
>>My personal hope is that we can look at doing a virtio-bond sort of
>>device that will handle all this as well as providing a communication
>>channel, but that is much further down the road. For now we only have
>>a single bit so the goal for now is trying to keep this as simple as
>>possible.
>
> Oh. So there is really intention to do re-implementation of bonding
> in virtio. That is plain-wrong in my opinion.
>
> Could you just use bond/team, please, and don't reinvent the wheel with
> this abomination?

So I have a question for you. Why did you create the team driver? The
bonding code was already there and does almost exactly the same thing.
I would think it has to do with where things are managed. That is the
same situation we have with this.

I don't see this as something we can just fit into one of those two
drivers, for the same reason the bonding and team drivers are split:
we want to manage this interface somewhere else. What we probably need
to do is look at refactoring the code, since the control paths live in
different places for each of these drivers but much of the datapath is
the same. That is where I eventually see things going for the
"virtio-bond" interface I referenced, but for now this interface is
not that, since there isn't really any communication channel present
at all.
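
To make the shared-datapath point concrete, here is a minimal sketch of
the transmit path both drivers converge on; every name in it is an
assumption for illustration, not code from this patchset:

static netdev_tx_t bypass_start_xmit(struct sk_buff *skb,
				     struct net_device *dev)
{
	struct bypass_priv *priv = netdev_priv(dev);	/* hypothetical */
	struct net_device *slave;

	/* prefer the VF slave while it is running with carrier */
	slave = rcu_dereference_bh(priv->active_netdev);
	if (!slave || !netif_running(slave) || !netif_carrier_ok(slave))
		slave = rcu_dereference_bh(priv->backup_netdev);

	skb->dev = slave;
	return dev_queue_xmit(skb);
}

The control path that decides what gets enslaved is the part that
differs per driver, and that is the refactoring target.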

>>
>>> What is the reason for this abomination? According to:
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>> The reason is quite weak.
>>> User in the vm sees 2 (or more) netdevices, he puts them in bond/team
>>> and that's it. This works now! If the vm lacks some userspace features,
>>> let's fix it there! For example the MAC changes is something that could
>>> be easily handled in teamd userspace deamon.
>>
>>I think you might have missed the point of this. This is meant to be a
>>simple interface so the guest should not be able to change the MAC
>>address, and it shouldn't require any userspace daemon to setup or
>>tear down. Ideally with this solution the virtio bypass will come up
>>and be assigned the name of the original virtio, and the "backup"
>>interface will come up and be assigned the name of the original virtio
>>with an additional "nbackup" tacked on via the phys_port_name, and
>>then whenever a VF is added it will automatically be enslaved by the
>>bypass interface, and it will be removed when the VF is hotplugged
>>out.
>>
>>In my mind the difference between this and bond or team is where the
>>configuration interface lies. In the case of bond it is in the kernel.
>>If my understanding is correct team is mostly in user space. With this
>>the configuration interface is really down in the hypervisor and
>>requests are communicated up to the guest. I would prefer not to make
>>virtio_net dependent on the bonding or team drivers, or worse yet a
>>userspace daemon in the guest. For now I would argue we should keep
>>this as simple as possible just to support basic live migration. There
>>has already been discussions of refactoring this after it is in so
>>that we can start to combine the functionality here with what is there
>>in bonding/team, but the differences in configuration interface and
>>the size of the code bases will make it challenging to outright merge
>>this into something like that.

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 17:23         ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-02-20 19:53         ` Jiri Pirko
  -1 siblings, 0 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-20 19:53 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Siwei Liu,
	Netdev, David Miller

Tue, Feb 20, 2018 at 06:23:49PM CET, alexander.duyck@gmail.com wrote:
>On Tue, Feb 20, 2018 at 8:29 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>>>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>>>used by hypervisor to indicate that virtio_net interface should act as
>>>>>a backup for another device with the same MAC address.
>>>>>
>>>>>Ppatch 2 is in response to the community request for a 3 netdev
>>>>>solution.  However, it creates some issues we'll get into in a moment.
>>>>>It extends virtio_net to use alternate datapath when available and
>>>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>>an additional 'bypass' netdev that acts as a master device and controls
>>>>>2 slave devices.  The original virtio_net netdev is registered as
>>>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>>associated with the same 'pci' device.  The user accesses the network
>>>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>>as default for transmits when it is available with link up and running.
>>>>
>>>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>>>> of bonding driver as a part of NIC driver. Bond and team drivers
>>>> are mature solutions, well tested, broadly used, with lots of issues
>>>> resolved in the past. What you try to introduce is a weird shortcut
>>>> that already has couple of issues as you mentioned and will certanly
>>>> have many more. Also, I'm pretty sure that in future, someone comes up
>>>> with ideas like multiple VFs, LACP and similar bonding things.
>>>
>>>The problem with the bond and team drivers is they are too large and
>>>have too many interfaces available for configuration so as a result
>>>they can really screw this interface up.
>>
>> What? Too large is which sense? Why "too many interfaces" is a problem?
>> Also, team has only one interface to userspace team-generic-netlink.
>
>Specifically I was working with bond. I had overlooked team for the
>most part since it required an additional userspace daemon which
>basically broke our requirement of no user-space intervention.

Why? That sounds artificial. Why can't userspace be part of the
solution?


>
>I was trying to focus on just doing an active/backup setup. The
>problem is there are debugfs, sysfs, and procfs interfaces exposed
>that we don't need and/or want. Adding any sort of interface to
>exclude these would just bloat up the bonding driver, and leaving them
>in would just be confusing since they would all need to be ignored. In
>addition the steps needed to get the name to come out the same as the
>original virtio interface would just bloat up bonding.

Why do you care about the "name"? It's a netdev; isn't that all that matters?

The viewpoint of the user inside the vm boils down to:
1) I have 2 netdevs
2) One is preferred
3) I set up team on top of them

That should be it. It is the user's responsibility to do it this way.
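
For reference, step 3 is plain stock teamd. A minimal activebackup
config (interface names assumed) that prefers the VF and keeps it
sticky would look like:

{
	"device": "team0",
	"runner": { "name": "activebackup" },
	"link_watch": { "name": "ethtool" },
	"ports": {
		"eth0": { "prio": 100, "sticky": true },
		"eth1": { "prio": -10 }
	}
}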


>
>>>
>>>Essentially this is meant to be a bond that is more-or-less managed by
>>>the host, not the guest. We want the host to be able to configure it
>>
>> How is it managed by the host? In your usecase the guest has 2 netdevs:
>> virtio_net, pci vf.
>> I don't see how host can do any managing of that, other than the
>> obvious. But still, the active/backup decision is done in guest. This is
>> a simple bond/team usecase. As I said, there is something needed to be
>> implemented in userspace in order to handle re-appear of vf netdev.
>> But that should be fairly easy to do in teamd.
>>
>>
>>>and have it automatically kick in on the guest. For now we want to
>>>avoid adding too much complexity as this is meant to be just the first
>>
>> That's what I fear, "for now"..
>
>I used the expression "for now" as I see this being the first stage of
>a multi-stage process.

That is what I fear...


>
>Step 1 is to get a basic virtio-bypass driver added to virtio so that
>it is at least comparable to netvsc in terms of feature set and
>enables basic network live migration.
>
>Step 2 is adding some sort of dirty page tracking, preferably via
>something like a paravirtual iommu interface. Once we have that we can
>defer the eviction of the VF until the very last moment of the live
>migration. For now I need to work on testing a modification to allow
>mapping the entire guest as being pass-through for DMA to the device,
>and requiring dynamic for any DMA that is bidirectional or from the
>device.

That is purely on the host side. It does not really matter whether your
solution or standard bond/team is in use, right?


>
>Step 3 will be to start looking at advanced configuration. That is
>where we drop the implementation in step 1 and instead look at
>spawning something that looks more like the team type interface,
>however instead of working with a user-space daemon we would likely
>need to work with some sort of mailbox or message queue coming up from
>the hypervisor. Then we can start looking at doing things like passing
>up blocks of eBPF code to handle Tx port selection or whatever we
>need.

:O

>
>>
>>>step. Trying to go in and implement the whole solution right from the
>>>start based on existing drivers is going to be a massive time sink and
>>>will likely never get completed due to the fact that there is always
>>>going to be some other thing that will interfere.
>>
>> "implement the whole solution right from the start based on existing
>> drivers" - what solution are you talking about? I don't understand this
>> para.
>
>You started mentioning much more complex configurations such as
>multi-VF, LACP, and other such things. I fully own that this cannot
>support that. My understanding is that the netvsc solution that is out
>there cannot support anything like that either. The idea for now is to
>keep this as simple as possible. It makes things like the possibility
>of porting this to other OSes much easier.

An easier solution is team and teamd with minimal modifications to
make your usecase work. Btw, do you have the requirements for your
usecase written down somewhere, so we are on the same page?


>
>>>
>>>My personal hope is that we can look at doing a virtio-bond sort of
>>>device that will handle all this as well as providing a communication
>>>channel, but that is much further down the road. For now we only have
>>>a single bit so the goal for now is trying to keep this as simple as
>>>possible.
>>
>> Oh. So there is really intention to do re-implementation of bonding
>> in virtio. That is plain-wrong in my opinion.
>>
>> Could you just use bond/team, please, and don't reinvent the wheel with
>> this abomination?
>
>So I have a question for you. Why did you create the team driver? The
>bonding code was already there and does almost exactly the same thing.

Please do go down the git log memory lane. Team was introduced in 2011.
At that time bonding was not in good shape. I decided to rewrite it
with minimal parts in the kernel, so that the flexibility the user
needs can be handled in userspace. By the way, the usecase you are
trying to resolve with this patchset is something that can easily
benefit from the team driver's kernel-userspace architecture.


>I would think it has to do with where things are managed. That is the
>same situation we have with this.
>
>In my mind I don't see this something where we can just fit it into
>one of these two drivers because of the same reason the bonding and
>team drivers are split. We want to manage this interface somewhere
>else. In my mind what we probably need to do is look at refactoring
>the code since the control paths are in different locations for each
>of these drivers, but much of the datapath is the same. That is where

This is where you try to twist the universe, in my opinion. You want to
move the responsibilities of the user inside the guest to the user
outside it, and you use these weird mechanisms to do so. It feels very
wrong. Look at it from a non-virtualized point of view: it is as if the
HW configured the kernel that runs on it. It is supposed to be the
other way around.


>I see things going eventually for this "virtio-bond" interface I
>referenced, but for now this interface is not that since there isn't
>really any communication channel present at all.

That is exactly what I am afraid of. Thanks for confirming that this is
the final vision :/


>
>>>
>>>> What is the reason for this abomination? According to:
>>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>> The reason is quite weak.
>>>> User in the vm sees 2 (or more) netdevices, he puts them in bond/team
>>>> and that's it. This works now! If the vm lacks some userspace features,
>>>> let's fix it there! For example the MAC changes is something that could
>>>> be easily handled in teamd userspace deamon.
>>>
>>>I think you might have missed the point of this. This is meant to be a
>>>simple interface so the guest should not be able to change the MAC
>>>address, and it shouldn't require any userspace daemon to setup or
>>>tear down. Ideally with this solution the virtio bypass will come up
>>>and be assigned the name of the original virtio, and the "backup"
>>>interface will come up and be assigned the name of the original virtio
>>>with an additional "nbackup" tacked on via the phys_port_name, and
>>>then whenever a VF is added it will automatically be enslaved by the
>>>bypass interface, and it will be removed when the VF is hotplugged
>>>out.
>>>
>>>In my mind the difference between this and bond or team is where the
>>>configuration interface lies. In the case of bond it is in the kernel.
>>>If my understanding is correct team is mostly in user space. With this
>>>the configuration interface is really down in the hypervisor and
>>>requests are communicated up to the guest. I would prefer not to make
>>>virtio_net dependent on the bonding or team drivers, or worse yet a
>>>userspace daemon in the guest. For now I would argue we should keep
>>>this as simple as possible just to support basic live migration. There
>>>has already been discussions of refactoring this after it is in so
>>>that we can start to combine the functionality here with what is there
>>>in bonding/team, but the differences in configuration interface and
>>>the size of the code bases will make it challenging to outright merge
>>>this into something like that.


* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 17:14         ` [virtio-dev] " Samudrala, Sridhar
  (?)
@ 2018-02-20 20:14         ` Jiri Pirko
  2018-02-20 21:02             ` [virtio-dev] " Alexander Duyck
                             ` (2 more replies)
  -1 siblings, 3 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-20 20:14 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Netdev, Alexander Duyck, virtualization,
	Siwei Liu, David Miller

Tue, Feb 20, 2018 at 06:14:32PM CET, sridhar.samudrala@intel.com wrote:
>On 2/20/2018 8:29 AM, Jiri Pirko wrote:
>> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>> > On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> > > Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>> > > > Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>> > > > used by hypervisor to indicate that virtio_net interface should act as
>> > > > a backup for another device with the same MAC address.
>> > > > 
>> > > > Ppatch 2 is in response to the community request for a 3 netdev
>> > > > solution.  However, it creates some issues we'll get into in a moment.
>> > > > It extends virtio_net to use alternate datapath when available and
>> > > > registered. When BACKUP feature is enabled, virtio_net driver creates
>> > > > an additional 'bypass' netdev that acts as a master device and controls
>> > > > 2 slave devices.  The original virtio_net netdev is registered as
>> > > > 'backup' netdev and a passthru/vf device with the same MAC gets
>> > > > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>> > > > associated with the same 'pci' device.  The user accesses the network
>> > > > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>> > > > as default for transmits when it is available with link up and running.
>> > > Sorry, but this is ridiculous. You are apparently re-implemeting part
>> > > of bonding driver as a part of NIC driver. Bond and team drivers
>> > > are mature solutions, well tested, broadly used, with lots of issues
>> > > resolved in the past. What you try to introduce is a weird shortcut
>> > > that already has couple of issues as you mentioned and will certanly
>> > > have many more. Also, I'm pretty sure that in future, someone comes up
>> > > with ideas like multiple VFs, LACP and similar bonding things.
>> > The problem with the bond and team drivers is they are too large and
>> > have too many interfaces available for configuration so as a result
>> > they can really screw this interface up.
>> What? Too large is which sense? Why "too many interfaces" is a problem?
>> Also, team has only one interface to userspace team-generic-netlink.
>> 
>> 
>> > Essentially this is meant to be a bond that is more-or-less managed by
>> > the host, not the guest. We want the host to be able to configure it
>> How is it managed by the host? In your usecase the guest has 2 netdevs:
>> virtio_net, pci vf.
>> I don't see how host can do any managing of that, other than the
>> obvious. But still, the active/backup decision is done in guest. This is
>> a simple bond/team usecase. As I said, there is something needed to be
>> implemented in userspace in order to handle re-appear of vf netdev.
>> But that should be fairly easy to do in teamd.
>
>The host manages the active/backup decision by
>- assigning the same MAC address to both VF and virtio interfaces
>- setting a BACKUP feature bit on virtio that enables virtio to transparently
>take
>  over the VFs datapath.
>- only enable one datapath at anytime so that packets don't get looped back
>- during live migration enable virtio datapth, unplug vf on the source and
>replug
>  vf on the destination.
>
>The VM is not expected and doesn't have any control of setting the MAC
>address
>or bringing up/down the links.
>
>This is the model that is currently supported with netvsc driver on Azure.

Yeah, I can see it now :( I guess that the ship has sailed and we are
stuck with this ugly thing forever...

Could you at least make some common code that is shared between
netvsc and virtio_net so this is handled in exactly the same way in
both?

The fact that netvsc/virtio_net kidnaps a netdev only because it has
the same MAC is going to give me some serious nightmares...
I think we need to introduce some stricter checks.
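
As a sketch of the kind of stricter check I mean (names assumed; none
of this is in the patchset), the enslave path could require more than
a bare MAC match:

/* Sketch only: refuse to auto-enslave on a bare MAC match. */
static bool bypass_candidate_ok(struct net_device *backup,
				struct net_device *candidate)
{
	if (candidate == backup)
		return false;
	/* an exact address match is necessary but not sufficient */
	if (!ether_addr_equal(candidate->dev_addr, backup->dev_addr))
		return false;
	/* never grab another master device such as a bond or team */
	if (candidate->flags & IFF_MASTER)
		return false;
	return true;
}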


* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 20:14         ` Jiri Pirko
@ 2018-02-20 21:02             ` Alexander Duyck
  2018-02-20 21:02           ` Alexander Duyck
  2018-02-20 22:33           ` Jakub Kicinski
  2 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-20 21:02 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Samudrala, Sridhar, Michael S. Tsirkin, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski, Jason Wang, Siwei Liu

On Tue, Feb 20, 2018 at 12:14 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Feb 20, 2018 at 06:14:32PM CET, sridhar.samudrala@intel.com wrote:
>>On 2/20/2018 8:29 AM, Jiri Pirko wrote:
>>> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>>> > On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> > > Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>> > > > Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>> > > > used by hypervisor to indicate that virtio_net interface should act as
>>> > > > a backup for another device with the same MAC address.
>>> > > >
>>> > > > Ppatch 2 is in response to the community request for a 3 netdev
>>> > > > solution.  However, it creates some issues we'll get into in a moment.
>>> > > > It extends virtio_net to use alternate datapath when available and
>>> > > > registered. When BACKUP feature is enabled, virtio_net driver creates
>>> > > > an additional 'bypass' netdev that acts as a master device and controls
>>> > > > 2 slave devices.  The original virtio_net netdev is registered as
>>> > > > 'backup' netdev and a passthru/vf device with the same MAC gets
>>> > > > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>> > > > associated with the same 'pci' device.  The user accesses the network
>>> > > > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>> > > > as default for transmits when it is available with link up and running.
>>> > > Sorry, but this is ridiculous. You are apparently re-implemeting part
>>> > > of bonding driver as a part of NIC driver. Bond and team drivers
>>> > > are mature solutions, well tested, broadly used, with lots of issues
>>> > > resolved in the past. What you try to introduce is a weird shortcut
>>> > > that already has couple of issues as you mentioned and will certanly
>>> > > have many more. Also, I'm pretty sure that in future, someone comes up
>>> > > with ideas like multiple VFs, LACP and similar bonding things.
>>> > The problem with the bond and team drivers is they are too large and
>>> > have too many interfaces available for configuration so as a result
>>> > they can really screw this interface up.
>>> What? Too large is which sense? Why "too many interfaces" is a problem?
>>> Also, team has only one interface to userspace team-generic-netlink.
>>>
>>>
>>> > Essentially this is meant to be a bond that is more-or-less managed by
>>> > the host, not the guest. We want the host to be able to configure it
>>> How is it managed by the host? In your usecase the guest has 2 netdevs:
>>> virtio_net, pci vf.
>>> I don't see how host can do any managing of that, other than the
>>> obvious. But still, the active/backup decision is done in guest. This is
>>> a simple bond/team usecase. As I said, there is something needed to be
>>> implemented in userspace in order to handle re-appear of vf netdev.
>>> But that should be fairly easy to do in teamd.
>>
>>The host manages the active/backup decision by
>>- assigning the same MAC address to both VF and virtio interfaces
>>- setting a BACKUP feature bit on virtio that enables virtio to transparently
>>take
>>  over the VFs datapath.
>>- only enable one datapath at anytime so that packets don't get looped back
>>- during live migration enable virtio datapth, unplug vf on the source and
>>replug
>>  vf on the destination.
>>
>>The VM is not expected and doesn't have any control of setting the MAC
>>address
>>or bringing up/down the links.
>>
>>This is the model that is currently supported with netvsc driver on Azure.
>
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
>
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exacly the same way in both?
>
> The fact that the netvsc/virtio_net kidnaps a netdev only because it
> has the same mac is going to give me some serious nighmares...
> I think we need to introduce some more strict checks.

In order for that to work we need to settle on a model for these. The
issue is that netvsc uses what we refer to as the "2 netdev" model,
where the paravirtual interface is not exposed as its own netdev. The
opinion of Jakub and others has been that we should do a "3 netdev"
model in the case of virtio_net, since otherwise we lose functionality
such as in-driver XDP and have to deal with an extra set of qdiscs and
Tx queue locks on the transmit path.
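
For reference, that overhead can be avoided on a software master
device. A rough sketch (helper name assumed, not from the patchset) of
the relevant setup:

static void bypass_setup(struct net_device *dev)
{
	ether_setup(dev);
	dev->priv_flags |= IFF_NO_QUEUE;	/* no qdisc on the master */
	dev->features |= NETIF_F_LLTX;		/* no master Tx queue lock */
}

The slaves keep their own qdiscs and locks; only the master's extra
copy is skipped.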

Really, at this point I am good either way, but we probably need to
have Stephen, Jakub, and whoever else has an opinion on the matter
sort out the 2 vs 3 argument before we can proceed on that. Most of
patch 2 in the set can easily be broken out into a separate file later
if we decide to go that route.

Thanks.

- Alex

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 20:14         ` Jiri Pirko
  2018-02-20 21:02             ` [virtio-dev] " Alexander Duyck
  2018-02-20 21:02           ` Alexander Duyck
@ 2018-02-20 22:33           ` Jakub Kicinski
  2018-02-21  9:51             ` Jiri Pirko
  2 siblings, 1 reply; 121+ messages in thread
From: Jakub Kicinski @ 2018-02-20 22:33 UTC (permalink / raw)
  To: Jiri Pirko, Samudrala, Sridhar, Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin, Netdev,
	virtualization, Siwei Liu, David Miller

On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
> 
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exacly the same way in both?

IMHO netvsc is a vendor-specific driver which made a mistake in what
behaviour it provides (or tried to align itself with Windows SR-IOV).
Let's not make a far, far more commonly deployed and important driver
(virtio) bug-compatible with netvsc.

To Jiri's initial comments, I feel the same way; in fact, I've talked
to the NetworkManager guys about getting auto-bonding based on MACs
handled in user space.  I think it may very well get done in the next
versions of NM, but it isn't done yet.  Stephen also raised the point
that not everybody is using NM.


* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 22:33           ` Jakub Kicinski
@ 2018-02-21  9:51             ` Jiri Pirko
  2018-02-21 15:56                 ` [virtio-dev] " Alexander Duyck
  2018-02-21 15:56               ` Alexander Duyck
  0 siblings, 2 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-21  9:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin, Samudrala,
	Sridhar, Alexander Duyck, virtualization, Siwei Liu, Netdev,
	David Miller

Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>> stuck with this ugly thing forever...
>> 
>> Could you at least make some common code that is shared in between
>> netvsc and virtio_net so this is handled in exacly the same way in both?
>
>IMHO netvsc is a vendor specific driver which made a mistake on what
>behaviour it provides (or tried to align itself with Windows SR-IOV).
>Let's not make a far, far more commonly deployed and important driver
>(virtio) bug-compatible with netvsc.

Yeah. The netvsc solution is a dangerous precedent here and in my opinion
it was a huge mistake to merge it. I personally would vote to unmerge it
and make the solution based on team/bond.


>
>To Jiri's initial comments, I feel the same way, in fact I've talked to
>the NetworkManager guys to get auto-bonding based on MACs handled in
>user space.  I think it may very well get done in next versions of NM,
>but isn't done yet.  Stephen also raised the point that not everybody is
>using NM.

This can be done in NM, networkd or other network management tools.
It's even easier to do in teamd and let them all benefit.

Actually, I took a stab at implementing this in teamd. Took me like an
hour and a half.

You can just run teamd with config option "kidnap" like this:
# teamd/teamd -c '{"kidnap": true }'

Whenever teamd sees another netdev appear with the same mac as its own,
or sees an existing netdev change its mac to match, it enslaves it.
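
For completeness, a fuller config could look like this (the device name
and runner here are made up for illustration; the top-level placement
of "kidnap" is what the "$.kidnap" lookup in the patch expects):

# cat kidnap.conf
{
    "device": "team0",
    "runner": {"name": "activebackup"},
    "kidnap": true
}
# teamd/teamd -f kidnap.conf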

Here's the patch (quick and dirty):

Subject: [patch teamd] teamd: introduce kidnap feature

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/team.h             |  7 +++++++
 libteam/ifinfo.c           | 20 ++++++++++++++++++++
 teamd/teamd.c              | 17 +++++++++++++++++
 teamd/teamd.h              |  5 +++++
 teamd/teamd_events.c       | 17 +++++++++++++++++
 teamd/teamd_ifinfo_watch.c |  9 +++++++++
 teamd/teamd_per_port.c     |  7 ++++++-
 7 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/include/team.h b/include/team.h
index 9ae517d..b0c19c8 100644
--- a/include/team.h
+++ b/include/team.h
@@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
 #define team_for_each_ifinfo(ifinfo, th)			\
 	for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo;	\
 	     ifinfo = team_get_next_ifinfo(th, ifinfo))
+
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+						  struct team_ifinfo *ifinfo);
+#define team_for_each_unlinked_ifinfo(ifinfo, th)			\
+	for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo;	\
+	     ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
+
 /* ifinfo getters */
 bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
 uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
index 5c32a9c..8f9548e 100644
--- a/libteam/ifinfo.c
+++ b/libteam/ifinfo.c
@@ -494,6 +494,26 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
 	return NULL;
 }
 
+/**
+ * @param th		libteam library context
+ * @param ifinfo	ifinfo structure
+ *
+ * @details Get next unlinked ifinfo in list.
+ *
+ * @return Ifinfo next to ifinfo passed.
+ **/
+TEAM_EXPORT
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+						  struct team_ifinfo *ifinfo)
+{
+	do {
+		ifinfo = list_get_next_node_entry(&th->ifinfo_list, ifinfo, list);
+		if (ifinfo && !ifinfo->linked)
+			return ifinfo;
+	} while (ifinfo);
+	return NULL;
+}
+
 /**
  * @param ifinfo	ifinfo structure
  *
diff --git a/teamd/teamd.c b/teamd/teamd.c
index aac2511..069c7f0 100644
--- a/teamd/teamd.c
+++ b/teamd/teamd.c
@@ -926,8 +926,25 @@ static int teamd_event_watch_port_added(struct teamd_context *ctx,
 	return 0;
 }
 
+static int teamd_event_watch_unlinked_hwaddr_changed(struct teamd_context *ctx,
+						     struct team_ifinfo *ifinfo,
+						     void *priv)
+{
+	int err;
+	bool kidnap;
+
+	err = teamd_config_bool_get(ctx, &kidnap, "$.kidnap");
+	if (err || !kidnap ||
+	    ctx->hwaddr_len != team_get_ifinfo_hwaddr_len(ifinfo) ||
+	    memcmp(team_get_ifinfo_hwaddr(ifinfo),
+		   ctx->hwaddr, ctx->hwaddr_len))
+		return 0;
+	return teamd_port_add(ctx, team_get_ifinfo_ifindex(ifinfo));
+}
+
 static const struct teamd_event_watch_ops teamd_port_watch_ops = {
 	.port_added = teamd_event_watch_port_added,
+	.unlinked_hwaddr_changed = teamd_event_watch_unlinked_hwaddr_changed,
 };
 
 static int teamd_port_watch_init(struct teamd_context *ctx)
diff --git a/teamd/teamd.h b/teamd/teamd.h
index 5dbfb9b..171a8d1 100644
--- a/teamd/teamd.h
+++ b/teamd/teamd.h
@@ -189,6 +189,8 @@ struct teamd_event_watch_ops {
 				   struct teamd_port *tdport, void *priv);
 	int (*port_ifname_changed)(struct teamd_context *ctx,
 				   struct teamd_port *tdport, void *priv);
+	int (*unlinked_hwaddr_changed)(struct teamd_context *ctx,
+				       struct team_ifinfo *ifinfo, void *priv);
 	int (*option_changed)(struct teamd_context *ctx,
 			      struct team_option *option, void *priv);
 	char *option_changed_match_name;
@@ -210,6 +212,8 @@ int teamd_event_ifinfo_ifname_changed(struct teamd_context *ctx,
 				      struct team_ifinfo *ifinfo);
 int teamd_event_ifinfo_admin_state_changed(struct teamd_context *ctx,
 					   struct team_ifinfo *ifinfo);
+int teamd_event_unlinked_ifinfo_hwaddr_changed(struct teamd_context *ctx,
+					       struct team_ifinfo *ifinfo);
 int teamd_events_init(struct teamd_context *ctx);
 void teamd_events_fini(struct teamd_context *ctx);
 int teamd_event_watch_register(struct teamd_context *ctx,
@@ -313,6 +317,7 @@ static inline unsigned int teamd_port_count(struct teamd_context *ctx)
 	return ctx->port_obj_list_count;
 }
 
+int teamd_port_add(struct teamd_context *ctx, uint32_t ifindex);
 int teamd_port_add_ifname(struct teamd_context *ctx, const char *port_name);
 int teamd_port_remove_ifname(struct teamd_context *ctx, const char *port_name);
 int teamd_port_remove_all(struct teamd_context *ctx);
diff --git a/teamd/teamd_events.c b/teamd/teamd_events.c
index 1a95974..a377090 100644
--- a/teamd/teamd_events.c
+++ b/teamd/teamd_events.c
@@ -184,6 +184,23 @@ int teamd_event_ifinfo_admin_state_changed(struct teamd_context *ctx,
 	return 0;
 }
 
+int teamd_event_unlinked_ifinfo_hwaddr_changed(struct teamd_context *ctx,
+					       struct team_ifinfo *ifinfo)
+{
+	struct event_watch_item *watch;
+	int err;
+
+	list_for_each_node_entry(watch, &ctx->event_watch_list, list) {
+		if (watch->ops->unlinked_hwaddr_changed) {
+			err = watch->ops->unlinked_hwaddr_changed(ctx, ifinfo,
+								  watch->priv);
+			if (err)
+				return err;
+		}
+	}
+	return 0;
+}
+
 int teamd_events_init(struct teamd_context *ctx)
 {
 	list_init(&ctx->event_watch_list);
diff --git a/teamd/teamd_ifinfo_watch.c b/teamd/teamd_ifinfo_watch.c
index f334ff6..8d01a76 100644
--- a/teamd/teamd_ifinfo_watch.c
+++ b/teamd/teamd_ifinfo_watch.c
@@ -60,6 +60,15 @@ static int ifinfo_change_handler_func(struct team_handle *th, void *priv,
 				return err;
 		}
 	}
+
+	team_for_each_unlinked_ifinfo(ifinfo, th) {
+		if (team_is_ifinfo_hwaddr_changed(ifinfo) ||
+		    team_is_ifinfo_hwaddr_len_changed(ifinfo)) {
+			err = teamd_event_unlinked_ifinfo_hwaddr_changed(ctx, ifinfo);
+			if (err)
+				return err;
+		}
+	}
 	return 0;
 }
 
diff --git a/teamd/teamd_per_port.c b/teamd/teamd_per_port.c
index 09d1dc7..21e1bda 100644
--- a/teamd/teamd_per_port.c
+++ b/teamd/teamd_per_port.c
@@ -331,6 +331,11 @@ next_one:
 	return tdport;
 }
 
+int teamd_port_add(struct teamd_context *ctx, uint32_t ifindex)
+{
+	return team_port_add(ctx->th, ifindex);
+}
+
 int teamd_port_add_ifname(struct teamd_context *ctx, const char *port_name)
 {
 	uint32_t ifindex;
@@ -338,7 +343,7 @@ int teamd_port_add_ifname(struct teamd_context *ctx, const char *port_name)
 	ifindex = team_ifname2ifindex(ctx->th, port_name);
 	teamd_log_dbg("%s: Adding port (found ifindex \"%d\").",
 		      port_name, ifindex);
-	return team_port_add(ctx->th, ifindex);
+	return teamd_port_add(ctx, ifindex);
 }
 
 static int teamd_port_remove(struct teamd_context *ctx,
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21  9:51             ` Jiri Pirko
@ 2018-02-21 15:56                 ` Alexander Duyck
  2018-02-21 15:56               ` Alexander Duyck
  1 sibling, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-21 15:56 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, Samudrala, Sridhar, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>> stuck with this ugly thing forever...
>>>
>>> Could you at least make some common code that is shared in between
>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>
>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>Let's not make a far, far more commonly deployed and important driver
>>(virtio) bug-compatible with netvsc.
>
>> Yeah. The netvsc solution is a dangerous precedent here and in my opinion
>> it was a huge mistake to merge it. I personally would vote to unmerge it
> and make the solution based on team/bond.
>
>
>>
>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>user space.  I think it may very well get done in next versions of NM,
>>but isn't done yet.  Stephen also raised the point that not everybody is
>>using NM.
>
> This can be done in NM, networkd or other network management tools.
> It's even easier to do in teamd and let them all benefit.
>
> Actually, I took a stab at implementing this in teamd. Took me like
> an hour and a half.
>
> You can just run teamd with config option "kidnap" like this:
> # teamd/teamd -c '{"kidnap": true }'
>
> Whenever teamd sees another netdev appear with the same mac as its own,
> or sees an existing netdev change its mac to match, it enslaves it.
>
> Here's the patch (quick and dirty):
>
> Subject: [patch teamd] teamd: introduce kidnap feature
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

So this doesn't really address the original problem we were trying to
solve. You asked earlier why the netdev name mattered and it mostly
has to do with configuration. Specifically what our patch is
attempting to resolve is the issue of how to allow a cloud provider to
upgrade their customer to SR-IOV support and live migration without
requiring them to reconfigure their guest. So the general idea with
our patch is to take a VM that is running with virtio_net only and
allow it to instead spawn a virtio_bypass master using the same netdev
name as the original virtio, and then have the virtio_net and VF come
up and be enslaved by the bypass interface. Doing it this way we can
allow for multi-vendor SR-IOV live migration support using a guest
that was originally configured for virtio only.
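
To make that concrete, the transition we have in mind is roughly this
(interface names are made up):

    before upgrade:           after upgrade:

      eth0 (virtio_net)         eth0 (virtio_bypass, master)
                                  |-- virtio_net netdev (backup slave)
                                  `-- VF netdev (active slave)

Any guest configuration keyed to "eth0" keeps working unmodified.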

The problem with your solution is we already have teaming and bonding
as you said. There is already a write-up from Red Hat on how to do it
(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
That is all well and good as long as you are willing to keep around
two VM images, one for virtio, and one for SR-IOV with live migration.
The problem is nobody wants to do that. What they want is to maintain
one guest image and if they decide to upgrade to SR-IOV they still
want their live migration and they don't want to have to reconfigure
the guest.

That said it does seem to make the existing Red Hat solution easier to
manage since you wouldn't be guessing at ifname so I have provided
some feedback below.

> ---
>  include/team.h             |  7 +++++++
>  libteam/ifinfo.c           | 20 ++++++++++++++++++++
>  teamd/teamd.c              | 17 +++++++++++++++++
>  teamd/teamd.h              |  5 +++++
>  teamd/teamd_events.c       | 17 +++++++++++++++++
>  teamd/teamd_ifinfo_watch.c |  9 +++++++++
>  teamd/teamd_per_port.c     |  7 ++++++-
>  7 files changed, 81 insertions(+), 1 deletion(-)
>
> diff --git a/include/team.h b/include/team.h
> index 9ae517d..b0c19c8 100644
> --- a/include/team.h
> +++ b/include/team.h
> @@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
>  #define team_for_each_ifinfo(ifinfo, th)                       \
>         for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo;   \
>              ifinfo = team_get_next_ifinfo(th, ifinfo))
> +
> +struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
> +                                                 struct team_ifinfo *ifinfo);
> +#define team_for_each_unlinked_ifinfo(ifinfo, th)                      \
> +       for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo;  \
> +            ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
> +
>  /* ifinfo getters */
>  bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
>  uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
> diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
> index 5c32a9c..8f9548e 100644
> --- a/libteam/ifinfo.c
> +++ b/libteam/ifinfo.c
> @@ -494,6 +494,26 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
>         return NULL;
>  }
>
> +/**
> + * @param th           libteam library context
> + * @param ifinfo       ifinfo structure
> + *
> + * @details Get next unlinked ifinfo in list.
> + *
> + * @return Ifinfo next to ifinfo passed.
> + **/
> +TEAM_EXPORT
> +struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
> +                                                 struct team_ifinfo *ifinfo)
> +{
> +       do {
> +               ifinfo = list_get_next_node_entry(&th->ifinfo_list, ifinfo, list);
> +               if (ifinfo && !ifinfo->linked)
> +                       return ifinfo;
> +       } while (ifinfo);
> +       return NULL;
> +}
> +
>  /**
>   * @param ifinfo       ifinfo structure
>   *
> diff --git a/teamd/teamd.c b/teamd/teamd.c
> index aac2511..069c7f0 100644
> --- a/teamd/teamd.c
> +++ b/teamd/teamd.c
> @@ -926,8 +926,25 @@ static int teamd_event_watch_port_added(struct teamd_context *ctx,
>         return 0;
>  }
>
> +static int teamd_event_watch_unlinked_hwaddr_changed(struct teamd_context *ctx,
> +                                                    struct team_ifinfo *ifinfo,
> +                                                    void *priv)
> +{
> +       int err;
> +       bool kidnap;
> +
> +       err = teamd_config_bool_get(ctx, &kidnap, "$.kidnap");
> +       if (err || !kidnap ||
> +           ctx->hwaddr_len != team_get_ifinfo_hwaddr_len(ifinfo) ||
> +           memcmp(team_get_ifinfo_hwaddr(ifinfo),
> +                  ctx->hwaddr, ctx->hwaddr_len))
> +               return 0;
> +       return teamd_port_add(ctx, team_get_ifinfo_ifindex(ifinfo));
> +}
> +

So I am not sure about the name of this function. It seems to imply
that we want to capture a device if it changed its MAC address to
match the one we are using. I suppose that works if we are making this
a generic thing that can run on any netdev, but our focus is virtio
and VFs. In the grand scheme of things they shouldn't be able to
change their MAC address in most environments that we will care about.
That was one of the reasons why we didn't bother supporting a MAC
change in our code: the hypervisor should have this locked down, and
attempting to use a different MAC address would likely get the VM
flagged as malicious.
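
(As an illustration of what I mean by "locked": on the host side the
admin would typically pin the VF MAC with something along the lines of

# ip link set ens1 vf 0 mac 52:54:00:aa:bb:cc
# ip link set ens1 vf 0 spoofchk on

where the PF name, VF index, and MAC are of course made up. With
spoofchk on, the NIC drops frames sourced from any other MAC.)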

>  static const struct teamd_event_watch_ops teamd_port_watch_ops = {
>         .port_added = teamd_event_watch_port_added,
> +       .unlinked_hwaddr_changed = teamd_event_watch_unlinked_hwaddr_changed,
>  };
>
>  static int teamd_port_watch_init(struct teamd_context *ctx)
> diff --git a/teamd/teamd.h b/teamd/teamd.h
> index 5dbfb9b..171a8d1 100644
> --- a/teamd/teamd.h
> +++ b/teamd/teamd.h
> @@ -189,6 +189,8 @@ struct teamd_event_watch_ops {
>                                    struct teamd_port *tdport, void *priv);
>         int (*port_ifname_changed)(struct teamd_context *ctx,
>                                    struct teamd_port *tdport, void *priv);
> +       int (*unlinked_hwaddr_changed)(struct teamd_context *ctx,
> +                                      struct team_ifinfo *ifinfo, void *priv);
>         int (*option_changed)(struct teamd_context *ctx,
>                               struct team_option *option, void *priv);
>         char *option_changed_match_name;
> @@ -210,6 +212,8 @@ int teamd_event_ifinfo_ifname_changed(struct teamd_context *ctx,
>                                       struct team_ifinfo *ifinfo);
>  int teamd_event_ifinfo_admin_state_changed(struct teamd_context *ctx,
>                                            struct team_ifinfo *ifinfo);
> +int teamd_event_unlinked_ifinfo_hwaddr_changed(struct teamd_context *ctx,
> +                                              struct team_ifinfo *ifinfo);
>  int teamd_events_init(struct teamd_context *ctx);
>  void teamd_events_fini(struct teamd_context *ctx);
>  int teamd_event_watch_register(struct teamd_context *ctx,
> @@ -313,6 +317,7 @@ static inline unsigned int teamd_port_count(struct teamd_context *ctx)
>         return ctx->port_obj_list_count;
>  }
>
> +int teamd_port_add(struct teamd_context *ctx, uint32_t ifindex);
>  int teamd_port_add_ifname(struct teamd_context *ctx, const char *port_name);
>  int teamd_port_remove_ifname(struct teamd_context *ctx, const char *port_name);
>  int teamd_port_remove_all(struct teamd_context *ctx);
> diff --git a/teamd/teamd_events.c b/teamd/teamd_events.c
> index 1a95974..a377090 100644
> --- a/teamd/teamd_events.c
> +++ b/teamd/teamd_events.c
> @@ -184,6 +184,23 @@ int teamd_event_ifinfo_admin_state_changed(struct teamd_context *ctx,
>         return 0;
>  }
>
> +int teamd_event_unlinked_ifinfo_hwaddr_changed(struct teamd_context *ctx,
> +                                              struct team_ifinfo *ifinfo)
> +{
> +       struct event_watch_item *watch;
> +       int err;
> +
> +       list_for_each_node_entry(watch, &ctx->event_watch_list, list) {
> +               if (watch->ops->unlinked_hwaddr_changed) {

I would probably flip the order of this. There is no point in doing
the loop if unlinked_hwaddr_changed is NULL. So you could probably
check for the function pointer first and then run the loop if it is
set.

> +                       err = watch->ops->unlinked_hwaddr_changed(ctx, ifinfo,
> +                                                                 watch->priv);
> +                       if (err)
> +                               return err;
> +               }
> +       }
> +       return 0;
> +}
> +
>  int teamd_events_init(struct teamd_context *ctx)
>  {
>         list_init(&ctx->event_watch_list);
> diff --git a/teamd/teamd_ifinfo_watch.c b/teamd/teamd_ifinfo_watch.c
> index f334ff6..8d01a76 100644
> --- a/teamd/teamd_ifinfo_watch.c
> +++ b/teamd/teamd_ifinfo_watch.c
> @@ -60,6 +60,15 @@ static int ifinfo_change_handler_func(struct team_handle *th, void *priv,
>                                 return err;
>                 }
>         }
> +
> +       team_for_each_unlinked_ifinfo(ifinfo, th) {
> +               if (team_is_ifinfo_hwaddr_changed(ifinfo) ||
> +                   team_is_ifinfo_hwaddr_len_changed(ifinfo)) {
> +                       err = teamd_event_unlinked_ifinfo_hwaddr_changed(ctx, ifinfo);
> +                       if (err)
> +                               return err;
> +               }
> +       }

I guess this is needed for the generic case, but as I said we probably
wouldn't need to worry about this in the VF/virtio case as the VM is
likely locked to a specific MAC address.

Also I am not sure about this bit. It seems like this is only checking
for the HW addr being changed. Is that bit set if a new interface is
registered? I haven't worked on teamd so I am not familiar with how it
handles new interfaces. Also how does this handle existing interfaces
that were registered before you started this?

>         return 0;
>  }
>
> diff --git a/teamd/teamd_per_port.c b/teamd/teamd_per_port.c
> index 09d1dc7..21e1bda 100644
> --- a/teamd/teamd_per_port.c
> +++ b/teamd/teamd_per_port.c
> @@ -331,6 +331,11 @@ next_one:
>         return tdport;
>  }
>
> +int teamd_port_add(struct teamd_context *ctx, uint32_t ifindex)
> +{
> +       return team_port_add(ctx->th, ifindex);
> +}
> +
>  int teamd_port_add_ifname(struct teamd_context *ctx, const char *port_name)
>  {
>         uint32_t ifindex;
> @@ -338,7 +343,7 @@ int teamd_port_add_ifname(struct teamd_context *ctx, const char *port_name)
>         ifindex = team_ifname2ifindex(ctx->th, port_name);
>         teamd_log_dbg("%s: Adding port (found ifindex \"%d\").",
>                       port_name, ifindex);
> -       return team_port_add(ctx->th, ifindex);
> +       return teamd_port_add(ctx, ifindex);
>  }
>
>  static int teamd_port_remove(struct teamd_context *ctx,

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 15:56                 ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-02-21 16:11                 ` Jiri Pirko
  2018-02-21 16:49                     ` [virtio-dev] " Alexander Duyck
  2018-02-21 16:49                   ` Alexander Duyck
  -1 siblings, 2 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-21 16:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	Netdev, David Miller

Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>> stuck with this ugly thing forever...
>>>>
>>>> Could you at least make some common code that is shared in between
>>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>>
>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>Let's not make a far, far more commonly deployed and important driver
>>>(virtio) bug-compatible with netvsc.
>>
>> Yeah. The netvsc solution is a dangerous precedent here and in my opinion
>> it was a huge mistake to merge it. I personally would vote to unmerge it
>> and make the solution based on team/bond.
>>
>>
>>>
>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>user space.  I think it may very well get done in next versions of NM,
>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>using NM.
>>
>> This can be done in NM, networkd or other network management tools.
>> It's even easier to do in teamd and let them all benefit.
>>
>> Actually, I took a stab at implementing this in teamd. Took me like
>> an hour and a half.
>>
>> You can just run teamd with config option "kidnap" like this:
>> # teamd/teamd -c '{"kidnap": true }'
>>
>> Whenever teamd sees another netdev appear with the same mac as its own,
>> or sees an existing netdev change its mac to match, it enslaves it.
>>
>> Here's the patch (quick and dirty):
>>
>> Subject: [patch teamd] teamd: introduce kidnap feature
>>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>
>So this doesn't really address the original problem we were trying to
>solve. You asked earlier why the netdev name mattered and it mostly
>has to do with configuration. Specifically what our patch is
>attempting to resolve is the issue of how to allow a cloud provider to
>upgrade their customer to SR-IOV support and live migration without
>requiring them to reconfigure their guest. So the general idea with
>our patch is to take a VM that is running with virtio_net only and
>allow it to instead spawn a virtio_bypass master using the same netdev
>name as the original virtio, and then have the virtio_net and VF come
>up and be enslaved by the bypass interface. Doing it this way we can
>allow for multi-vendor SR-IOV live migration support using a guest
>that was originally configured for virtio only.
>
>The problem with your solution is we already have teaming and bonding
>as you said. There is already a write-up from Red Hat on how to do it
>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>That is all well and good as long as you are willing to keep around
>two VM images, one for virtio, and one for SR-IOV with live migration.

You don't need 2 images. You need only one, the one with the team setup.
That's it. If another netdev with the same mac appears, teamd will
enslave it and run traffic over it. If not, you'll simply go through
virtio_net.


>The problem is nobody wants to do that. What they want is to maintain
>one guest image and if they decide to upgrade to SR-IOV they still
>want their live migration and they don't want to have to reconfigure
>the guest.
>
>That said it does seem to make the existing Red Hat solution easier to
>manage since you wouldn't be guessing at ifname so I have provided
>some feedback below.
>
>> ---
>>  include/team.h             |  7 +++++++
>>  libteam/ifinfo.c           | 20 ++++++++++++++++++++
>>  teamd/teamd.c              | 17 +++++++++++++++++
>>  teamd/teamd.h              |  5 +++++
>>  teamd/teamd_events.c       | 17 +++++++++++++++++
>>  teamd/teamd_ifinfo_watch.c |  9 +++++++++
>>  teamd/teamd_per_port.c     |  7 ++++++-
>>  7 files changed, 81 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/team.h b/include/team.h
>> index 9ae517d..b0c19c8 100644
>> --- a/include/team.h
>> +++ b/include/team.h
>> @@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
>>  #define team_for_each_ifinfo(ifinfo, th)                       \
>>         for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo;   \
>>              ifinfo = team_get_next_ifinfo(th, ifinfo))
>> +
>> +struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
>> +                                                 struct team_ifinfo *ifinfo);
>> +#define team_for_each_unlinked_ifinfo(ifinfo, th)                      \
>> +       for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo;  \
>> +            ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
>> +
>>  /* ifinfo getters */
>>  bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
>>  uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
>> diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
>> index 5c32a9c..8f9548e 100644
>> --- a/libteam/ifinfo.c
>> +++ b/libteam/ifinfo.c
>> @@ -494,6 +494,26 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
>>         return NULL;
>>  }
>>
>> +/**
>> + * @param th           libteam library context
>> + * @param ifinfo       ifinfo structure
>> + *
>> + * @details Get next unlinked ifinfo in list.
>> + *
>> + * @return Ifinfo next to ifinfo passed.
>> + **/
>> +TEAM_EXPORT
>> +struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
>> +                                                 struct team_ifinfo *ifinfo)
>> +{
>> +       do {
>> +               ifinfo = list_get_next_node_entry(&th->ifinfo_list, ifinfo, list);
>> +               if (ifinfo && !ifinfo->linked)
>> +                       return ifinfo;
>> +       } while (ifinfo);
>> +       return NULL;
>> +}
>> +
>>  /**
>>   * @param ifinfo       ifinfo structure
>>   *
>> diff --git a/teamd/teamd.c b/teamd/teamd.c
>> index aac2511..069c7f0 100644
>> --- a/teamd/teamd.c
>> +++ b/teamd/teamd.c
>> @@ -926,8 +926,25 @@ static int teamd_event_watch_port_added(struct teamd_context *ctx,
>>         return 0;
>>  }
>>
>> +static int teamd_event_watch_unlinked_hwaddr_changed(struct teamd_context *ctx,
>> +                                                    struct team_ifinfo *ifinfo,
>> +                                                    void *priv)
>> +{
>> +       int err;
>> +       bool kidnap;
>> +
>> +       err = teamd_config_bool_get(ctx, &kidnap, "$.kidnap");
>> +       if (err || !kidnap ||
>> +           ctx->hwaddr_len != team_get_ifinfo_hwaddr_len(ifinfo) ||
>> +           memcmp(team_get_ifinfo_hwaddr(ifinfo),
>> +                  ctx->hwaddr, ctx->hwaddr_len))
>> +               return 0;
>> +       return teamd_port_add(ctx, team_get_ifinfo_ifindex(ifinfo));
>> +}
>> +
>
>So I am not sure about the name of this function. It seems to imply
>that we want to capture a device if it changed its MAC address to
>match the one we are using. I suppose that works if we are making this
>a generic thing that can run on any netdev, but our focus is virtio
>and VFs. In the grand scheme of things they shouldn't be able to
>change their MAC address in most environments that we will care about.
>That was one of the reasons why we didn't bother supporting a MAC
>change in our code: the hypervisor should have this locked down, and
>attempting to use a different MAC address would likely get the VM
>flagged as malicious.

This cb is called whenever the mac changes, and also when a netdev
appears (its hwaddr changes from 00:00:00:00:00:00 and its hwaddr_len
from 0).


>
>>  static const struct teamd_event_watch_ops teamd_port_watch_ops = {
>>         .port_added = teamd_event_watch_port_added,
>> +       .unlinked_hwaddr_changed = teamd_event_watch_unlinked_hwaddr_changed,
>>  };
>>
>>  static int teamd_port_watch_init(struct teamd_context *ctx)
>> diff --git a/teamd/teamd.h b/teamd/teamd.h
>> index 5dbfb9b..171a8d1 100644
>> --- a/teamd/teamd.h
>> +++ b/teamd/teamd.h
>> @@ -189,6 +189,8 @@ struct teamd_event_watch_ops {
>>                                    struct teamd_port *tdport, void *priv);
>>         int (*port_ifname_changed)(struct teamd_context *ctx,
>>                                    struct teamd_port *tdport, void *priv);
>> +       int (*unlinked_hwaddr_changed)(struct teamd_context *ctx,
>> +                                      struct team_ifinfo *ifinfo, void *priv);
>>         int (*option_changed)(struct teamd_context *ctx,
>>                               struct team_option *option, void *priv);
>>         char *option_changed_match_name;
>> @@ -210,6 +212,8 @@ int teamd_event_ifinfo_ifname_changed(struct teamd_context *ctx,
>>                                       struct team_ifinfo *ifinfo);
>>  int teamd_event_ifinfo_admin_state_changed(struct teamd_context *ctx,
>>                                            struct team_ifinfo *ifinfo);
>> +int teamd_event_unlinked_ifinfo_hwaddr_changed(struct teamd_context *ctx,
>> +                                              struct team_ifinfo *ifinfo);
>>  int teamd_events_init(struct teamd_context *ctx);
>>  void teamd_events_fini(struct teamd_context *ctx);
>>  int teamd_event_watch_register(struct teamd_context *ctx,
>> @@ -313,6 +317,7 @@ static inline unsigned int teamd_port_count(struct teamd_context *ctx)
>>         return ctx->port_obj_list_count;
>>  }
>>
>> +int teamd_port_add(struct teamd_context *ctx, uint32_t ifindex);
>>  int teamd_port_add_ifname(struct teamd_context *ctx, const char *port_name);
>>  int teamd_port_remove_ifname(struct teamd_context *ctx, const char *port_name);
>>  int teamd_port_remove_all(struct teamd_context *ctx);
>> diff --git a/teamd/teamd_events.c b/teamd/teamd_events.c
>> index 1a95974..a377090 100644
>> --- a/teamd/teamd_events.c
>> +++ b/teamd/teamd_events.c
>> @@ -184,6 +184,23 @@ int teamd_event_ifinfo_admin_state_changed(struct teamd_context *ctx,
>>         return 0;
>>  }
>>
>> +int teamd_event_unlinked_ifinfo_hwaddr_changed(struct teamd_context *ctx,
>> +                                              struct team_ifinfo *ifinfo)
>> +{
>> +       struct event_watch_item *watch;
>> +       int err;
>> +
>> +       list_for_each_node_entry(watch, &ctx->event_watch_list, list) {
>> +               if (watch->ops->unlinked_hwaddr_changed) {
>
>I would probably flip the order of this. There is no point in entering
>the body of the loop if unlinked_hwaddr_changed is NULL. So you could
>probably check for the function pointer first and continue to the next
>watch if it isn't set.

Sure. As I said, quick and dirty patch :)
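
(For the record, since each watch carries its own ops the check has to
stay inside the loop; the early-continue variant of what Alex suggests
would be a sketch along these lines:

	list_for_each_node_entry(watch, &ctx->event_watch_list, list) {
		/* skip watches that don't implement this hook */
		if (!watch->ops->unlinked_hwaddr_changed)
			continue;
		err = watch->ops->unlinked_hwaddr_changed(ctx, ifinfo,
							  watch->priv);
		if (err)
			return err;
	}

Same behaviour, one level of nesting less.)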


>
>> +                       err = watch->ops->unlinked_hwaddr_changed(ctx, ifinfo,
>> +                                                                 watch->priv);
>> +                       if (err)
>> +                               return err;
>> +               }
>> +       }
>> +       return 0;
>> +}
>> +
>>  int teamd_events_init(struct teamd_context *ctx)
>>  {
>>         list_init(&ctx->event_watch_list);
>> diff --git a/teamd/teamd_ifinfo_watch.c b/teamd/teamd_ifinfo_watch.c
>> index f334ff6..8d01a76 100644
>> --- a/teamd/teamd_ifinfo_watch.c
>> +++ b/teamd/teamd_ifinfo_watch.c
>> @@ -60,6 +60,15 @@ static int ifinfo_change_handler_func(struct team_handle *th, void *priv,
>>                                 return err;
>>                 }
>>         }
>> +
>> +       team_for_each_unlinked_ifinfo(ifinfo, th) {
>> +               if (team_is_ifinfo_hwaddr_changed(ifinfo) ||
>> +                   team_is_ifinfo_hwaddr_len_changed(ifinfo)) {
>> +                       err = teamd_event_unlinked_ifinfo_hwaddr_changed(ctx, ifinfo);
>> +                       if (err)
>> +                               return err;
>> +               }
>> +       }
>
>I guess this is needed for the generic case, but as I said we probably
>wouldn't need to worry about this in the VF/virtio case as the VM is
>likely locked to a specific MAC address.
>
>Also I am not sure about this bit. It seems like this is only checking
>for the HW addr being changed. Is that flag set when a new interface is
>registered? I haven't worked on teamd so I am not familiar with how it
>handles new interfaces. Also, how does this handle existing interfaces
>that were registered before teamd was started?

See my reply above.


>
>>         return 0;
>>  }
>>
>> diff --git a/teamd/teamd_per_port.c b/teamd/teamd_per_port.c
>> index 09d1dc7..21e1bda 100644
>> --- a/teamd/teamd_per_port.c
>> +++ b/teamd/teamd_per_port.c
>> @@ -331,6 +331,11 @@ next_one:
>>         return tdport;
>>  }
>>
>> +int teamd_port_add(struct teamd_context *ctx, uint32_t ifindex)
>> +{
>> +       return team_port_add(ctx->th, ifindex);
>> +}
>> +
>>  int teamd_port_add_ifname(struct teamd_context *ctx, const char *port_name)
>>  {
>>         uint32_t ifindex;
>> @@ -338,7 +343,7 @@ int teamd_port_add_ifname(struct teamd_context *ctx, const char *port_name)
>>         ifindex = team_ifname2ifindex(ctx->th, port_name);
>>         teamd_log_dbg("%s: Adding port (found ifindex \"%d\").",
>>                       port_name, ifindex);
>> -       return team_port_add(ctx->th, ifindex);
>> +       return teamd_port_add(ctx, ifindex);
>>  }
>>
>>  static int teamd_port_remove(struct teamd_context *ctx,

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 16:11                 ` Jiri Pirko
@ 2018-02-21 16:49                     ` Alexander Duyck
  2018-02-21 16:49                   ` Alexander Duyck
  1 sibling, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-21 16:49 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, Samudrala, Sridhar, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>> stuck with this ugly thing forever...
>>>>>
>>>>> Could you at least make some common code that is shared between
>>>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>>>
>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>Let's not make a far, far more commonly deployed and important driver
>>>>(virtio) bug-compatible with netvsc.
>>>
>>> Yeah. netvsc solution is a dangerous precedent here and in my opinion
>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>> and make the solution based on team/bond.
>>>
>>>
>>>>
>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>user space.  I think it may very well get done in next versions of NM,
>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>using NM.
>>>
>>> Can be done in NM, networkd or other network management tools.
>>> Even easier to do this in teamd and let them all benefit.
>>>
>>> Actually, I took a stab at implementing this in teamd. Took me like
>>> an hour and a half.
>>>
>>> You can just run teamd with config option "kidnap" like this:
>>> # teamd/teamd -c '{"kidnap": true }'
>>>
>>> Whenever teamd sees another netdev appear with the same MAC as its
>>> own, or sees another netdev change its MAC to match, it enslaves it.
>>>
>>> Here's the patch (quick and dirty):
>>>
>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>
>>So this doesn't really address the original problem we were trying to
>>solve. You asked earlier why the netdev name mattered and it mostly
>>has to do with configuration. Specifically what our patch is
>>attempting to resolve is the issue of how to allow a cloud provider to
>>upgrade their customer to SR-IOV support and live migration without
>>requiring them to reconfigure their guest. So the general idea with
>>our patch is to take a VM that is running with virtio_net only and
>>allow it to instead spawn a virtio_bypass master using the same netdev
>>name as the original virtio, and then have the virtio_net and VF come
>>up and be enslaved by the bypass interface. Doing it this way we can
>>allow for multi-vendor SR-IOV live migration support using a guest
>>that was originally configured for virtio only.
>>
>>The problem with your solution is we already have teaming and bonding
>>as you said. There is already a write-up from Red Hat on how to do it
>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>That is all well and good as long as you are willing to keep around
>>two VM images, one for virtio, and one for SR-IOV with live migration.
>
> You don't need 2 images. You need only one. The one with the team setup.
> That's it. If another netdev with the same mac appears, teamd will
> enslave it and run traffic on it. If not, ok, you'll go only through
> virtio_net.

Isn't that going to cause the routing table to get messed up when we
rearrange the netdevs? We don't want a significant disruption in
traffic when we are adding/removing the VF. It seems like we would need
to invalidate any entries that were configured for the virtio_net and
reestablish them on the new team interface. Part of the criteria we
have been working with is that we should be able to transition from
having a VF to not having one, or vice versa, without any significant
disruption in traffic.
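
To make that concrete, a sketch of the kind of state I am worried about
(interface names and addresses are made up):

  $ ip route show
  default via 192.0.2.1 dev ens1 proto dhcp
  192.0.2.0/24 dev ens1 proto kernel scope link

Once teamd creates team0 and enslaves ens1, those entries would have to
be torn down and recreated against team0, and anything in flight during
that window gets disrupted.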

Also how does this handle any static configuration? I assume the
expectation here is that the team will be brought up as soon as it
appears and assigned a DHCP address.

The solution as you have proposed seems problematic at best. I don't
see how the team solution works without introducing some sort of
traffic disruption to either add/remove the VF or bring up/tear down
the team interface. At that point we might as well just give up on
this piece of live migration support entirely since the disruption was
what we were trying to avoid. We might as well just hotplug out the VF,
hotplug in a virtio device at the same bus/device/function number, and
let udev take care of renaming it for us. The idea was supposed to be
a seamless transition between the two interfaces.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 16:49                     ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-02-21 16:58                     ` Jiri Pirko
  2018-02-21 17:56                       ` Alexander Duyck
  2018-02-21 17:56                         ` [virtio-dev] " Alexander Duyck
  -1 siblings, 2 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-21 16:58 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	Netdev, David Miller

Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>> stuck with this ugly thing forever...
>>>>>>
>>>>>> Could you at least make some common code that is shared between
>>>>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>>>>
>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>(virtio) bug-compatible with netvsc.
>>>>
>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinion
>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>> and make the solution based on team/bond.
>>>>
>>>>
>>>>>
>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>using NM.
>>>>
>>>> Can be done in NM, networkd or other network management tools.
>>>> Even easier to do this in teamd and let them all benefit.
>>>>
>>>> Actually, I took a stab at implementing this in teamd. Took me like
>>>> an hour and a half.
>>>>
>>>> You can just run teamd with config option "kidnap" like this:
>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>
>>>> Whenever teamd sees another netdev appear with the same MAC as its
>>>> own, or sees another netdev change its MAC to match, it enslaves it.
>>>>
>>>> Here's the patch (quick and dirty):
>>>>
>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>
>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>
>>>So this doesn't really address the original problem we were trying to
>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>has to do with configuration. Specifically what our patch is
>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>upgrade their customer to SR-IOV support and live migration without
>>>requiring them to reconfigure their guest. So the general idea with
>>>our patch is to take a VM that is running with virtio_net only and
>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>name as the original virtio, and then have the virtio_net and VF come
>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>that was originally configured for virtio only.
>>>
>>>The problem with your solution is we already have teaming and bonding
>>>as you said. There is already a write-up from Red Hat on how to do it
>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>That is all well and good as long as you are willing to keep around
>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>
>> You don't need 2 images. You need only one. The one with the team setup.
>> That's it. If another netdev with the same mac appears, teamd will
>> enslave it and run traffic on it. If not, ok, you'll go only through
>> virtio_net.
>
>Isn't that going to cause the routing table to get messed up when we
>rearrange the netdevs? We don't want a significant disruption in
>traffic when we are adding/removing the VF. It seems like we would need
>to invalidate any entries that were configured for the virtio_net and
>reestablish them on the new team interface. Part of the criteria we
>have been working with is that we should be able to transition from
>having a VF to not having one, or vice versa, without any significant
>disruption in traffic.

What? You have routes on the team netdev. virtio_net and VF are only
slaves. What are you talking about? I don't get it :/


>
>Also how does this handle any static configuration? I assume the
>expectation here is that the team will be brought up as soon as it
>appears and assigned a DHCP address.

Again. You configure whatever you need on the team netdev.


>
>The solution as you have proposed seems problematic at best. I don't
>see how the team solution works without introducing some sort of
>traffic disruption to either add/remove the VF or bring up/tear down
>the team interface. At that point we might as well just give up on
>this piece of live migration support entirely since the disruption was
>what we were trying to avoid. We might as well just hotplug out the VF,
>hotplug in a virtio device at the same bus/device/function number, and
>let udev take care of renaming it for us. The idea was supposed to be
>a seamless transition between the two interfaces.

Alex, what you are trying to do in this patchset, and what netvsc does,
is essentially in-driver bonding. Same mechanism: rx_handler,
everything. I don't really understand what you are talking about. With
team you will get exactly the same behaviour.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 16:58                     ` Jiri Pirko
@ 2018-02-21 17:56                         ` Alexander Duyck
  2018-02-21 17:56                         ` [virtio-dev] " Alexander Duyck
  1 sibling, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-21 17:56 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, Samudrala, Sridhar, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>> stuck with this ugly thing forever...
>>>>>>>
>>>>>>> Could you at least make some common code that is shared between
>>>>>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>>>>>
>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>(virtio) bug-compatible with netvsc.
>>>>>
>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinion
>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>> and make the solution based on team/bond.
>>>>>
>>>>>
>>>>>>
>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>using NM.
>>>>>
>>>>> Can be done in NM, networkd or other network management tools.
>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>
>>>>> Actually, I took a stab at implementing this in teamd. Took me like
>>>>> an hour and a half.
>>>>>
>>>>> You can just run teamd with config option "kidnap" like this:
>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>
>>>>> Whenever teamd sees another netdev appear with the same MAC as its
>>>>> own, or sees another netdev change its MAC to match, it enslaves it.
>>>>>
>>>>> Here's the patch (quick and dirty):
>>>>>
>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>
>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>
>>>>So this doesn't really address the original problem we were trying to
>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>has to do with configuration. Specifically what our patch is
>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>upgrade their customer to SR-IOV support and live migration without
>>>>requiring them to reconfigure their guest. So the general idea with
>>>>our patch is to take a VM that is running with virtio_net only and
>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>that was originally configured for virtio only.
>>>>
>>>>The problem with your solution is we already have teaming and bonding
>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>That is all well and good as long as you are willing to keep around
>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>
>>> You don't need 2 images. You need only one. The one with the team setup.
>>> That's it. If another netdev with the same mac appears, teamd will
>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>> virtio_net.
>>
>>Isn't that going to cause the routing table to get messed up when we
>>rearrange the netdevs? We don't want a significant disruption in
>>traffic when we are adding/removing the VF. It seems like we would need
>>to invalidate any entries that were configured for the virtio_net and
>>reestablish them on the new team interface. Part of the criteria we
>>have been working with is that we should be able to transition from
>>having a VF to not having one, or vice versa, without any significant
>>disruption in traffic.
>
> What? You have routes on the team netdev. virtio_net and VF are only
> slaves. What are you talking about? I don't get it :/

So let's walk through this by example. The general idea of the base
case for all this is somebody starting with virtio_net; we will call
the interface "ens1" for now. It comes up, is assigned a DHCP address,
and everything works as expected. Now, in order to get better
performance, we want to add a VF "ens2", but we don't want a new IP
address. If I understand correctly, when "ens2" appears on the system
teamd will create a new team interface "team0", and before teamd can
enslave ens1 it has to down that interface. This means we have to
disrupt network traffic in order for this to work.
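
(The manual equivalent of that enslave step, just to show where the
bounce comes from -- a sketch, not verified against current teamd:

  ip link set ens1 down        # team_port_add() rejects ports that are up
  teamdctl team0 port add ens1
  ip link set team0 up

That forced "down" on ens1 is exactly the disruption I am worried
about.)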

To give you an idea of where we were before this became about trying
to do this in the team or bonding driver, we were debating a 2 netdev
model versus a 3 netdev model. I will call out each model and its
advantages/disadvantages below.

2 Netdev model, "ens1", enslaves "ens2".
- Requires dropping in-driver XDP in order to work (won't capture VF
traffic otherwise)
- VF takes performance hit for extra qdisc/Tx queue lock of virtio_net interface
- If you ass-u-me (I haven't been a fan of this model if you can't
tell) that it is okay to rip out in-driver XDP from virtio_net, then
you could transition between base virtio and virtio w/ backup bit set.
- Works for netvsc because they limit their features (no in-driver
XDP) to guarantee this works.

3 Netdev model, "ens1", enslaves "ens1nbackup" and "ens2"
- Exposes 2 netdevs "ens1" and "ens1nbackup" when only virtio is present
- No extra qdisc or locking
- All virtio_net original functionality still present
- Not able to transition from virtio to virtio w/ backup without
disruption (requires hot-plug)

The way I see it, the only way your team setup could work would be
something closer to the 3 netdev model. Basically we would be requiring
the user to always have team0 present in order to make certain that
anything like XDP runs on the team interface instead of assuming that
the virtio_net could run by itself. I will add it as a third option
here to compare to the other 2.

3 Netdev "team" model, "team0", enslaves "ens1" and "ens2"
- Requires guest to configure teamd
- Exposes "team0" and "ens1" when only virtio is present
- No extra qdisc or locking
- Doesn't require "backup" bit in virtio
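
For reference, the guest-side setup that third option implies could be
as small as this (a sketch; "kidnap" is the option from Jiri's draft
patch above, the rest is stock teamd config):

  # /etc/teamd/team0.conf
  {
      "device": "team0",
      "runner": { "name": "activebackup" },
      "kidnap": true
  }
  # teamd -d -f /etc/teamd/team0.conf

But that file still has to be baked into every guest image, which is
where our requirements differ.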

>>
>>Also how does this handle any static configuration? I assume the
>>expectation here is that the team will be brought up as soon as it
>>appears and assigned a DHCP address.
>
> Again. You configure whatever you need on the team netdev.

Just so we are clear, are you then saying that the team0 interface
will always be present with this configuration? You had made it sound
like it would disappear if you didn't have at least 2 interfaces.

>>
>>The solution as you have proposed seems problematic at best. I don't
>>see how the team solution works without introducing some sort of
>>traffic disruption to either add/remove the VF or bring up/tear down
>>the team interface. At that point we might as well just give up on
>>this piece of live migration support entirely since the disruption was
>>what we were trying to avoid. We might as well just hotplug out the VF,
>>hotplug in a virtio device at the same bus/device/function number, and
>>let udev take care of renaming it for us. The idea was supposed to be
>>a seamless transition between the two interfaces.
>
> Alex, what you are trying to do in this patchset, and what netvsc does,
> is essentially in-driver bonding. Same mechanism: rx_handler,
> everything. I don't really understand what you are talking about. With
> team you will get exactly the same behaviour.

So the goal of the "in-driver bonding" is to make the bonding as
non-intrusive as possible and to require as little user intervention as
possible. I agree that much of the handling is the same, however the
control structure and requirements are significantly different. That
has been what I have been trying to explain. You keep wanting to use
the existing structures, but they don't really apply cleanly because
they push control of the interface up into the guest, and that doesn't
make much sense in the case of virtualization. What is happening here
is that we are exposing a bond that the guest should have no control
over, or at least as little as possible. In addition, requiring the
user to add configuration in the guest means there is that much more
that can go wrong if they screw it up.

The other problem here is that the transition needs to be as seamless
as possible between a standard virtio_net setup and this new setup.
With either the team or bonding setup you end up essentially forcing
the guest to have the bond/team always there, even if it is running
only a single interface. Only when they "upgrade" the VM by adding a VF
does it finally get to do anything.

What this comes down to for us is the following requirements:
1. The name of the interface cannot change when going from virtio_net
to virtio_net being bypassed by a VF. We cannot create an interface on
top of the existing one; if anything, we need to push the original
virtio_net out of the way so that the new team interface takes its
place in the configuration of the system. Otherwise a VM with a VF and
live migration will require a different configuration than one that
just runs virtio_net.
2. We need some way to signal whether this VM should be running in an
"upgraded" mode or not. We have been using the backup bit in virtio_net
to do that. If it isn't "upgraded" then we don't need the team/bond and
we can just run with virtio_net.
3. We cannot introduce any downtime on the interface when adding or
removing a VF. The link must stay up the entire time and be able to
handle packets.
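
On point 2, the check on the guest side is already just a feature-bit
test; a sketch of the gating in the driver (virtnet_bypass_register()
is a made-up name for whatever ends up doing the setup):

	/* Only set up the bypass machinery when the hypervisor
	 * advertised the BACKUP bit; otherwise behave exactly like
	 * plain virtio_net. virtnet_bypass_register() is hypothetical. */
	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
		err = virtnet_bypass_register(vi);

No bit, no team/bond, and existing guests see no behaviour change.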

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 16:58                     ` Jiri Pirko
@ 2018-02-21 17:56                       ` Alexander Duyck
  2018-02-21 17:56                         ` [virtio-dev] " Alexander Duyck
  1 sibling, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-21 17:56 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	Netdev, David Miller

On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>> stuck with this ugly thing forever...
>>>>>>>
>>>>>>> Could you at least make some common code that is shared in between
>>>>>>> netvsc and virtio_net so this is handled in exacly the same way in both?
>>>>>>
>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>(virtio) bug-compatible with netvsc.
>>>>>
>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinition
>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>> and make the solution based on team/bond.
>>>>>
>>>>>
>>>>>>
>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>using NM.
>>>>>
>>>>> Can be done in NM, networkd or other network management tools.
>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>
>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>> and half.
>>>>>
>>>>> You can just run teamd with config option "kidnap" like this:
>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>
>>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>>> or whenever teamd sees another netdev to change mac to his,
>>>>> it enslaves it.
>>>>>
>>>>> Here's the patch (quick and dirty):
>>>>>
>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>
>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>
>>>>So this doesn't really address the original problem we were trying to
>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>has to do with configuration. Specifically what our patch is
>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>upgrade their customer to SR-IOV support and live migration without
>>>>requiring them to reconfigure their guest. So the general idea with
>>>>our patch is to take a VM that is running with virtio_net only and
>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>that was originally configured for virtio only.
>>>>
>>>>The problem with your solution is we already have teaming and bonding
>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>That is all well and good as long as you are willing to keep around
>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>
>>> You don't need 2 images. You need only one. The one with the team setup.
>>> That's it. If another netdev with the same mac appears, teamd will
>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>> virtio_net.
>>
>>Isn't that going to cause the routing table to get messed up when we
>>rearrange the netdevs? We don't want to have an significant disruption
>> in traffic when we are adding/removing the VF. It seems like we would
>>need to invalidate any entries that were configured for the virtio_net
>>and reestablish them on the new team interface. Part of the criteria
>>we have been working with is that we should be able to transition from
>>having a VF to not or vice versa without seeing any significant
>>disruption in the traffic.
>
> What? You have routes on the team netdev. virtio_net and VF are only
> slaves. What are you talking about? I don't get it :/

So lets walk though this by example. The general idea of the base case
for all this is somebody starting with virtio_net, we will call the
interface "ens1" for now. It comes up and is assigned a dhcp address
and everything works as expected. Now in order to get better
performance we want to add a VF "ens2", but we don't want a new IP
address. Now if I understand correctly what will happen is that when
"ens2" appears on the system teamd will then create a new team
interface "team0". Before teamd can enslave ens1 it has to down the
interface if I understand things correctly. This means that we have to
disrupt network traffic in order for this to work.

To give you an idea of where we were before this became about trying
to do this in the team or bonding driver, we were debating a 2 netdev
model versus a 3 netdev model. I will call out the model and the
advantages/disadvantages of those below.

2 Netdev model, "ens1", enslaves "ens2".
- Requires dropping in-driver XDP in order to work (won't capture VF
traffic otherwise)
- VF takes performance hit for extra qdisc/Tx queue lock of virtio_net interface
- If you ass-u-me (I haven't been a fan of this model if you can't
tell) that it is okay to rip out in-driver XDP from virtio_net, then
you could transition between base virtio, virtio w/ backup bit set.
- Works for netvsc because they limit their features (no in-driver
XDP) to guarantee this works.

3 Netdev model, "ens1", enslaves "ens1nbackup" and "ens2"
- Exposes 2 netdevs "ens1" and "ens1nbackup" when only virtio is present
- No extra qdisc or locking
- All virtio_net original functionality still present
- Not able to transition from virtio to virtio w/ backup without
disruption (requires hot-plug)

The way I see it the only way your team setup could work would be
something closer to the 3 netdev model. Basically we would be
requiring the user to always have the team0 present in order to make
certain that anything like XDP would be run on the team interface
instead of assuming that the virtio_net could run by itself. I will
add it as a third option here to compare to the other 2.

3 Netdev "team" model, "team0", enslaves "ens1" and "ens2"
- Requires guest to configure teamd
- Exposes "team0" and "ens1" when only virtio is present
- No extra qdisc or locking
- Doesn't require "backup" bit in virtio

>>
>>Also how does this handle any static configuration? I am assuming that
>>everything here assumes the team will be brought up as soon as it is
>>seen and assigned a DHCP address.
>
> Again. You configure whatever you need on the team netdev.

Just so we are clear, are you then saying that the team0 interface
will always be present with this configuration? You had made it sound
like it would disappear if you didn't have at least 2 interfaces.

>>
>>The solution as you have proposed seems problematic at best. I don't
>>see how the team solution works without introducing some sort of
>>traffic disruption to either add/remove the VF and bring up/tear down
>>the team interface. At that point we might as well just give up on
>>this piece of live migration support entirely since the disruption was
>>what we were trying to avoid. We might as well just hotplug out the VF
>>and hotplug in a virtio at the same bus device and function number and
>>just let udev take care of renaming it for us. The idea was supposed
>>to be a seamless transition between the two interfaces.
>
> Alex. What you are trying to do in this patchset and what netvsc does it
> essentialy in-driver bonding. Same thing mechanism, rx_handler,
> everything. I don't really understand what are you talking about. With
> use of team you will get exactly the same behaviour.

So the goal of the "in-driver bonding" is to make the bonding as
non-intrusive as possible and require as little user intervention as
possible. I agree that much of the handling is the same, however the
control structure and requirements are significantly different. That
has been what I have been trying to explain. You keep wanting to use
the existing structures, but they don't really apply cleanly because
they push control for the interface up into the guest, and that
doesn't make much sense in the case of virtualization. What is
happening here is that we are exposing a bond that the guest should
have no control over, or at least as little as possible. In addition
making the user have to add additional configuration in the guest
means that there is that much more that can go wrong if they screw it
up.

The other problem here is that the transition needs to be as seamless
as possible between just a standard virtio_net setup and this new
setup. With either the team or bonding setup you end up essentially
forcing the guest to have the bond/team always there even if they are
running only a single interface. Only if they "upgrade" the VM by
adding a VF then it finally gets to do anything.

What this comes down to for us is the following requirements:
1. The name of the interface cannot change when going from virtio_net,
to virtio_net being bypassed using a VF. We cannot create an interface
on top of the interface, if anything we need to push the original
virtio_net out of the way so that the new team interface takes its
place in the configuration of the system. Otherwise a VM with VF w/
live migration will require a different configuration than one that
just runs virtio_net.
2. We need some way to signal if this VM should be running in an
"upgraded" mode or not. We have been using the backup bit in
virtio_net to do that. If it isn't "upgraded" then we don't need the
team/bond and we can just run with virtio_net.
3. We cannot introduce any downtime on the interface when adding a VF
or removing it. The link must stay up the entire time and be able to
handle packets.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [virtio-dev] Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
@ 2018-02-21 17:56                         ` Alexander Duyck
  0 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-21 17:56 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, Samudrala, Sridhar, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>> stuck with this ugly thing forever...
>>>>>>>
>>>>>>> Could you at least make some common code that is shared in between
>>>>>>> netvsc and virtio_net so this is handled in exacly the same way in both?
>>>>>>
>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>(virtio) bug-compatible with netvsc.
>>>>>
>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinition
>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>> and make the solution based on team/bond.
>>>>>
>>>>>
>>>>>>
>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>using NM.
>>>>>
>>>>> Can be done in NM, networkd or other network management tools.
>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>
>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>> and half.
>>>>>
>>>>> You can just run teamd with config option "kidnap" like this:
>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>
>>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>>> or whenever teamd sees another netdev to change mac to his,
>>>>> it enslaves it.
>>>>>
>>>>> Here's the patch (quick and dirty):
>>>>>
>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>
>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>
>>>>So this doesn't really address the original problem we were trying to
>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>has to do with configuration. Specifically what our patch is
>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>upgrade their customer to SR-IOV support and live migration without
>>>>requiring them to reconfigure their guest. So the general idea with
>>>>our patch is to take a VM that is running with virtio_net only and
>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>that was originally configured for virtio only.
>>>>
>>>>The problem with your solution is we already have teaming and bonding
>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>That is all well and good as long as you are willing to keep around
>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>
>>> You don't need 2 images. You need only one. The one with the team setup.
>>> That's it. If another netdev with the same mac appears, teamd will
>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>> virtio_net.
>>
>>Isn't that going to cause the routing table to get messed up when we
>>rearrange the netdevs? We don't want to have an significant disruption
>> in traffic when we are adding/removing the VF. It seems like we would
>>need to invalidate any entries that were configured for the virtio_net
>>and reestablish them on the new team interface. Part of the criteria
>>we have been working with is that we should be able to transition from
>>having a VF to not or vice versa without seeing any significant
>>disruption in the traffic.
>
> What? You have routes on the team netdev. virtio_net and VF are only
> slaves. What are you talking about? I don't get it :/

So lets walk though this by example. The general idea of the base case
for all this is somebody starting with virtio_net, we will call the
interface "ens1" for now. It comes up and is assigned a dhcp address
and everything works as expected. Now in order to get better
performance we want to add a VF "ens2", but we don't want a new IP
address. Now if I understand correctly what will happen is that when
"ens2" appears on the system teamd will then create a new team
interface "team0". Before teamd can enslave ens1 it has to down the
interface if I understand things correctly. This means that we have to
disrupt network traffic in order for this to work.

To give you an idea of where we were before this became about trying
to do this in the team or bonding driver, we were debating a 2 netdev
model versus a 3 netdev model. I will call out the model and the
advantages/disadvantages of those below.

2 Netdev model, "ens1", enslaves "ens2".
- Requires dropping in-driver XDP in order to work (won't capture VF
traffic otherwise)
- VF takes performance hit for extra qdisc/Tx queue lock of virtio_net interface
- If you ass-u-me (I haven't been a fan of this model if you can't
tell) that it is okay to rip out in-driver XDP from virtio_net, then
you could transition between base virtio, virtio w/ backup bit set.
- Works for netvsc because they limit their features (no in-driver
XDP) to guarantee this works.

3 Netdev model, "ens1", enslaves "ens1nbackup" and "ens2"
- Exposes 2 netdevs "ens1" and "ens1nbackup" when only virtio is present
- No extra qdisc or locking
- All virtio_net original functionality still present
- Not able to transition from virtio to virtio w/ backup without
disruption (requires hot-plug)

The way I see it the only way your team setup could work would be
something closer to the 3 netdev model. Basically we would be
requiring the user to always have the team0 present in order to make
certain that anything like XDP would be run on the team interface
instead of assuming that the virtio_net could run by itself. I will
add it as a third option here to compare to the other 2.

3 Netdev "team" model, "team0", enslaves "ens1" and "ens2"
- Requires guest to configure teamd
- Exposes "team0" and "ens1" when only virtio is present
- No extra qdisc or locking
- Doesn't require "backup" bit in virtio
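
On the XDP point above: attaching the program at the team level would,
if I am not mistaken, be generic mode only, since team has no native
XDP hooks. Roughly (xdp_prog.o is a placeholder object file):

  ip link set dev team0 xdpgeneric obj xdp_prog.o sec xdp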

>>
>>Also how does this handle any static configuration? I am assuming that
>>everything here assumes the team will be brought up as soon as it is
>>seen and assigned a DHCP address.
>
> Again. You configure whatever you need on the team netdev.

Just so we are clear, are you then saying that the team0 interface
will always be present with this configuration? You had made it sound
like it would disappear if you didn't have at least 2 interfaces.

>>
>>The solution as you have proposed seems problematic at best. I don't
>>see how the team solution works without introducing some sort of
>>traffic disruption to either add/remove the VF and bring up/tear down
>>the team interface. At that point we might as well just give up on
>>this piece of live migration support entirely since the disruption was
>>what we were trying to avoid. We might as well just hotplug out the VF
>>and hotplug in a virtio at the same bus device and function number and
>>just let udev take care of renaming it for us. The idea was supposed
>>to be a seamless transition between the two interfaces.
>
> Alex. What you are trying to do in this patchset and what netvsc does is
> essentially in-driver bonding. Same mechanism, rx_handler,
> everything. I don't really understand what you are talking about. With
> use of team you will get exactly the same behaviour.

So the goal of the "in-driver bonding" is to make the bonding as
non-intrusive as possible and require as little user intervention as
possible. I agree that much of the handling is the same, however the
control structure and requirements are significantly different. That
has been what I have been trying to explain. You keep wanting to use
the existing structures, but they don't really apply cleanly because
they push control for the interface up into the guest, and that
doesn't make much sense in the case of virtualization. What is
happening here is that we are exposing a bond that the guest should
have no control over, or at least as little as possible. In addition
making the user have to add additional configuration in the guest
means that there is that much more that can go wrong if they screw it
up.

The other problem here is that the transition needs to be as seamless
as possible between just a standard virtio_net setup and this new
setup. With either the team or bonding setup you end up essentially
forcing the guest to have the bond/team always there even if they are
running only a single interface. Only when they "upgrade" the VM by
adding a VF does it finally get to do anything.

What this comes down to for us is the following requirements:
1. The name of the interface cannot change when going from virtio_net
to virtio_net being bypassed using a VF. We cannot create an interface
on top of the interface; if anything, we need to push the original
virtio_net out of the way so that the new team interface takes its
place in the configuration of the system. Otherwise a VM with VF w/
live migration will require a different configuration than one that
just runs virtio_net.
2. We need some way to signal if this VM should be running in an
"upgraded" mode or not. We have been using the backup bit in
virtio_net to do that. If it isn't "upgraded" then we don't need the
team/bond and we can just run with virtio_net.
3. We cannot introduce any downtime on the interface when adding a VF
or removing it. The link must stay up the entire time and be able to
handle packets.



* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 17:56                         ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-02-21 19:38                         ` Jiri Pirko
  2018-02-21 20:57                             ` [virtio-dev] " Alexander Duyck
  2018-02-21 20:57                           ` Alexander Duyck
  -1 siblings, 2 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-21 19:38 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	Netdev, David Miller

Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.duyck@gmail.com wrote:
>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>>> stuck with this ugly thing forever...
>>>>>>>>
>>>>>>>> Could you at least make some common code that is shared in between
>>>>>>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>>>>>>
>>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>>(virtio) bug-compatible with netvsc.
>>>>>>
>>>>>> Yeah. The netvsc solution is a dangerous precedent here and in my opinion
>>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>>> and make the solution based on team/bond.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>>using NM.
>>>>>>
>>>>>> Can be done in NM, networkd or other network management tools.
>>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>>
>>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>>> and half.
>>>>>>
>>>>>> You can just run teamd with config option "kidnap" like this:
>>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>>
>>>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>>>> or whenever teamd sees another netdev to change mac to his,
>>>>>> it enslaves it.
>>>>>>
>>>>>> Here's the patch (quick and dirty):
>>>>>>
>>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>>
>>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>>
>>>>>So this doesn't really address the original problem we were trying to
>>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>>has to do with configuration. Specifically what our patch is
>>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>>upgrade their customer to SR-IOV support and live migration without
>>>>>requiring them to reconfigure their guest. So the general idea with
>>>>>our patch is to take a VM that is running with virtio_net only and
>>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>>that was originally configured for virtio only.
>>>>>
>>>>>The problem with your solution is we already have teaming and bonding
>>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>>That is all well and good as long as you are willing to keep around
>>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>>
>>>> You don't need 2 images. You need only one. The one with the team setup.
>>>> That's it. If another netdev with the same mac appears, teamd will
>>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>>> virtio_net.
>>>
>>>Isn't that going to cause the routing table to get messed up when we
>>>rearrange the netdevs? We don't want to have a significant disruption
>>>in traffic when we are adding/removing the VF. It seems like we would
>>>need to invalidate any entries that were configured for the virtio_net
>>>and reestablish them on the new team interface. Part of the criteria
>>>we have been working with is that we should be able to transition from
>>>having a VF to not or vice versa without seeing any significant
>>>disruption in the traffic.
>>
>> What? You have routes on the team netdev. virtio_net and VF are only
>> slaves. What are you talking about? I don't get it :/
>
>So let's walk through this by example. The general idea of the base case
>for all this is somebody starting with virtio_net; we will call the
>interface "ens1" for now. It comes up and is assigned a DHCP address
>and everything works as expected. Now in order to get better
>performance we want to add a VF "ens2", but we don't want a new IP
>address. Now if I understand correctly what will happen is that when
>"ens2" appears on the system teamd will then create a new team
>interface "team0". Before teamd can enslave ens1 it has to down the

No, you don't understand that correctly.

There is always ens1 and team0. ens1 is a slave of team0. team0 is the
interface to use, to set ip on etc.

When ens2 appears, it gets enslaved to team0 as well.
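
So the image would carry something like this from boot (a sketch; the
"kidnap" option is from my quick-and-dirty teamd patch above):

  teamd -d -t team0 -c '{"runner": {"name": "activebackup"}, "kidnap": true}'
  ip link set team0 up
  # DHCP/addresses/routes are configured on team0 only; ens1 gets
  # kidnapped right away, ens2 whenever it appears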


>interface if I understand things correctly. This means that we have to
>disrupt network traffic in order for this to work.
>
>To give you an idea of where we were before this became about trying
>to do this in the team or bonding driver, we were debating a 2 netdev
>model versus a 3 netdev model. I will call out the model and the
>advantages/disadvantages of those below.
>
>2 Netdev model, "ens1", enslaves "ens2".
>- Requires dropping in-driver XDP in order to work (won't capture VF
>traffic otherwise)
>- VF takes performance hit for extra qdisc/Tx queue lock of virtio_net interface
>- If you ass-u-me (I haven't been a fan of this model if you can't
>tell) that it is okay to rip out in-driver XDP from virtio_net, then
>you could transition between base virtio, virtio w/ backup bit set.
>- Works for netvsc because they limit their features (no in-driver
>XDP) to guarantee this works.
>
>3 Netdev model, "ens1", enslaves "ens1nbackup" and "ens2"
>- Exposes 2 netdevs "ens1" and "ens1nbackup" when only virtio is present
>- No extra qdisc or locking
>- All virtio_net original functionality still present
>- Not able to transition from virtio to virtio w/ backup without
>disruption (requires hot-plug)
>
>The way I see it the only way your team setup could work would be
>something closer to the 3 netdev model. Basically we would be
>requiring the user to always have the team0 present in order to make
>certain that anything like XDP would be run on the team interface
>instead of assuming that the virtio_net could run by itself. I will
>add it as a third option here to compare to the other 2.

Yes.


>
>3 Netdev "team" model, "team0", enslaves "ens1" and "ens2"
>- Requires guest to configure teamd
>- Exposes "team0" and "ens1" when only virtio is present
>- No extra qdisc or locking
>- Doesn't require "backup" bit in virtio
>
>>>
>>>Also how does this handle any static configuration? I am assuming that
>>>everything here assumes the team will be brought up as soon as it is
>>>seen and assigned a DHCP address.
>>
>> Again. You configure whatever you need on the team netdev.
>
>Just so we are clear, are you then saying that the team0 interface
>will always be present with this configuration? You had made it sound

Of course.


>like it would disappear if you didn't have at least 2 interfaces.

Where did I make it sound like that? No.


>
>>>
>>>The solution as you have proposed seems problematic at best. I don't
>>>see how the team solution works without introducing some sort of
>>>traffic disruption to either add/remove the VF and bring up/tear down
>>>the team interface. At that point we might as well just give up on
>>>this piece of live migration support entirely since the disruption was
>>>what we were trying to avoid. We might as well just hotplug out the VF
>>>and hotplug in a virtio at the same bus device and function number and
>>>just let udev take care of renaming it for us. The idea was supposed
>>>to be a seamless transition between the two interfaces.
>>
>> Alex. What you are trying to do in this patchset and what netvsc does is
>> essentially in-driver bonding. Same mechanism, rx_handler,
>> everything. I don't really understand what you are talking about. With
>> use of team you will get exactly the same behaviour.
>
>So the goal of the "in-driver bonding" is to make the bonding as
>non-intrusive as possible and require as little user intervention as
>possible. I agree that much of the handling is the same, however the
>control structure and requirements are significantly different. That
>has been what I have been trying to explain. You keep wanting to use
>the existing structures, but they don't really apply cleanly because
>they push control for the interface up into the guest, and that
>doesn't make much sense in the case of virtualization. What is
>happening here is that we are exposing a bond that the guest should
>have no control over, or at least as little as possible. In addition
>making the user have to add additional configuration in the guest
>means that there is that much more that can go wrong if they screw it
>up.
>
>The other problem here is that the transition needs to be as seamless
>as possible between just a standard virtio_net setup and this new
>setup. With either the team or bonding setup you end up essentially
>forcing the guest to have the bond/team always there even if they are
>running only a single interface. Only when they "upgrade" the VM by
>adding a VF does it finally get to do anything.

Yeah. There is certainly a dilemma. We have to choose between
1) a weird and hackish in-driver semi-bonding that would be simple
   for the user.
2) the standard way that would be perhaps slightly more complicated
   for the user.

>
>What this comes down to for us is the following requirements:
>1. The name of the interface cannot change when going from virtio_net
>to virtio_net being bypassed using a VF. We cannot create an interface
>on top of the interface; if anything, we need to push the original
>virtio_net out of the way so that the new team interface takes its
>place in the configuration of the system. Otherwise a VM with VF w/
>live migration will require a different configuration than one that
>just runs virtio_net.

Team driver netdev is still the same, no name changes.


>2. We need some way to signal if this VM should be running in an
>"upgraded" mode or not. We have been using the backup bit in
>virtio_net to do that. If it isn't "upgraded" then we don't need the
>team/bond and we can just run with virtio_net.

I don't see why the team cannot be there always.


>3. We cannot introduce any downtime on the interface when adding a VF
>or removing it. The link must stay up the entire time and be able to
>handle packets.

Sure. That should be handled by the team. Whenever the VF netdev
disappears, traffic would switch over to the virtio_net. The benefit of
your in-driver bonding solution is that qemu can actually signal the
guest driver that the disappearance would happen and do the switch a bit
earlier. But that is something that might be implemented in a different
channel where the kernel might get a notification that a certain PCI
device is going to disappear so everyone could prepare. Just an idea.
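
On the host side that could look something like this (QEMU monitor,
ids made up):

  (qemu) set_link hostnet1 off   # guest sees VF link down, switches over
  (qemu) device_del vf-nic       # then actually hot-unplug the VF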


* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 19:38                         ` Jiri Pirko
@ 2018-02-21 20:57                             ` Alexander Duyck
  2018-02-21 20:57                           ` Alexander Duyck
  1 sibling, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-21 20:57 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, Samudrala, Sridhar, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.duyck@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>>>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>>>> stuck with this ugly thing forever...
>>>>>>>>>
>>>>>>>>> Could you at least make some common code that is shared in between
>>>>>>>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>>>>>>>
>>>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>>>(virtio) bug-compatible with netvsc.
>>>>>>>
>>>>>>> Yeah. The netvsc solution is a dangerous precedent here and in my opinion
>>>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>>>> and make the solution based on team/bond.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>>>using NM.
>>>>>>>
>>>>>>> Can be done in NM, networkd or other network management tools.
>>>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>>>
>>>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>>>> and half.
>>>>>>>
>>>>>>> You can just run teamd with config option "kidnap" like this:
>>>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>>>
>>>>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>>>>> or whenever teamd sees another netdev to change mac to his,
>>>>>>> it enslaves it.
>>>>>>>
>>>>>>> Here's the patch (quick and dirty):
>>>>>>>
>>>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>>>
>>>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>>>
>>>>>>So this doesn't really address the original problem we were trying to
>>>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>>>has to do with configuration. Specifically what our patch is
>>>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>>>upgrade their customer to SR-IOV support and live migration without
>>>>>>requiring them to reconfigure their guest. So the general idea with
>>>>>>our patch is to take a VM that is running with virtio_net only and
>>>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>>>that was originally configured for virtio only.
>>>>>>
>>>>>>The problem with your solution is we already have teaming and bonding
>>>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>>>That is all well and good as long as you are willing to keep around
>>>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>>>
>>>>> You don't need 2 images. You need only one. The one with the team setup.
>>>>> That's it. If another netdev with the same mac appears, teamd will
>>>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>>>> virtio_net.
>>>>
>>>>Isn't that going to cause the routing table to get messed up when we
>>>>rearrange the netdevs? We don't want to have a significant disruption
>>>>in traffic when we are adding/removing the VF. It seems like we would
>>>>need to invalidate any entries that were configured for the virtio_net
>>>>and reestablish them on the new team interface. Part of the criteria
>>>>we have been working with is that we should be able to transition from
>>>>having a VF to not or vice versa without seeing any significant
>>>>disruption in the traffic.
>>>
>>> What? You have routes on the team netdev. virtio_net and VF are only
>>> slaves. What are you talking about? I don't get it :/
>>
>>So let's walk through this by example. The general idea of the base case
>>for all this is somebody starting with virtio_net; we will call the
>>interface "ens1" for now. It comes up and is assigned a DHCP address
>>and everything works as expected. Now in order to get better
>>performance we want to add a VF "ens2", but we don't want a new IP
>>address. Now if I understand correctly what will happen is that when
>>"ens2" appears on the system teamd will then create a new team
>>interface "team0". Before teamd can enslave ens1 it has to down the
>
> No, you don't understand that correctly.
>
> There is always ens1 and team0. ens1 is a slave of team0. team0 is the
> interface to use, to set ip on etc.
>
> When ens2 appears, it gets enslaved to team0 as well.
>
>
>>interface if I understand things correctly. This means that we have to
>>disrupt network traffic in order for this to work.
>>
>>To give you an idea of where we were before this became about trying
>>to do this in the team or bonding driver, we were debating a 2 netdev
>>model versus a 3 netdev model. I will call out the model and the
>>advantages/disadvantages of those below.
>>
>>2 Netdev model, "ens1", enslaves "ens2".
>>- Requires dropping in-driver XDP in order to work (won't capture VF
>>traffic otherwise)
>>- VF takes performance hit for extra qdisc/Tx queue lock of virtio_net interface
>>- If you ass-u-me (I haven't been a fan of this model if you can't
>>tell) that it is okay to rip out in-driver XDP from virtio_net, then
>>you could transition between base virtio, virtio w/ backup bit set.
>>- Works for netvsc because they limit their features (no in-driver
>>XDP) to guarantee this works.
>>
>>3 Netdev model, "ens1", enslaves "ens1nbackup" and "ens2"
>>- Exposes 2 netdevs "ens1" and "ens1nbackup" when only virtio is present
>>- No extra qdisc or locking
>>- All virtio_net original functionality still present
>>- Not able to transition from virtio to virtio w/ backup without
>>disruption (requires hot-plug)
>>
>>The way I see it the only way your team setup could work would be
>>something closer to the 3 netdev model. Basically we would be
>>requiring the user to always have the team0 present in order to make
>>certain that anything like XDP would be run on the team interface
>>instead of assuming that the virtio_net could run by itself. I will
>>add it as a third option here to compare to the other 2.
>
> Yes.
>
>
>>
>>3 Netdev "team" model, "team0", enslaves "ens1" and "ens2"
>>- Requires guest to configure teamd
>>- Exposes "team0" and "ens1" when only virtio is present
>>- No extra qdisc or locking
>>- Doesn't require "backup" bit in virtio
>>
>>>>
>>>>Also how does this handle any static configuration? I am assuming that
>>>>everything here assumes the team will be brought up as soon as it is
>>>>seen and assigned a DHCP address.
>>>
>>> Again. You configure whatever you need on the team netdev.
>>
>>Just so we are clear, are you then saying that the team0 interface
>>will always be present with this configuration? You had made it sound
>
> Of course.
>
>
>>like it would disappear if you didn't have at least 2 interfaces.
>
> Where did I make it sound like that? No.

I think it was a bit of misspeak/misread. Specifically, I am thinking of:
  You don't need 2 images. You need only one. The one with the
  team setup. That's it. If another netdev with the same mac appears,
  teamd will enslave it and run traffic on it. If not, ok, you'll go only
  through virtio_net.

I read that as there being no team if the VF wasn't present, since
otherwise you would still be going through team and then virtio_net.

>
>>
>>>>
>>>>The solution as you have proposed seems problematic at best. I don't
>>>>see how the team solution works without introducing some sort of
>>>>traffic disruption to either add/remove the VF and bring up/tear down
>>>>the team interface. At that point we might as well just give up on
>>>>this piece of live migration support entirely since the disruption was
>>>>what we were trying to avoid. We might as well just hotplug out the VF
>>>>and hotplug in a virtio at the same bus device and function number and
>>>>just let udev take care of renaming it for us. The idea was supposed
>>>>to be a seamless transition between the two interfaces.
>>>
>>> Alex. What you are trying to do in this patchset and what netvsc does is
>>> essentially in-driver bonding. Same mechanism, rx_handler,
>>> everything. I don't really understand what you are talking about. With
>>> use of team you will get exactly the same behaviour.
>>
>>So the goal of the "in-driver bonding" is to make the bonding as
>>non-intrusive as possible and require as little user intervention as
>>possible. I agree that much of the handling is the same, however the
>>control structure and requirements are significantly different. That
>>has been what I have been trying to explain. You keep wanting to use
>>the existing structures, but they don't really apply cleanly because
>>they push control for the interface up into the guest, and that
>>doesn't make much sense in the case of virtualization. What is
>>happening here is that we are exposing a bond that the guest should
>>have no control over, or at least as little as possible. In addition
>>making the user have to add additional configuration in the guest
>>means that there is that much more that can go wrong if they screw it
>>up.
>>
>>The other problem here is that the transition needs to be as seamless
>>as possible between just a standard virtio_net setup and this new
>>setup. With either the team or bonding setup you end up essentially
>>forcing the guest to have the bond/team always there even if they are
>>running only a single interface. Only when they "upgrade" the VM by
>>adding a VF does it finally get to do anything.
>
> Yeah. There is certainly a dilemma. We have to choose between
> 1) a weird and hackish in-driver semi-bonding that would be simple
>    for the user.
> 2) the standard way that would be perhaps slightly more complicated
>    for the user.

The problem is that for us option 2 is quite a bit uglier. It
basically means telling all the distros and such that their cloud
images have to use team by default on all virtio_net interfaces. It
pretty much means we have to throw this away as a possible solution,
since you are requiring guest changes that most customers/OS vendors
would never accept.

At least with our solution it was the driver making use of the
functionality if a given feature bit was set. The teaming solution as
proposed doesn't even give us that option.
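
From the guest side that gating is visible in sysfs, something along
these lines (the exact bit position is whatever the spec assigns to
BACKUP):

  # virtio feature bits are exposed as a string of 0/1 per device
  cat /sys/bus/virtio/devices/virtio0/features
  # if the BACKUP bit reads 1 the driver spawns the bypass master,
  # otherwise you just get plain virtio_net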

>>
>>What this comes down to for us is the following requirements:
>>1. The name of the interface cannot change when going from virtio_net
>>to virtio_net being bypassed using a VF. We cannot create an interface
>>on top of the interface; if anything, we need to push the original
>>virtio_net out of the way so that the new team interface takes its
>>place in the configuration of the system. Otherwise a VM with VF w/
>>live migration will require a different configuration than one that
>>just runs virtio_net.
>
> Team driver netdev is still the same, no name changes.

Right. Basically we need to have the renaming occur so that any
existing config gets moved to the upper interface instead of having to
rely on configuration being adjusted for the team interface.
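
To illustrate with the names from the 3 netdev model earlier (MAC made
up):

  ip -br link show
  # ens1          UP  52:54:00:12:34:56  <- bypass master, keeps the
  #                                         name and existing config
  # ens1nbackup   UP  52:54:00:12:34:56  <- original virtio_net slave
  # ens2          UP  52:54:00:12:34:56  <- VF slave, used when present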

>>2. We need some way to signal if this VM should be running in an
>>"upgraded" mode or not. We have been using the backup bit in
>>virtio_net to do that. If it isn't "upgraded" then we don't need the
>>team/bond and we can just run with virtio_net.
>
> I don't see why the team cannot be there always.

It is more of a logistical nightmare. Part of the goal here was to work
with the cloud base images that are out there such as
https://alt.fedoraproject.org/cloud/. With just the kernel changes the
overhead for this stays fairly small and would be pulled in as just a
standard part of the kernel update process. The virtio bypass only
pops up if the backup bit is present. The team solution instead
requires that the base image use the team driver on virtio_net when it
sees one. I doubt the OSVs would want to do that just because SR-IOV
isn't that popular of a case.

>>3. We cannot introduce any downtime on the interface when adding a VF
>>or removing it. The link must stay up the entire time and be able to
>>handle packets.
>
> Sure. That should be handled by the team. Whenever the VF netdev
> disappears, traffic would switch over to the virtio_net. The benefit of
> your in-driver bonding solution is that qemu can actually signal the
> guest driver that the disappearance would happen and do the switch a bit
> earlier. But that is something that might be implemented in a different
> channel where the kernel might get a notification that a certain PCI
> device is going to disappear so everyone could prepare. Just an idea.

The signaling isn't too much of an issue since we can just tweak the
link state of the VF or virtio manually to report the link up or down
prior to the hot-plug. Now that we are on the same page with the team0
interface always being there, I don't think 3 is much of a concern
with the solution as you proposed.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 19:38                         ` Jiri Pirko
  2018-02-21 20:57                             ` [virtio-dev] " Alexander Duyck
@ 2018-02-21 20:57                           ` Alexander Duyck
  1 sibling, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-21 20:57 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	Netdev, David Miller

On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.duyck@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>>>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>>>> stuck with this ugly thing forever...
>>>>>>>>>
>>>>>>>>> Could you at least make some common code that is shared in between
>>>>>>>>> netvsc and virtio_net so this is handled in exacly the same way in both?
>>>>>>>>
>>>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>>>(virtio) bug-compatible with netvsc.
>>>>>>>
>>>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinition
>>>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>>>> and make the solution based on team/bond.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>>>using NM.
>>>>>>>
>>>>>>> Can be done in NM, networkd or other network management tools.
>>>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>>>
>>>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>>>> and half.
>>>>>>>
>>>>>>> You can just run teamd with config option "kidnap" like this:
>>>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>>>
>>>>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>>>>> or whenever teamd sees another netdev to change mac to his,
>>>>>>> it enslaves it.
>>>>>>>
>>>>>>> Here's the patch (quick and dirty):
>>>>>>>
>>>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>>>
>>>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>>>
>>>>>>So this doesn't really address the original problem we were trying to
>>>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>>>has to do with configuration. Specifically what our patch is
>>>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>>>upgrade their customer to SR-IOV support and live migration without
>>>>>>requiring them to reconfigure their guest. So the general idea with
>>>>>>our patch is to take a VM that is running with virtio_net only and
>>>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>>>that was originally configured for virtio only.
>>>>>>
>>>>>>The problem with your solution is we already have teaming and bonding
>>>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>>>That is all well and good as long as you are willing to keep around
>>>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>>>
>>>>> You don't need 2 images. You need only one. The one with the team setup.
>>>>> That's it. If another netdev with the same mac appears, teamd will
>>>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>>>> virtio_net.
>>>>
>>>>Isn't that going to cause the routing table to get messed up when we
>>>>rearrange the netdevs? We don't want to have an significant disruption
>>>> in traffic when we are adding/removing the VF. It seems like we would
>>>>need to invalidate any entries that were configured for the virtio_net
>>>>and reestablish them on the new team interface. Part of the criteria
>>>>we have been working with is that we should be able to transition from
>>>>having a VF to not or vice versa without seeing any significant
>>>>disruption in the traffic.
>>>
>>> What? You have routes on the team netdev. virtio_net and VF are only
>>> slaves. What are you talking about? I don't get it :/
>>
>>So lets walk though this by example. The general idea of the base case
>>for all this is somebody starting with virtio_net, we will call the
>>interface "ens1" for now. It comes up and is assigned a dhcp address
>>and everything works as expected. Now in order to get better
>>performance we want to add a VF "ens2", but we don't want a new IP
>>address. Now if I understand correctly what will happen is that when
>>"ens2" appears on the system teamd will then create a new team
>>interface "team0". Before teamd can enslave ens1 it has to down the
>
> No, you don't understand that correctly.
>
> There is always ens1 and team0. ens1 is a slave of team0. team0 is the
> interface to use, to set ip on etc.
>
> When ens2 appears, it gets enslaved to team0 as well.
>
>
>>interface if I understand things correctly. This means that we have to
>>disrupt network traffic in order for this to work.
>>
>>To give you an idea of where we were before this became about trying
>>to do this in the team or bonding driver, we were debating a 2 netdev
>>model versus a 3 netdev model. I will call out the model and the
>>advantages/disadvantages of those below.
>>
>>2 Netdev model, "ens1", enslaves "ens2".
>>- Requires dropping in-driver XDP in order to work (won't capture VF
>>traffic otherwise)
>>- VF takes performance hit for extra qdisc/Tx queue lock of virtio_net interface
>>- If you ass-u-me (I haven't been a fan of this model if you can't
>>tell) that it is okay to rip out in-driver XDP from virtio_net, then
>>you could transition between base virtio, virtio w/ backup bit set.
>>- Works for netvsc because they limit their features (no in-driver
>>XDP) to guarantee this works.
>>
>>3 Netdev model, "ens1", enslaves "ens1nbackup" and "ens2"
>>- Exposes 2 netdevs "ens1" and "ens1nbackup" when only virtio is present
>>- No extra qdisc or locking
>>- All virtio_net original functionality still present
>>- Not able to transition from virtio to virtio w/ backup without
>>disruption (requires hot-plug)
>>
>>The way I see it the only way your team setup could work would be
>>something closer to the 3 netdev model. Basically we would be
>>requiring the user to always have the team0 present in order to make
>>certain that anything like XDP would be run on the team interface
>>instead of assuming that the virtio_net could run by itself. I will
>>add it as a third option here to compare to the other 2.
>
> Yes.
>
>
>>
>>3 Netdev "team" model, "team0", enslaves "ens1" and "ens2"
>>- Requires guest to configure teamd
>>- Exposes "team0" and "ens1" when only virtio is present
>>- No extra qdisc or locking
>>- Doesn't require "backup" bit in virtio
>>
>>>>
>>>>Also how does this handle any static configuration? I am assuming that
>>>>everything here assumes the team will be brought up as soon as it is
>>>>seen and assigned a DHCP address.
>>>
>>> Again. You configure whatever you need on the team netdev.
>>
>>Just so we are clear, are you then saying that the team0 interface
>>will always be present with this configuration? You had made it sound
>
> Of course.
>
>
>>like it would disappear if you didn't have at least 2 interfaces.
>
> Where did I make it sound like that? No.

I think it was a bit of misspeak/misread specifically I am thinking of:
  You don't need 2 images. You need only one. The one with the
  team setup. That's it. If another netdev with the same mac appears,
  teamd will enslave it and run traffic on it. If not, ok, you'll go only
  through virtio_net.

I read that as there being no team if the VF wasn't present since you
would still be going through team and then virtio_net otherwise.

>
>>
>>>>
>>>>The solution as you have proposed seems problematic at best. I don't
>>>>see how the team solution works without introducing some sort of
>>>>traffic disruption to either add/remove the VF and bring up/tear down
>>>>the team interface. At that point we might as well just give up on
>>>>this piece of live migration support entirely since the disruption was
>>>>what we were trying to avoid. We might as well just hotplug out the VF
>>>>and hotplug in a virtio at the same bus device and function number and
>>>>just let udev take care of renaming it for us. The idea was supposed
>>>>to be a seamless transition between the two interfaces.
>>>
>>> Alex. What you are trying to do in this patchset and what netvsc does it
>>> essentialy in-driver bonding. Same thing mechanism, rx_handler,
>>> everything. I don't really understand what are you talking about. With
>>> use of team you will get exactly the same behaviour.
>>
>>So the goal of the "in-driver bonding" is to make the bonding as
>>non-intrusive as possible and require as little user intervention as
>>possible. I agree that much of the handling is the same, however the
>>control structure and requirements are significantly different. That
>>has been what I have been trying to explain. You keep wanting to use
>>the existing structures, but they don't really apply cleanly because
>>they push control for the interface up into the guest, and that
>>doesn't make much sense in the case of virtualization. What is
>>happening here is that we are exposing a bond that the guest should
>>have no control over, or at least as little as possible. In addition
>>making the user have to add additional configuration in the guest
>>means that there is that much more that can go wrong if they screw it
>>up.
>>
>>The other problem here is that the transition needs to be as seamless
>>as possible between just a standard virtio_net setup and this new
>>setup. With either the team or bonding setup you end up essentially
>>forcing the guest to have the bond/team always there even if they are
>>running only a single interface. Only if they "upgrade" the VM by
>>adding a VF then it finally gets to do anything.
>
> Yeah. There is certainly a dilemma. We have to choose between
> 1) weird and hackish in-driver semi-bonding that would be simple
>    for user.
> 2) the standard way that would be perhaps slighly more complicated
>    for user.

The problem is for us option 2 is quite a bit uglier. Basically it
means essentially telling all the distros and such that their cloud
images have to use team by default on all virtio_net interfaces. It
pretty much means we have to throw away this as a possible solution
since you are requiring guest changes that most customers/OS vendors
would ever accept.

At least with our solution it was the driver making use of the
functionality if a given feature bit was set. The teaming solution as
proposed doesn't even give us that option.

>>
>>What this comes down to for us is the following requirements:
>>1. The name of the interface cannot change when going from virtio_net,
>>to virtio_net being bypassed using a VF. We cannot create an interface
>>on top of the interface, if anything we need to push the original
>>virtio_net out of the way so that the new team interface takes its
>>place in the configuration of the system. Otherwise a VM with VF w/
>>live migration will require a different configuration than one that
>>just runs virtio_net.
>
> Team driver netdev is still the same, no name changes.

Right. Basically we need to have the renaming occur so that any
existing config gets moved to the upper interface instead of having to
rely on configuration being adjusted for the team interface.

>>2. We need some way to signal if this VM should be running in an
>>"upgraded" mode or not. We have been using the backup bit in
>>virtio_net to do that. If it isn't "upgraded" then we don't need the
>>team/bond and we can just run with virtio_net.
>
> I don't see why the team cannot be there always.

It is more the logistical nightmare. Part of the goal here was to work
with the cloud base images that are out there such as
https://alt.fedoraproject.org/cloud/. With just the kernel changes the
overhead for this stays fairly small and would be pulled in as just a
standard part of the kernel update process. The virtio bypass only
pops up if the backup bit is present. With the team solution it
requires that the base image use the team driver on virtio_net when it
sees one. I doubt the OSVs would want to do that just because SR-IOV
isn't that popular of a case.

>>3. We cannot introduce any downtime on the interface when adding a VF
>>or removing it. The link must stay up the entire time and be able to
>>handle packets.
>
> Sure. That should be handled by the team. Whenever the VF netdev
> disappears, traffic would switch over to the virtio_net. The benefit of
> your in-driver bonding solution is that qemu can actually signal the
> guest driver that the disappearance would happen and do the switch a bit
> earlier. But that is something that might be implemented in a different
> channel where the kernel might get notification that certain pci is
> going to disappear so everyone could prepare. Just an idea.

The signaling isn't too much of an issue since we can just tweak the
link state of the VF or virtio manually to report the link up or down
prior to the hot-plug. Now that we are on the same page with the team0
interface always being there, I don't think 3 is much of a concern
with the solution as you proposed.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [virtio-dev] Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
@ 2018-02-21 20:57                             ` Alexander Duyck
  0 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-21 20:57 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, Samudrala, Sridhar, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.duyck@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>>>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>>>> stuck with this ugly thing forever...
>>>>>>>>>
>>>>>>>>> Could you at least make some common code that is shared in between
>>>>>>>>> netvsc and virtio_net so this is handled in exacly the same way in both?
>>>>>>>>
>>>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>>>(virtio) bug-compatible with netvsc.
>>>>>>>
>>>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinition
>>>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>>>> and make the solution based on team/bond.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>>>using NM.
>>>>>>>
>>>>>>> Can be done in NM, networkd or other network management tools.
>>>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>>>
>>>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>>>> and half.
>>>>>>>
>>>>>>> You can just run teamd with config option "kidnap" like this:
>>>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>>>
>>>>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>>>>> or whenever teamd sees another netdev to change mac to his,
>>>>>>> it enslaves it.
>>>>>>>
>>>>>>> Here's the patch (quick and dirty):
>>>>>>>
>>>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>>>
>>>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>>>
>>>>>>So this doesn't really address the original problem we were trying to
>>>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>>>has to do with configuration. Specifically what our patch is
>>>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>>>upgrade their customer to SR-IOV support and live migration without
>>>>>>requiring them to reconfigure their guest. So the general idea with
>>>>>>our patch is to take a VM that is running with virtio_net only and
>>>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>>>that was originally configured for virtio only.
>>>>>>
>>>>>>The problem with your solution is we already have teaming and bonding
>>>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>>>That is all well and good as long as you are willing to keep around
>>>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>>>
>>>>> You don't need 2 images. You need only one. The one with the team setup.
>>>>> That's it. If another netdev with the same mac appears, teamd will
>>>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>>>> virtio_net.
>>>>
>>>>Isn't that going to cause the routing table to get messed up when we
>>>>rearrange the netdevs? We don't want to have an significant disruption
>>>> in traffic when we are adding/removing the VF. It seems like we would
>>>>need to invalidate any entries that were configured for the virtio_net
>>>>and reestablish them on the new team interface. Part of the criteria
>>>>we have been working with is that we should be able to transition from
>>>>having a VF to not or vice versa without seeing any significant
>>>>disruption in the traffic.
>>>
>>> What? You have routes on the team netdev. virtio_net and VF are only
>>> slaves. What are you talking about? I don't get it :/
>>
>>So lets walk though this by example. The general idea of the base case
>>for all this is somebody starting with virtio_net, we will call the
>>interface "ens1" for now. It comes up and is assigned a dhcp address
>>and everything works as expected. Now in order to get better
>>performance we want to add a VF "ens2", but we don't want a new IP
>>address. Now if I understand correctly what will happen is that when
>>"ens2" appears on the system teamd will then create a new team
>>interface "team0". Before teamd can enslave ens1 it has to down the
>
> No, you don't understand that correctly.
>
> There is always ens1 and team0. ens1 is a slave of team0. team0 is the
> interface to use, to set ip on etc.
>
> When ens2 appears, it gets enslaved to team0 as well.
>
>
>>interface if I understand things correctly. This means that we have to
>>disrupt network traffic in order for this to work.
>>
>>To give you an idea of where we were before this became about trying
>>to do this in the team or bonding driver, we were debating a 2 netdev
>>model versus a 3 netdev model. I will call out the model and the
>>advantages/disadvantages of those below.
>>
>>2 Netdev model, "ens1", enslaves "ens2".
>>- Requires dropping in-driver XDP in order to work (won't capture VF
>>traffic otherwise)
>>- VF takes performance hit for extra qdisc/Tx queue lock of virtio_net interface
>>- If you ass-u-me (I haven't been a fan of this model if you can't
>>tell) that it is okay to rip out in-driver XDP from virtio_net, then
>>you could transition between base virtio, virtio w/ backup bit set.
>>- Works for netvsc because they limit their features (no in-driver
>>XDP) to guarantee this works.
>>
>>3 Netdev model, "ens1", enslaves "ens1nbackup" and "ens2"
>>- Exposes 2 netdevs "ens1" and "ens1nbackup" when only virtio is present
>>- No extra qdisc or locking
>>- All virtio_net original functionality still present
>>- Not able to transition from virtio to virtio w/ backup without
>>disruption (requires hot-plug)
>>
>>The way I see it the only way your team setup could work would be
>>something closer to the 3 netdev model. Basically we would be
>>requiring the user to always have the team0 present in order to make
>>certain that anything like XDP would be run on the team interface
>>instead of assuming that the virtio_net could run by itself. I will
>>add it as a third option here to compare to the other 2.
>
> Yes.
>
>
>>
>>3 Netdev "team" model, "team0", enslaves "ens1" and "ens2"
>>- Requires guest to configure teamd
>>- Exposes "team0" and "ens1" when only virtio is present
>>- No extra qdisc or locking
>>- Doesn't require "backup" bit in virtio
>>
>>>>
>>>>Also how does this handle any static configuration? I am assuming that
>>>>everything here assumes the team will be brought up as soon as it is
>>>>seen and assigned a DHCP address.
>>>
>>> Again. You configure whatever you need on the team netdev.
>>
>>Just so we are clear, are you then saying that the team0 interface
>>will always be present with this configuration? You had made it sound
>
> Of course.
>
>
>>like it would disappear if you didn't have at least 2 interfaces.
>
> Where did I make it sound like that? No.

I think it was a bit of misspeak/misread specifically I am thinking of:
  You don't need 2 images. You need only one. The one with the
  team setup. That's it. If another netdev with the same mac appears,
  teamd will enslave it and run traffic on it. If not, ok, you'll go only
  through virtio_net.

I read that as there being no team if the VF wasn't present since you
would still be going through team and then virtio_net otherwise.

>
>>
>>>>
>>>>The solution as you have proposed seems problematic at best. I don't
>>>>see how the team solution works without introducing some sort of
>>>>traffic disruption to either add/remove the VF and bring up/tear down
>>>>the team interface. At that point we might as well just give up on
>>>>this piece of live migration support entirely since the disruption was
>>>>what we were trying to avoid. We might as well just hotplug out the VF
>>>>and hotplug in a virtio at the same bus device and function number and
>>>>just let udev take care of renaming it for us. The idea was supposed
>>>>to be a seamless transition between the two interfaces.
>>>
>>> Alex. What you are trying to do in this patchset and what netvsc does it
>>> essentialy in-driver bonding. Same thing mechanism, rx_handler,
>>> everything. I don't really understand what are you talking about. With
>>> use of team you will get exactly the same behaviour.
>>
>>So the goal of the "in-driver bonding" is to make the bonding as
>>non-intrusive as possible and require as little user intervention as
>>possible. I agree that much of the handling is the same, however the
>>control structure and requirements are significantly different. That
>>has been what I have been trying to explain. You keep wanting to use
>>the existing structures, but they don't really apply cleanly because
>>they push control for the interface up into the guest, and that
>>doesn't make much sense in the case of virtualization. What is
>>happening here is that we are exposing a bond that the guest should
>>have no control over, or at least as little as possible. In addition
>>making the user have to add additional configuration in the guest
>>means that there is that much more that can go wrong if they screw it
>>up.
>>
>>The other problem here is that the transition needs to be as seamless
>>as possible between just a standard virtio_net setup and this new
>>setup. With either the team or bonding setup you end up essentially
>>forcing the guest to have the bond/team always there even if they are
>>running only a single interface. Only if they "upgrade" the VM by
>>adding a VF then it finally gets to do anything.
>
> Yeah. There is certainly a dilemma. We have to choose between
> 1) weird and hackish in-driver semi-bonding that would be simple
>    for user.
> 2) the standard way that would be perhaps slightly more complicated
>    for user.

The problem is that, for us, option 2 is quite a bit uglier. Basically it
means telling all the distros and such that their cloud images have to
use team by default on all virtio_net interfaces. It pretty much means
we have to throw this away as a possible solution, since you are
requiring guest changes that most customers/OS vendors would never
accept.

At least with our solution, the driver only makes use of the
functionality if a given feature bit is set. The teaming solution as
proposed doesn't even give us that option.
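To make the contrast concrete: with the feature bit, the whole bypass
machinery can be gated on a single probe-time check in the guest
driver, along these lines (a sketch only; virtnet_bypass_setup() is a
hypothetical helper, not something from the current patches):

    /* Only set up the bypass machinery when the hypervisor
     * advertises VIRTIO_NET_F_BACKUP; otherwise the device keeps
     * behaving as plain virtio_net, with no guest configuration.
     */
    if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
        virtnet_bypass_setup(vi);   /* hypothetical helper */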

>>
>>What this comes down to for us is the following requirements:
>>1. The name of the interface cannot change when going from virtio_net
>>to virtio_net being bypassed using a VF. We cannot create an interface
>>on top of the interface, if anything we need to push the original
>>virtio_net out of the way so that the new team interface takes its
>>place in the configuration of the system. Otherwise a VM with VF w/
>>live migration will require a different configuration than one that
>>just runs virtio_net.
>
> Team driver netdev is still the same, no name changes.

Right. Basically we need to have the renaming occur so that any
existing config gets moved to the upper interface instead of having to
rely on configuration being adjusted for the team interface.

>>2. We need some way to signal if this VM should be running in an
>>"upgraded" mode or not. We have been using the backup bit in
>>virtio_net to do that. If it isn't "upgraded" then we don't need the
>>team/bond and we can just run with virtio_net.
>
> I don't see why the team cannot be there always.

It is more of a logistical nightmare. Part of the goal here was to work
with the cloud base images that are out there, such as
https://alt.fedoraproject.org/cloud/. With just the kernel changes the
overhead for this stays fairly small and would be pulled in as a
standard part of the kernel update process. The virtio bypass only
pops up if the backup bit is present. The team solution requires that
the base image use the team driver on any virtio_net it sees. I doubt
the OSVs would want to do that, since SR-IOV isn't that popular a case.

>>3. We cannot introduce any downtime on the interface when adding a VF
>>or removing it. The link must stay up the entire time and be able to
>>handle packets.
>
> Sure. That should be handled by the team. Whenever the VF netdev
> disappears, traffic would switch over to the virtio_net. The benefit of
> your in-driver bonding solution is that qemu can actually signal the
> guest driver that the disappearance would happen and do the switch a bit
> earlier. But that is something that might be implemented in a different
> channel where the kernel might get a notification that a certain pci
> device is going to disappear, so everyone could prepare. Just an idea.

The signaling isn't too much of an issue, since we can just tweak the
link state of the VF or virtio manually to report link up or down
prior to the hot-plug. Now that we are on the same page about the team0
interface always being there, I don't think requirement 3 is much of a
concern with the solution as you proposed.
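For example, the host can force the link state before the hot-unplug
(illustrative names only; enp5s0f0 is the PF and vf 0 the passthrough
function):

    # host side: report link down on the VF via its PF
    ip link set enp5s0f0 vf 0 state disable
    # the virtio link can likewise be flipped from the QEMU monitor:
    # (qemu) set_link <virtio-nic-id> off

That way the guest switches datapaths before the device actually goes
away.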


* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-17 17:12     ` [virtio-dev] " Alexander Duyck
@ 2018-02-21 23:50       ` Siwei Liu
  -1 siblings, 0 replies; 121+ messages in thread
From: Siwei Liu @ 2018-02-21 23:50 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Sridhar Samudrala, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang

I haven't checked email for days and did not realize the new revision
had already come out. Thank you for the effort; this revision really
looks like a step forward towards our use case and is close to what we
wanted to do. A few questions inline.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>> Patch 2 is in response to the community request for a 3 netdev
>>> solution.  However, it creates some issues we'll get into in a moment.
>>> It extends virtio_net to use alternate datapath when available and
>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>> an additional 'bypass' netdev that acts as a master device and controls
>>> 2 slave devices.  The original virtio_net netdev is registered as
>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>> associated with the same 'pci' device.  The user accesses the network
>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>> as default for transmits when it is available with link up and running.
>>
>> Thank you for doing this.
>>
>>> We noticed a couple of issues with this approach during testing.
>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>   virtio pci device, udev tries to rename both of them with the same name
>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>   to rename the 2 netdevs is not reliable.
>>
>> Out of curiosity - why do you link the master netdev to the virtio
>> struct device?
>
> The basic idea of all this is that we wanted this to work with an
> existing VM image that was using virtio. As such we were trying to
> make it so that the bypass interface takes the place of the original
> virtio and get udev to rename the bypass to what the original
> virtio_net was.

Could it also be made possible to take over the config from the VF
instead of virtio on an existing VM image, and get udev to rename the
bypass netdev to what the original VF was? I'm not saying we should
tightly bind the bypass master to only virtio or only the VF; I think
we should provide both options to support different upgrade paths.
Possibly we could tweak the device tree layout to reuse the same PCI
slot for the master bypass netdev, such that udev would not get
confused when renaming the device. The VF would need to use a
different function slot afterwards. Perhaps we might need a special
multiseat-like QEMU device for that purpose?

In our case we'll upgrade the config from VF to virtio-bypass directly.

>
>> FWIW two solutions that immediately come to mind are to export "backup"
>> as phys_port_name of the backup virtio link and/or assign a name to the
>> master like you are doing already.  I think team uses team%d and bond
>> uses bond%d, soft naming of master devices seems quite natural in this
>> case.
>
> I figured I had overlooked something like that. Thanks for pointing
> this out. Okay so I think the phys_port_name approach might resolve
> the original issue. If I am reading things correctly what we end up
> with is the master showing up as "ens1" for example and the backup
> showing up as "ens1nbackup". Am I understanding that right?
>
> The problem with the team/bond%d approach is that it creates a new
> netdevice and so it would require guest configuration changes.
>
>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>> link is quite neat.
>
> I agree. For non-"backup" virtio_net devices would it be okay for us to
> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> behavior could be maintained although the function still exists.
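(For illustration, the ndo being discussed would look roughly like the
sketch below; the vi->backup flag is an assumption here, not something
taken from the current patches.)

    static int virtnet_get_phys_port_name(struct net_device *dev,
                                          char *name, size_t len)
    {
        struct virtnet_info *vi = netdev_priv(dev);

        /* preserve legacy behavior for non-"backup" devices */
        if (!vi->backup)
            return -EOPNOTSUPP;

        if (snprintf(name, len, "backup") >= len)
            return -EOPNOTSUPP;
        return 0;
    }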
>
>>> - When the 'active' netdev is unplugged OR not present on a destination
>>>   system after live migration, the user will see 2 virtio_net netdevs.
>>
>> That's necessary and expected, all configuration applies to the master
>> so master must exist.
>
> With the naming issue resolved this is the only item left outstanding.
> This becomes a matter of form vs function.
>
> The main complaint about the "3 netdev" solution is a bit confusing to
> have the 2 netdevs present if the VF isn't there. The idea is that
> having the extra "master" netdev there if there isn't really a bond is
> a bit ugly.

Is it uglier in terms of user experience or in terms of
functionality? I don't want it dynamically changing between 2-netdev
and 3-netdev depending on the presence of the VF. That gets back to my
original question and suggestion earlier: why not just hide the lower
netdevs from udev renaming and such? What important observability
benefits would users get from exposing the lower netdevs?

Thanks,
-Siwei

>
> The downside of the "2 netdev" solution is that you have to deal with
> an extra layer of locking/queueing to get to the VF and you lose some
> functionality since things like in-driver XDP have to be disabled in
> order to maintain the same functionality when the VF is present or
> not. However it looks more like classic virtio_net when the VF is not
> present.


* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 23:50       ` [virtio-dev] " Siwei Liu
@ 2018-02-22  0:17         ` Alexander Duyck
  -1 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-22  0:17 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Jakub Kicinski, Sridhar Samudrala, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang

On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> I haven't checked email for days and did not realize the new revision
> had already come out. Thank you for the effort; this revision really
> looks like a step forward towards our use case and is close to what we
> wanted to do. A few questions inline.
>
> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>> Patch 2 is in response to the community request for a 3 netdev
>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>> It extends virtio_net to use alternate datapath when available and
>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>> associated with the same 'pci' device.  The user accesses the network
>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>> as default for transmits when it is available with link up and running.
>>>
>>> Thank you for doing this.
>>>
>>>> We noticed a couple of issues with this approach during testing.
>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>   virtio pci device, udev tries to rename both of them with the same name
>>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>   to rename the 2 netdevs is not reliable.
>>>
>>> Out of curiosity - why do you link the master netdev to the virtio
>>> struct device?
>>
>> The basic idea of all this is that we wanted this to work with an
>> existing VM image that was using virtio. As such we were trying to
>> make it so that the bypass interface takes the place of the original
>> virtio and get udev to rename the bypass to what the original
>> virtio_net was.
>
> Could it also be made possible to take over the config from the VF
> instead of virtio on an existing VM image, and get udev to rename the
> bypass netdev to what the original VF was? I'm not saying we should
> tightly bind the bypass master to only virtio or only the VF; I think
> we should provide both options to support different upgrade paths.
> Possibly we could tweak the device tree layout to reuse the same PCI
> slot for the master bypass netdev, such that udev would not get
> confused when renaming the device. The VF would need to use a
> different function slot afterwards. Perhaps we might need a special
> multiseat-like QEMU device for that purpose?
>
> In our case we'll upgrade the config from VF to virtio-bypass directly.

So if I am understanding what you are saying, you want to flip the
backup interface from the virtio to a VF. The problem is that this
becomes a bit of a vendor lock-in solution, since it would rely on a
specific VF driver. I would agree with Jiri that we don't want to go
down that path. We don't want every VF out there firing up its own
separate bond. Ideally you want the hypervisor to be able to manage
all of this, which is why it makes sense to have virtio manage this
and why this is associated with the virtio_net interface.

The other bits get into more complexity than we are ready to handle
for now. I think I might have talked about something similar that I
was referring to as a "virtio-bond", where you would have a PCI/PCIe
tree topology that makes this easier to sort out, and the "virtio-bond"
would be used to handle coordination/configuration of a much more
complex interface.

>>
>>> FWIW two solutions that immediately come to mind are to export "backup"
>>> as phys_port_name of the backup virtio link and/or assign a name to the
>>> master like you are doing already.  I think team uses team%d and bond
>>> uses bond%d, soft naming of master devices seems quite natural in this
>>> case.
>>
>> I figured I had overlooked something like that. Thanks for pointing
>> this out. Okay so I think the phys_port_name approach might resolve
>> the original issue. If I am reading things correctly what we end up
>> with is the master showing up as "ens1" for example and the backup
>> showing up as "ens1nbackup". Am I understanding that right?
>>
>> The problem with the team/bond%d approach is that it creates a new
>> netdevice and so it would require guest configuration changes.
>>
>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>> link is quite neat.
>>
>> I agree. For non-"backup" virtio_net devices would it be okay for us to
>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>> behavior could be maintained although the function still exists.
>>
>>>> - When the 'active' netdev is unplugged OR not present on a destination
>>>>   system after live migration, the user will see 2 virtio_net netdevs.
>>>
>>> That's necessary and expected, all configuration applies to the master
>>> so master must exist.
>>
>> With the naming issue resolved this is the only item left outstanding.
>> This becomes a matter of form vs function.
>>
>> The main complaint about the "3 netdev" solution is a bit confusing to
>> have the 2 netdevs present if the VF isn't there. The idea is that
>> having the extra "master" netdev there if there isn't really a bond is
>> a bit ugly.
>
> Is it uglier in terms of user experience or in terms of
> functionality? I don't want it dynamically changing between 2-netdev
> and 3-netdev depending on the presence of the VF. That gets back to my
> original question and suggestion earlier: why not just hide the lower
> netdevs from udev renaming and such? What important observability
> benefits would users get from exposing the lower netdevs?
>
> Thanks,
> -Siwei

The only real advantage to a 2 netdev solution is that it looks like
the netvsc solution; however, it doesn't behave like it, since there
are some features, like XDP, that may not function correctly if they
are left enabled in the virtio_net interface.

As far as functionality goes, the advantage of not hiding the lower
devices is that they are free to be managed. The problem with pushing
all of the configuration into the upper device is that you are limited
to the intersection of the features of the lower devices. This can be
limiting for some setups, as some VFs support things like more queues
or better interrupt moderation options than others, so trying to make
everything work with one config would be ugly.
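To illustrate the intersection problem, an upper device ends up
computing something like the following whenever its slaves change
(a sketch; the function name and device pointers are illustrative and
assumed valid):

    static void bypass_compute_features(struct net_device *upper,
                                        struct net_device *active,
                                        struct net_device *backup)
    {
        /* the upper device can only safely advertise what *all* of
         * its lower devices support */
        upper->features = netdev_intersect_features(active->features,
                                                    backup->features);
        netdev_change_features(upper);  /* propagate the update */
    }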

- Alex


* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22  0:17         ` [virtio-dev] " Alexander Duyck
@ 2018-02-22  1:59           ` Siwei Liu
  -1 siblings, 0 replies; 121+ messages in thread
From: Siwei Liu @ 2018-02-22  1:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Sridhar Samudrala, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang

On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> I haven't checked email for days and did not realize the new revision
>> had already come out. Thank you for the effort; this revision really
>> looks like a step forward towards our use case and is close to what we
>> wanted to do. A few questions inline.
>>
>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>> Patch 2 is in response to the community request for a 3 netdev
>>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>>> It extends virtio_net to use alternate datapath when available and
>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>> as default for transmits when it is available with link up and running.
>>>>
>>>> Thank you for doing this.
>>>>
>>>>> We noticed a couple of issues with this approach during testing.
>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>   virtio pci device, udev tries to rename both of them with the same name
>>>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>>   to rename the 2 netdevs is not reliable.
>>>>
>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>> struct device?
>>>
>>> The basic idea of all this is that we wanted this to work with an
>>> existing VM image that was using virtio. As such we were trying to
>>> make it so that the bypass interface takes the place of the original
>>> virtio and get udev to rename the bypass to what the original
>>> virtio_net was.
>>
>> Could it also be made possible to take over the config from the VF
>> instead of virtio on an existing VM image, and get udev to rename the
>> bypass netdev to what the original VF was? I'm not saying we should
>> tightly bind the bypass master to only virtio or only the VF; I think
>> we should provide both options to support different upgrade paths.
>> Possibly we could tweak the device tree layout to reuse the same PCI
>> slot for the master bypass netdev, such that udev would not get
>> confused when renaming the device. The VF would need to use a
>> different function slot afterwards. Perhaps we might need a special
>> multiseat-like QEMU device for that purpose?
>>
>> In our case we'll upgrade the config from VF to virtio-bypass directly.
>
> So if I am understanding what you are saying, you want to flip the
> backup interface from the virtio to a VF. The problem is that this
> becomes a bit of a vendor lock-in solution, since it would rely on a
> specific VF driver. I would agree with Jiri that we don't want to go
> down that path. We don't want every VF out there firing up its own
> separate bond. Ideally you want the hypervisor to be able to manage
> all of this, which is why it makes sense to have virtio manage this
> and why this is associated with the virtio_net interface.

No, that's not what I was talking about, of course. I thought you
mentioned that the upgrade scenario this patch would like to address
is to use the bypass interface "to take the place of the original
virtio, and get udev to rename the bypass to what the original
virtio_net was". That is one of the possible upgrade paths for sure.
However, the upgrade path I was seeking is to use the bypass interface
to take the place of the original VF interface while retaining the
name and network configs, which generally can be done simply with a
kernel upgrade. It becomes limiting that this patch makes the bypass
interface share the same virtio pci device with the virtio backup. Can
this bypass interface be made general enough to take the place of any
pci device other than virtio-net? That would be more helpful, as cloud
users who have an existing setup on the VF interface wouldn't have to
recreate it on virtio-net and the VF separately again.

>
> The other bits get into more complexity than we are ready to handle
> for now. I think I might have talked about something similar that I
> was referring to as a "virtio-bond", where you would have a PCI/PCIe
> tree topology that makes this easier to sort out, and the "virtio-bond"
> would be used to handle coordination/configuration of a much more
> complex interface.

That was one way to solve this problem, but I'd like to see simpler
ways to sort it out.

>
>>>
>>>> FWIW two solutions that immediately come to mind are to export "backup"
>>>> as phys_port_name of the backup virtio link and/or assign a name to the
>>>> master like you are doing already.  I think team uses team%d and bond
>>>> uses bond%d, soft naming of master devices seems quite natural in this
>>>> case.
>>>
>>> I figured I had overlooked something like that. Thanks for pointing
>>> this out. Okay so I think the phys_port_name approach might resolve
>>> the original issue. If I am reading things correctly what we end up
>>> with is the master showing up as "ens1" for example and the backup
>>> showing up as "ens1nbackup". Am I understanding that right?
>>>
>>> The problem with the team/bond%d approach is that it creates a new
>>> netdevice and so it would require guest configuration changes.
>>>
>>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>>> link is quite neat.
>>>
>>> I agree. For non-"backup" virtio_net devices would it be okay for us to
>>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>>> behavior could be maintained although the function still exists.
>>>
>>>>> - When the 'active' netdev is unplugged OR not present on a destination
>>>>>   system after live migration, the user will see 2 virtio_net netdevs.
>>>>
>>>> That's necessary and expected, all configuration applies to the master
>>>> so master must exist.
>>>
>>> With the naming issue resolved this is the only item left outstanding.
>>> This becomes a matter of form vs function.
>>>
>>> The main complaint about the "3 netdev" solution is a bit confusing to
>>> have the 2 netdevs present if the VF isn't there. The idea is that
>>> having the extra "master" netdev there if there isn't really a bond is
>>> a bit ugly.
>>
>> Is it uglier in terms of user experience or in terms of
>> functionality? I don't want it dynamically changing between 2-netdev
>> and 3-netdev depending on the presence of the VF. That gets back to my
>> original question and suggestion earlier: why not just hide the lower
>> netdevs from udev renaming and such? What important observability
>> benefits would users get from exposing the lower netdevs?
>>
>> Thanks,
>> -Siwei
>
> The only real advantage to a 2 netdev solution is that it looks like
> the netvsc solution; however, it doesn't behave like it, since there
> are some features, like XDP, that may not function correctly if they
> are left enabled in the virtio_net interface.
>
> As far as functionality goes, the advantage of not hiding the lower
> devices is that they are free to be managed. The problem with pushing
> all of the configuration into the upper device is that you are limited
> to the intersection of the features of the lower devices. This can be
> limiting for some setups, as some VFs support things like more queues
> or better interrupt moderation options than others, so trying to make
> everything work with one config would be ugly.

It depends on how you build it and the way you expect it to work. IMHO
the lower devices don't need to be directly managed at all; otherwise
we end up with loss of configuration across migration, and it really
does not bring much more value than having a general team or bond
device. Users still have to reconfigure those queue settings and
interrupt moderation options after all. The new upper device could
assume that the VF/PT lower device always has a superior feature set
to virtio-net in order to apply advanced configuration. The upper
device should remember all configuration previously done and apply the
supported parts to the active device automatically when switching the
datapath.
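As a sketch of that "remember and replay" idea (the struct and
function names are hypothetical, and only two of the cached settings
are shown):

    /* hypothetical cache of guest-visible config kept on the upper dev */
    struct bypass_saved_cfg {
        struct ethtool_ringparam ring;
        struct ethtool_coalesce coal;
    };

    /* replay whatever the newly active slave supports when switching
     * the datapath; settings it does not implement are skipped */
    static void bypass_replay_cfg(struct net_device *slave,
                                  struct bypass_saved_cfg *cfg)
    {
        const struct ethtool_ops *ops = slave->ethtool_ops;

        if (ops && ops->set_ringparam)
            ops->set_ringparam(slave, &cfg->ring);
        if (ops && ops->set_coalesce)
            ops->set_coalesce(slave, &cfg->coal);
    }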

Regards,
-Siwei

>
> - Alex

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22  0:17         ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-02-22  1:59         ` Siwei Liu
  -1 siblings, 0 replies; 121+ messages in thread
From: Siwei Liu @ 2018-02-22  1:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Netdev,
	David Miller

On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> I haven't checked emails for days and did not realize the new revision
>> had already came out. And thank you for the effort, this revision
>> really looks to be a step forward towards our use case and is close to
>> what we wanted to do. A few questions in line.
>>
>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>> Ppatch 2 is in response to the community request for a 3 netdev
>>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>>> It extends virtio_net to use alternate datapath when available and
>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>> as default for transmits when it is available with link up and running.
>>>>
>>>> Thank you do doing this.
>>>>
>>>>> We noticed a couple of issues with this approach during testing.
>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>   virtio pci device, udev tries to rename both of them with the same name
>>>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>>   to rename the 2 netdevs is not reliable.
>>>>
>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>> struct device?
>>>
>>> The basic idea of all this is that we wanted this to work with an
>>> existing VM image that was using virtio. As such we were trying to
>>> make it so that the bypass interface takes the place of the original
>>> virtio and get udev to rename the bypass to what the original
>>> virtio_net was.
>>
>> Could it made it also possible to take over the config from VF instead
>> of virtio on an existing VM image? And get udev rename the bypass
>> netdev to what the original VF was. I don't say tightly binding the
>> bypass master to only virtio or VF, but I think we should provide both
>> options to support different upgrade paths. Possibly we could tweak
>> the device tree layout to reuse the same PCI slot for the master
>> bypass netdev, such that udev would not get confused when renaming the
>> device. The VF needs to use a different function slot afterwards.
>> Perhaps we might need to a special multiseat like QEMU device for that
>> purpose?
>>
>> Our case we'll upgrade the config from VF to virtio-bypass directly.
>
> So if I am understanding what you are saying you are wanting to flip
> the backup interface from the virtio to a VF. The problem is that
> becomes a bit of a vendor lock-in solution since it would rely on a
> specific VF driver. I would agree with Jiri that we don't want to go
> down that path. We don't want every VF out there firing up its own
> separate bond. Ideally you want the hypervisor to be able to manage
> all of this which is why it makes sense to have virtio manage this and
> why this is associated with the virtio_net interface.

No, that's not what I was talking about, of course. I thought you
mentioned that the upgrade scenario this patch would like to address is
to use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However, the
upgrade path I was seeking is to use the bypass interface to take the
place of the original VF interface while retaining the name and network
configs, which generally can be done simply with a kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with the virtio backup. Can this bypass
interface be made general enough to take the place of any pci device
other than virtio-net? This would be more helpful, as cloud users who
have an existing setup on the VF interface wouldn't have to recreate it
on virtio-net and the VF separately again.

>
> The other bits get into more complexity than we are ready to handle
> for now. I think I might have talked about something similar that I
> was referring to as a "virtio-bond", where you would have a PCI/PCIe
> tree topology that makes this easier to sort out, and the "virtio-bond"
> would be used to handle coordination/configuration of a much more
> complex interface.

That was one way to solve this problem but I'd like to see simple ways
to sort it out.

>
>>>
>>>> FWIW two solutions that immediately come to mind are to export "backup"
>>>> as phys_port_name of the backup virtio link and/or assign a name to the
>>>> master like you are doing already.  I think team uses team%d and bond
>>>> uses bond%d, soft naming of master devices seems quite natural in this
>>>> case.
>>>
>>> I figured I had overlooked something like that... Thanks for pointing
>>> this out. Okay so I think the phys_port_name approach might resolve
>>> the original issue. If I am reading things correctly what we end up
>>> with is the master showing up as "ens1" for example and the backup
>>> showing up as "ens1nbackup". Am I understanding that right?
>>>
>>> The problem with the team/bond%d approach is that it creates a new
>>> netdevice and so it would require guest configuration changes.
>>>
>>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>>> link is quite neat.
>>>
>>> I agree. For non-"backup" virtio_net devices, would it be okay for us to
>>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>>> behavior could be maintained although the function still exists.
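
A minimal sketch of how that could look; "vi->backup" is a hypothetical
flag standing in for however the driver records that the BACKUP feature
was negotiated, so this is illustrative rather than the actual patch code:

#include <linux/netdevice.h>

/* Report phys_port_name == "backup" only on a BACKUP-capable virtio
 * link; legacy devices keep returning -EOPNOTSUPP as before.
 */
static int virtnet_get_phys_port_name(struct net_device *dev,
                                      char *buf, size_t len)
{
        struct virtnet_info *vi = netdev_priv(dev); /* driver private data */

        if (!vi->backup)
                return -EOPNOTSUPP;

        if (snprintf(buf, len, "backup") >= len)
                return -EOPNOTSUPP;

        return 0;
}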
>>>
>>>>> - When the 'active' netdev is unplugged OR not present on a destination
>>>>>   system after live migration, the user will see 2 virtio_net netdevs.
>>>>
>>>> That's necessary and expected, all configuration applies to the master
>>>> so master must exist.
>>>
>>> With the naming issue resolved this is the only item left outstanding.
>>> This becomes a matter of form vs function.
>>>
>>> The main complaint about the "3 netdev" solution is that it is a bit
>>> confusing to have the 2 netdevs present if the VF isn't there. The
>>> idea is that having the extra "master" netdev there if there isn't
>>> really a bond is a bit ugly.
>>
>> Is it uglier in terms of user experience rather than functionality? I
>> don't want it dynamically changed between 2-netdev and 3-netdev
>> depending on the presence of the VF. That gets back to my original
>> question and suggestion earlier: why not just hide the lower netdevs
>> from udev renaming and such? What important observability benefits
>> would users get from exposing the lower netdevs?
>>
>> Thanks,
>> -Siwei
>
> The only real advantage to a 2 netdev solution is that it looks like
> the netvsc solution; however, it doesn't behave like it, since there
> are some features like XDP that may not function correctly if they are
> left enabled in the virtio_net interface.
>
> As far as functionality goes, the advantage of not hiding the lower
> devices is that they are free to be managed. The problem with pushing
> all of the configuration into the upper device is that you are limited
> to the intersection of the features of the lower devices. This can be
> limiting for some setups, as some VFs support things like more queues
> or better interrupt moderation options than others, so trying to make
> everything work with one config would be ugly.
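
To make the intersection constraint concrete, here is a sketch using
the kernel's existing netdev_intersect_features() helper; the
virtnet_bypass_info structure and its field names are invented for
illustration:

#include <linux/netdevice.h>

/* Constrain the upper device's feature set to what both slaves can
 * do; anything only one slave supports gets masked off.
 */
static netdev_features_t bypass_fix_features(struct net_device *dev,
                                             netdev_features_t features)
{
        struct virtnet_bypass_info *vbi = netdev_priv(dev);

        if (vbi->active_netdev)
                features = netdev_intersect_features(features,
                                        vbi->active_netdev->features);
        if (vbi->backup_netdev)
                features = netdev_intersect_features(features,
                                        vbi->backup_netdev->features);
        return features;
}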

It depends on how you build it and the way you expect it to work. IMHO
the lower devices don't need to be directly managed at all; otherwise
it ends up with loss of configuration across migration, and it really
does not bring much more value than having a general team or bond
device. Users still have to reconfigure those queue settings and
interrupt moderation options after all. The new upper device could
assume that the VF/PT lower device always has a superior feature set
to virtio-net in order to apply advanced configuration. The upper
device should remember all configurations previously done and apply
the supported ones to the active device automatically when switching
the datapath.
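
A rough sketch of that remember-and-replay idea; the structure and
helper names below are invented for illustration, not taken from the
patches:

#include <linux/ethtool.h>
#include <linux/netdevice.h>

struct virtnet_bypass_info {
        struct net_device *active_netdev;   /* VF/PT slave, may be NULL */
        struct ethtool_coalesce saved_coal; /* last user-requested settings */
        bool coal_valid;
};

/* Upper-device ethtool op: record the request, then apply it to the
 * active slave if one is present.
 */
static int bypass_set_coalesce(struct net_device *dev,
                               struct ethtool_coalesce *ec)
{
        struct virtnet_bypass_info *vbi = netdev_priv(dev);
        struct net_device *slave = vbi->active_netdev;

        vbi->saved_coal = *ec;
        vbi->coal_valid = true;

        if (slave && slave->ethtool_ops && slave->ethtool_ops->set_coalesce)
                return slave->ethtool_ops->set_coalesce(slave, ec);
        return 0;
}

/* Called when a (new) active slave registers, e.g. after migration,
 * so the remembered configuration follows the datapath switch.
 */
static void bypass_replay_coalesce(struct virtnet_bypass_info *vbi)
{
        struct net_device *slave = vbi->active_netdev;

        if (vbi->coal_valid && slave && slave->ethtool_ops &&
            slave->ethtool_ops->set_coalesce)
                slave->ethtool_ops->set_coalesce(slave, &vbi->saved_coal);
}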

Regards,
-Siwei

>
> - Alex

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 20:57                             ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-02-22  2:02                             ` Jakub Kicinski
  2018-02-22  2:15                                 ` [virtio-dev] " Samudrala, Sridhar
  2018-02-22  2:15                               ` Samudrala, Sridhar
  -1 siblings, 2 replies; 121+ messages in thread
From: Jakub Kicinski @ 2018-02-22  2:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Samudrala, Sridhar, virtualization, Siwei Liu, Netdev,
	David Miller

On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote:
> > I don't see why the team cannot be there always.  
> 
> It is more the logistical nightmare. Part of the goal here was to work
> with the cloud base images that are out there such as
> https://alt.fedoraproject.org/cloud/. With just the kernel changes the
> overhead for this stays fairly small and would be pulled in as just a
> standard part of the kernel update process. The virtio bypass only
> pops up if the backup bit is present. With the team solution it
> requires that the base image use the team driver on virtio_net when it
> sees one. I doubt the OSVs would want to do that just because SR-IOV
> isn't that popular of a case.

IIUC we need to monitor for a "backup hint", spawn the master, rename it
to maintain backwards compatibility with no-VF setups and enslave the VF
if it appears.

All those sound possible from user space, the advantage of the kernel
solution right now is that it has more complete code.

Am I misunderstanding?

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22  2:02                             ` Jakub Kicinski
@ 2018-02-22  2:15                                 ` Samudrala, Sridhar
  2018-02-22  2:15                               ` Samudrala, Sridhar
  1 sibling, 0 replies; 121+ messages in thread
From: Samudrala, Sridhar @ 2018-02-22  2:15 UTC (permalink / raw)
  To: Jakub Kicinski, Alexander Duyck
  Cc: Jiri Pirko, Michael S. Tsirkin, Stephen Hemminger, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jason Wang, Siwei Liu

On 2/21/2018 6:02 PM, Jakub Kicinski wrote:
> On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote:
>>> I don't see why the team cannot be there always.
>> It is more the logistical nightmare. Part of the goal here was to work
>> with the cloud base images that are out there such as
>> https://alt.fedoraproject.org/cloud/. With just the kernel changes the
>> overhead for this stays fairly small and would be pulled in as just a
>> standard part of the kernel update process. The virtio bypass only
>> pops up if the backup bit is present. With the team solution it
>> requires that the base image use the team driver on virtio_net when it
>> sees one. I doubt the OSVs would want to do that just because SR-IOV
>> isn't that popular of a case.
> IIUC we need to monitor for a "backup hint", spawn the master, rename it
> to maintain backwards compatibility with no-VF setups and enslave the VF
> if it appears.
>
> All those sound possible from user space, the advantage of the kernel
> solution right now is that it has more complete code.
>
> Am I misunderstanding?

I think there is some misunderstanding about the exact requirement and
the usecase we are trying to solve.  If the Guest is allowed to do this
configuration, we already have a solution with either bond- or
team-based user space configuration.

This is to enable cloud service providers to provide an accelerated
datapath by simply letting tenants use their own images, with the only
requirement being to enable their kernels with the newer virtio_net
driver with BACKUP support and the VF driver.

To recap from an earlier thread, here is a response from Stephen that
talks about the requirement for the netvsc solution; we would like to
provide a similar solution for KVM-based cloud deployments.

 > The requirement with Azure accelerated network was that a stock
 > distribution image from the store must be able to run unmodified and
 > get accelerated networking. Not sure if other environments need to
 > work the same, but it would be nice. That meant no additional setup
 > scripts (aka no bonding) and also it must work transparently with
 > hot-plug. Also there is a diverse set of environments: openstack,
 > cloudinit, network manager and systemd. The solution had to not
 > depend on any one of them, but also not break any of them.

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22  1:59           ` [virtio-dev] " Siwei Liu
@ 2018-02-22  2:35             ` Samudrala, Sridhar
  -1 siblings, 0 replies; 121+ messages in thread
From: Samudrala, Sridhar @ 2018-02-22  2:35 UTC (permalink / raw)
  To: Siwei Liu, Alexander Duyck
  Cc: Jakub Kicinski, Michael S. Tsirkin, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jason Wang

On 2/21/2018 5:59 PM, Siwei Liu wrote:
> On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>>> I haven't checked emails for days and did not realize the new revision
>>> had already come out. And thank you for the effort, this revision
>>> really looks to be a step forward towards our use case and is close to
>>> what we wanted to do. A few questions in line.
>>>
>>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>>> <alexander.duyck@gmail.com> wrote:
>>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
>>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>>> Patch 2 is in response to the community request for a 3 netdev
>>>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>>>> It extends virtio_net to use alternate datapath when available and
>>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>>> as default for transmits when it is available with link up and running.
>>>>> Thank you for doing this.
>>>>>
>>>>>> We noticed a couple of issues with this approach during testing.
>>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>>    virtio pci device, udev tries to rename both of them with the same name
>>>>>>    and the 2nd rename will fail. This would be OK as long as the first netdev
>>>>>>    to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>>>    to rename the 2 netdevs is not reliable.
>>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>>> struct device?
>>>> The basic idea of all this is that we wanted this to work with an
>>>> existing VM image that was using virtio. As such we were trying to
>>>> make it so that the bypass interface takes the place of the original
>>>> virtio and get udev to rename the bypass to what the original
>>>> virtio_net was.
>>> Could it also be made possible to take over the config from the VF
>>> instead of virtio on an existing VM image? And get udev to rename the
>>> bypass netdev to what the original VF was. I'm not saying to tightly
>>> bind the bypass master to only virtio or the VF, but I think we should
>>> provide both options to support different upgrade paths. Possibly we
>>> could tweak the device tree layout to reuse the same PCI slot for the
>>> master bypass netdev, such that udev would not get confused when
>>> renaming the device. The VF needs to use a different function slot
>>> afterwards. Perhaps we might need a special multiseat-like QEMU device
>>> for that purpose?
>>>
>>> In our case we'll upgrade the config from the VF to virtio-bypass
>>> directly.
>> So if I am understanding what you are saying, you want to flip
>> the backup interface from the virtio to a VF. The problem is that this
>> becomes a bit of a vendor lock-in solution since it would rely on a
>> specific VF driver. I would agree with Jiri that we don't want to go
>> down that path. We don't want every VF out there firing up its own
>> separate bond. Ideally you want the hypervisor to be able to manage
>> all of this which is why it makes sense to have virtio manage this and
>> why this is associated with the virtio_net interface.
> No, that's not what I was talking about, of course. I thought you
> mentioned that the upgrade scenario this patch would like to address is
> to use the bypass interface "to take the place of the original virtio,
> and get udev to rename the bypass to what the original virtio_net
> was". That is one of the possible upgrade paths for sure. However, the
> upgrade path I was seeking is to use the bypass interface to take the
> place of the original VF interface while retaining the name and network
> configs, which generally can be done simply with a kernel upgrade. It
> would become limiting as this patch makes the bypass interface share
> the same virtio pci device with the virtio backup. Can this bypass
> interface be made general enough to take the place of any pci device
> other than virtio-net? This would be more helpful, as cloud users who
> have an existing setup on the VF interface wouldn't have to recreate it
> on virtio-net and the VF separately again.

Yes. This sounds interesting. Looks like you want an existing VM image
with a VF-only configuration to get transparent live migration support
by adding virtio_net with the BACKUP feature.  We may need another
feature bit to switch between these 2 options.
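
Roughly, the choice could be gated on feature negotiation; the second
bit's name below is made up to illustrate the idea:

#include <linux/virtio_config.h>
#include <uapi/linux/virtio_net.h>

/* VIRTIO_NET_F_BACKUP comes from patch 1 of this series;
 * VIRTIO_NET_F_BACKUP_TAKEOVER_VF would be a hypothetical second bit
 * telling the guest to take over the VF's identity instead of the
 * virtio one.
 */
static bool virtnet_wants_bypass(struct virtio_device *vdev)
{
        return virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP);
}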


>
>> The other bits get into more complexity than we are ready to handle
>> for now. I think I might have talked about something similar that I
>> was referring to as a "virtio-bond", where you would have a PCI/PCIe
>> tree topology that makes this easier to sort out, and the "virtio-bond"
>> would be used to handle coordination/configuration of a much more
>> complex interface.
> That was one way to solve this problem but I'd like to see simple ways
> to sort it out.
>
>>>>> FWIW two solutions that immediately come to mind are to export "backup"
>>>>> as phys_port_name of the backup virtio link and/or assign a name to the
>>>>> master like you are doing already.  I think team uses team%d and bond
>>>>> uses bond%d, soft naming of master devices seems quite natural in this
>>>>> case.
>>>> I figured I had overlooked something like that... Thanks for pointing
>>>> this out. Okay so I think the phys_port_name approach might resolve
>>>> the original issue. If I am reading things correctly what we end up
>>>> with is the master showing up as "ens1" for example and the backup
>>>> showing up as "ens1nbackup". Am I understanding that right?
>>>>
>>>> The problem with the team/bond%d approach is that it creates a new
>>>> netdevice and so it would require guest configuration changes.
>>>>
>>>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>>>> link is quite neat.
>>>> I agree. For non-"backup" virtio_net devices, would it be okay for us to
>>>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>>>> behavior could be maintained although the function still exists.
>>>>
>>>>>> - When the 'active' netdev is unplugged OR not present on a destination
>>>>>>    system after live migration, the user will see 2 virtio_net netdevs.
>>>>> That's necessary and expected, all configuration applies to the master
>>>>> so master must exist.
>>>> With the naming issue resolved this is the only item left outstanding.
>>>> This becomes a matter of form vs function.
>>>>
>>>> The main complaint about the "3 netdev" solution is that it is a bit
>>>> confusing to have the 2 netdevs present if the VF isn't there. The
>>>> idea is that having the extra "master" netdev there if there isn't
>>>> really a bond is a bit ugly.
>>> Is it uglier in terms of user experience rather than functionality? I
>>> don't want it dynamically changed between 2-netdev and 3-netdev
>>> depending on the presence of the VF. That gets back to my original
>>> question and suggestion earlier: why not just hide the lower netdevs
>>> from udev renaming and such? What important observability benefits
>>> would users get from exposing the lower netdevs?
>>>
>>> Thanks,
>>> -Siwei
>> The only real advantage to a 2 netdev solution is that it looks like
>> the netvsc solution; however, it doesn't behave like it, since there
>> are some features like XDP that may not function correctly if they are
>> left enabled in the virtio_net interface.
>>
>> As far as functionality goes, the advantage of not hiding the lower
>> devices is that they are free to be managed. The problem with pushing
>> all of the configuration into the upper device is that you are limited
>> to the intersection of the features of the lower devices. This can be
>> limiting for some setups, as some VFs support things like more queues
>> or better interrupt moderation options than others, so trying to make
>> everything work with one config would be ugly.
> It depends on how you build it and the way you expect it to work. IMHO
> the lower devices don't need to be directly managed at all; otherwise
> it ends up with loss of configuration across migration, and it really
> does not bring much more value than having a general team or bond
> device. Users still have to reconfigure those queue settings and
> interrupt moderation options after all. The new upper device could
> assume that the VF/PT lower device always has a superior feature set
> to virtio-net in order to apply advanced configuration. The upper
> device should remember all configurations previously done and apply
> the supported ones to the active device automatically when switching
> the datapath.
>
It should be possible to extend this patchset to support migration of
additional settings by enabling additional ndo_ops and ethtool_ops on
the upper dev, propagating them down to the lower devices, and
replaying the settings once the VF is replugged after migration.
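
As an example of what such propagation could look like, a hypothetical
sketch for one setting (MTU), with the structure and field names
invented for illustration:

#include <linux/netdevice.h>

/* Push an upper-device MTU change down to both slaves; the value kept
 * in dev->mtu is what would be replayed onto the VF when it is
 * replugged after migration.
 */
static int bypass_change_mtu(struct net_device *dev, int new_mtu)
{
        struct virtnet_bypass_info *vbi = netdev_priv(dev);
        int err;

        if (vbi->active_netdev) {
                err = dev_set_mtu(vbi->active_netdev, new_mtu);
                if (err)
                        return err;
        }
        if (vbi->backup_netdev) {
                err = dev_set_mtu(vbi->backup_netdev, new_mtu);
                if (err)
                        return err;
        }
        dev->mtu = new_mtu;
        return 0;
}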

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22  2:35             ` [virtio-dev] " Samudrala, Sridhar
@ 2018-02-22  3:28               ` Samudrala, Sridhar
  -1 siblings, 0 replies; 121+ messages in thread
From: Samudrala, Sridhar @ 2018-02-22  3:28 UTC (permalink / raw)
  To: Siwei Liu, Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Netdev, virtualization, David Miller

On 2/21/2018 6:35 PM, Samudrala, Sridhar wrote:
> On 2/21/2018 5:59 PM, Siwei Liu wrote:
>> On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>>>> I haven't checked emails for days and did not realize the new revision
>>>> had already come out. And thank you for the effort, this revision
>>>> really looks to be a step forward towards our use case and is close to
>>>> what we wanted to do. A few questions in line.
>>>>
>>>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>>>> <alexander.duyck@gmail.com> wrote:
>>>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl> 
>>>>> wrote:
>>>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>>>> Patch 2 is in response to the community request for a 3 netdev
>>>>>>> solution.  However, it creates some issues we'll get into in a 
>>>>>>> moment.
>>>>>>> It extends virtio_net to use alternate datapath when available and
>>>>>>> registered. When BACKUP feature is enabled, virtio_net driver 
>>>>>>> creates
>>>>>>> an additional 'bypass' netdev that acts as a master device and 
>>>>>>> controls
>>>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' 
>>>>>>> netdevs are
>>>>>>> associated with the same 'pci' device.  The user accesses the 
>>>>>>> network
>>>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 
>>>>>>> 'active' netdev
>>>>>>> as default for transmits when it is available with link up and 
>>>>>>> running.
>>>>>> Thank you for doing this.
>>>>>>
>>>>>>> We noticed a couple of issues with this approach during testing.
>>>>>>> - As both 'bypass' and 'backup' netdevs are associated with the 
>>>>>>> same
>>>>>>>    virtio pci device, udev tries to rename both of them with the 
>>>>>>> same name
>>>>>>>    and the 2nd rename will fail. This would be OK as long as the 
>>>>>>> first netdev
>>>>>>>    to be renamed is the 'bypass' netdev, but the order in which 
>>>>>>> udev gets
>>>>>>>    to rename the 2 netdevs is not reliable.
>>>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>>>> struct device?
>>>>> The basic idea of all this is that we wanted this to work with an
>>>>> existing VM image that was using virtio. As such we were trying to
>>>>> make it so that the bypass interface takes the place of the original
>>>>> virtio and get udev to rename the bypass to what the original
>>>>> virtio_net was.
>>>> Could it also be made possible to take over the config from the VF
>>>> instead of virtio on an existing VM image, and have udev rename the
>>>> bypass netdev to what the original VF was? I'm not saying we should
>>>> tightly bind the bypass master to only virtio or the VF, but I think
>>>> we should provide both options to support different upgrade paths.
>>>> Possibly we could tweak the device tree layout to reuse the same PCI
>>>> slot for the master bypass netdev, such that udev would not get
>>>> confused when renaming the device. The VF needs to use a different
>>>> function slot afterwards. Perhaps we might need a special
>>>> multiseat-like QEMU device for that purpose?
>>>>
>>>> In our case, we'll upgrade the config from the VF to virtio-bypass directly.
>>> So if I am understanding what you are saying you are wanting to flip
>>> the backup interface from the virtio to a VF. The problem is that
>>> becomes a bit of a vendor lock-in solution since it would rely on a
>>> specific VF driver. I would agree with Jiri that we don't want to go
>>> down that path. We don't want every VF out there firing up its own
>>> separate bond. Ideally you want the hypervisor to be able to manage
>>> all of this which is why it makes sense to have virtio manage this and
>>> why this is associated with the virtio_net interface.
>> No, that's not what I was talking about, of course. I thought you
>> mentioned the upgrade scenario this patch would like to address is to
>> use the bypass interface "to take the place of the original virtio,
>> and get udev to rename the bypass to what the original virtio_net
>> was". That is one of the possible upgrade paths for sure. However, the
>> upgrade path I was seeking is to use the bypass interface to take the
>> place of the original VF interface while retaining the name and network
>> configs, which generally can be done simply with a kernel upgrade. It
>> would become limiting as this patch makes the bypass interface share
>> the same virtio pci device with the virtio backup. Can this bypass
>> interface be made general enough to take the place of any pci device
>> other than virtio-net? This would be more helpful, as cloud users who
>> have an existing setup on a VF interface wouldn't have to recreate it
>> on virtio-net and the VF separately again.
>
> Yes. This sounds interesting. Looks like you want an existing VM image 
> with
> VF only configuration to get transparent live migration support by adding
> virtio_net with BACKUP feature.  We may need another feature bit to 
> switch
> between these 2 options.

After thinking some more, this may be more involved than adding a new
feature bit.  This requires a netdev created by virtio to take over the
name of a VF netdev associated with a PCI device that may not be plugged
in when the virtio driver is coming up. This definitely requires some new
messages exchanged across the virtio control queue to pass the PCI device
info of the VF netdev.
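
To make the idea concrete, something along these lines could be carried
over the control queue. The class number, command, struct name, and
layout below are all invented for illustration; no such control command
exists today:

#include <linux/types.h>

/* Hypothetical control command letting the hypervisor describe the PCI
 * device that will carry the VF whose netdev name the bypass should
 * take over, even if that device isn't plugged in yet.
 */
#define VIRTIO_NET_CTRL_BYPASS		6	/* invented class */
#define VIRTIO_NET_CTRL_BYPASS_VF_DEV	0	/* invented command */

struct virtio_net_ctrl_bypass_vf {
	__le16 pci_domain;	/* PCI segment of the (future) VF */
	__u8 bus;		/* bus number */
	__u8 devfn;		/* PCI_SLOT()/PCI_FUNC() encoded */
};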

>
>
>>
>>> The other bits get into more complexity then we are ready to handle
>>> for now. I think I might have talked about something similar that I
>>> was referring to as a "virtio-bond" where you would have a PCI/PCIe
>>> tree topology that makes this easier to sort out, and the "virtio-bond
>>> would be used to handle coordination/configuration of a much more
>>> complex interface.
>> That was one way to solve this problem but I'd like to see simple ways
>> to sort it out.
>>
>>>>>> FWIW two solutions that immediately come to mind is to export 
>>>>>> "backup"
>>>>>> as phys_port_name of the backup virtio link and/or assign a name 
>>>>>> to the
>>>>>> master like you are doing already.  I think team uses team%d and 
>>>>>> bond
>>>>>> uses bond%d, soft naming of master devices seems quite natural in 
>>>>>> this
>>>>>> case.
>>>>> I figured I had overlooked something like that.. Thanks for pointing
>>>>> this out. Okay so I think the phys_port_name approach might resolve
>>>>> the original issue. If I am reading things correctly what we end up
>>>>> with is the master showing up as "ens1" for example and the backup
>>>>> showing up as "ens1nbackup". Am I understanding that right?
>>>>>
>>>>> The problem with the team/bond%d approach is that it creates a new
>>>>> netdevice and so it would require guest configuration changes.
>>>>>
>>>>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>>>>> link is quite neat.
>>>>> I agree. For non-"backup" virtio_net devices would it be okay for
>>>>> us to
>>>>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>>>>> behavior could be maintained although the function still exists.
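
A rough sketch of such an ndo (not taken from the patchset; names follow
virtio_net conventions, and VIRTIO_NET_F_BACKUP is the bit proposed in
patch 1):

static int virtnet_get_phys_port_name(struct net_device *dev,
				      char *buf, size_t len)
{
	struct virtnet_info *vi = netdev_priv(dev);

	/* preserve legacy behaviour when BACKUP isn't negotiated */
	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
		return -EOPNOTSUPP;

	if (snprintf(buf, len, "backup") >= len)
		return -EOPNOTSUPP;
	return 0;
}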
>>>>>
>>>>>>> - When the 'active' netdev is unplugged OR not present on a 
>>>>>>> destination
>>>>>>>    system after live migration, the user will see 2 virtio_net 
>>>>>>> netdevs.
>>>>>> That's necessary and expected, all configuration applies to the 
>>>>>> master
>>>>>> so master must exist.
>>>>> With the naming issue resolved this is the only item left 
>>>>> outstanding.
>>>>> This becomes a matter of form vs function.
>>>>>
>>>>> The main complaint about the "3 netdev" solution is that it is a bit
>>>>> confusing to have the 2 netdevs present if the VF isn't there. The
>>>>> idea is that having the extra "master" netdev there if there isn't
>>>>> really a bond is a bit ugly.
>>>> Is it uglier in terms of user experience rather than
>>>> functionality? I don't want it dynamically changed between 2-netdev
>>>> and 3-netdev depending on the presence of the VF. That gets back to my
>>>> original question and suggestion earlier: why not just hide the lower
>>>> netdevs from udev renaming and such? What important observability
>>>> benefits do users get from exposing the lower netdevs?
>>>>
>>>> Thanks,
>>>> -Siwei
>>> The only real advantage to a 2 netdev solution is that it looks like
>>> the netvsc solution; however, it doesn't behave like it, since there
>>> are some features, like XDP, that may not function correctly if they
>>> are left enabled in the virtio_net interface.
>>>
>>> As far as functionality goes, the advantage of not hiding the lower
>>> devices is that they are free to be managed. The problem with pushing
>>> all of the configuration into the upper device is that you are limited
>>> to the intersection of the features of the lower devices. This can be
>>> limiting for some setups, as some VFs support things like more queues
>>> or better interrupt moderation options than others, so trying to make
>>> everything work with one config would be ugly.
>> It depends on how you build it and the way you expect it to work. IMHO
>> the lower devices don't need to be directly managed at all; otherwise
>> it ends up with loss of configuration across migration, and it really
>> does not bring much more value than having a general team or bond
>> device. Users still have to reconfigure those queue settings and
>> interrupt moderation options after all. The new upper device could
>> assume that the VF/PT lower device always has a superior feature set
>> to virtio-net in order to apply advanced configuration. The upper
>> device should remember all configurations previously done and apply
>> the supported ones to the active device automatically when switching
>> the datapath.
>>
> It should be possible to extend this patchset to support migration of
> additional settings by enabling additional ndo_ops and ethtool_ops on
> the upper dev, propagating them down to the lower devices, and replaying
> the settings after the VF is replugged following migration.
>
> Thanks
> Sridhar

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-21 20:57                             ` [virtio-dev] " Alexander Duyck
  (?)
  (?)
@ 2018-02-22  8:11                             ` Jiri Pirko
  2018-02-22 11:54                               ` Or Gerlitz
                                                 ` (2 more replies)
  -1 siblings, 3 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-22  8:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Samudrala, Sridhar, Michael S. Tsirkin,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.duyck@gmail.com wrote:
>On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.duyck@gmail.com wrote:
>>>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>>>>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>>>>> stuck with this ugly thing forever...
>>>>>>>>>>
>>>>>>>>>> Could you at least make some common code that is shared in between
>>>>>>>>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>>>>>>>>
>>>>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>>>>(virtio) bug-compatible with netvsc.
>>>>>>>>
>>>>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinion
>>>>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>>>>> and make the solution based on team/bond.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>>>>using NM.
>>>>>>>>
>>>>>>>> Can be done in NM, networkd or other network management tools.
>>>>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>>>>
>>>>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>>>>> and a half.
>>>>>>>>
>>>>>>>> You can just run teamd with config option "kidnap" like this:
>>>>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>>>>
>>>>>>>> Whenever teamd sees another netdev appear with the same mac as
>>>>>>>> its own, or sees another netdev change its mac to match,
>>>>>>>> it enslaves it.
>>>>>>>>
>>>>>>>> Here's the patch (quick and dirty):
>>>>>>>>
>>>>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>>>>
>>>>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>>>>
>>>>>>>So this doesn't really address the original problem we were trying to
>>>>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>>>>has to do with configuration. Specifically what our patch is
>>>>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>>>>upgrade their customer to SR-IOV support and live migration without
>>>>>>>requiring them to reconfigure their guest. So the general idea with
>>>>>>>our patch is to take a VM that is running with virtio_net only and
>>>>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>>>>that was originally configured for virtio only.
>>>>>>>
>>>>>>>The problem with your solution is we already have teaming and bonding
>>>>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>>>>That is all well and good as long as you are willing to keep around
>>>>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>>>>
>>>>>> You don't need 2 images. You need only one. The one with the team setup.
>>>>>> That's it. If another netdev with the same mac appears, teamd will
>>>>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>>>>> virtio_net.
>>>>>
>>>>>Isn't that going to cause the routing table to get messed up when we
>>>>>rearrange the netdevs? We don't want to have a significant disruption
>>>>>in traffic when we are adding/removing the VF. It seems like we would
>>>>>need to invalidate any entries that were configured for the virtio_net
>>>>>and reestablish them on the new team interface. Part of the criteria
>>>>>we have been working with is that we should be able to transition from
>>>>>having a VF to not or vice versa without seeing any significant
>>>>>disruption in the traffic.
>>>>
>>>> What? You have routes on the team netdev. virtio_net and VF are only
>>>> slaves. What are you talking about? I don't get it :/
>>>
>>>So let's walk through this by example. The general idea of the base case
>>>for all this is somebody starting with virtio_net, we will call the
>>>interface "ens1" for now. It comes up and is assigned a dhcp address
>>>and everything works as expected. Now in order to get better
>>>performance we want to add a VF "ens2", but we don't want a new IP
>>>address. Now if I understand correctly what will happen is that when
>>>"ens2" appears on the system teamd will then create a new team
>>>interface "team0". Before teamd can enslave ens1 it has to down the
>>
>> No, you don't understand that correctly.
>>
>> There is always ens1 and team0. ens1 is a slave of team0. team0 is the
>> interface to use, to set ip on etc.
>>
>> When ens2 appears, it gets enslaved to team0 as well.
>>
>>
>>>interface if I understand things correctly. This means that we have to
>>>disrupt network traffic in order for this to work.
>>>
>>>To give you an idea of where we were before this became about trying
>>>to do this in the team or bonding driver, we were debating a 2 netdev
>>>model versus a 3 netdev model. I will call out the model and the
>>>advantages/disadvantages of those below.
>>>
>>>2 Netdev model, "ens1", enslaves "ens2".
>>>- Requires dropping in-driver XDP in order to work (won't capture VF
>>>traffic otherwise)
>>>- VF takes performance hit for extra qdisc/Tx queue lock of virtio_net interface
>>>- If you ass-u-me (I haven't been a fan of this model if you can't
>>>tell) that it is okay to rip out in-driver XDP from virtio_net, then
>>>you could transition between base virtio, virtio w/ backup bit set.
>>>- Works for netvsc because they limit their features (no in-driver
>>>XDP) to guarantee this works.
>>>
>>>3 Netdev model, "ens1", enslaves "ens1nbackup" and "ens2"
>>>- Exposes 2 netdevs "ens1" and "ens1nbackup" when only virtio is present
>>>- No extra qdisc or locking
>>>- All virtio_net original functionality still present
>>>- Not able to transition from virtio to virtio w/ backup without
>>>disruption (requires hot-plug)
>>>
>>>The way I see it the only way your team setup could work would be
>>>something closer to the 3 netdev model. Basically we would be
>>>requiring the user to always have the team0 present in order to make
>>>certain that anything like XDP would be run on the team interface
>>>instead of assuming that the virtio_net could run by itself. I will
>>>add it as a third option here to compare to the other 2.
>>
>> Yes.
>>
>>
>>>
>>>3 Netdev "team" model, "team0", enslaves "ens1" and "ens2"
>>>- Requires guest to configure teamd
>>>- Exposes "team0" and "ens1" when only virtio is present
>>>- No extra qdisc or locking
>>>- Doesn't require "backup" bit in virtio
>>>
>>>>>
>>>>>Also how does this handle any static configuration? I am assuming that
>>>>>everything here assumes the team will be brought up as soon as it is
>>>>>seen and assigned a DHCP address.
>>>>
>>>> Again. You configure whatever you need on the team netdev.
>>>
>>>Just so we are clear, are you then saying that the team0 interface
>>>will always be present with this configuration? You had made it sound
>>
>> Of course.
>>
>>
>>>like it would disappear if you didn't have at least 2 interfaces.
>>
>> Where did I make it sound like that? No.
>
>I think it was a bit of misspeak/misread; specifically I am thinking of:
>  You don't need 2 images. You need only one. The one with the
>  team setup. That's it. If another netdev with the same mac appears,
>  teamd will enslave it and run traffic on it. If not, ok, you'll go only
>  through virtio_net.
>
>I read that as there being no team if the VF wasn't present since you
>would still be going through team and then virtio_net otherwise.

team netdev is always there.


>
>>
>>>
>>>>>
>>>>>The solution as you have proposed seems problematic at best. I don't
>>>>>see how the team solution works without introducing some sort of
>>>>>traffic disruption to either add/remove the VF and bring up/tear down
>>>>>the team interface. At that point we might as well just give up on
>>>>>this piece of live migration support entirely since the disruption was
>>>>>what we were trying to avoid. We might as well just hotplug out the VF
>>>>>and hotplug in a virtio at the same bus device and function number and
>>>>>just let udev take care of renaming it for us. The idea was supposed
>>>>>to be a seamless transition between the two interfaces.
>>>>
>>>> Alex. What you are trying to do in this patchset, and what netvsc
>>>> does, is essentially in-driver bonding. Same mechanism, rx_handler,
>>>> everything. I don't really understand what you are talking about. With
>>>> use of team you will get exactly the same behaviour.
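
For reference, team, bond, netvsc, and this patchset all steal slave
traffic the same way, via netdev_rx_handler_register(). A minimal sketch
of that shared mechanism, with invented function names:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* On receive, redirect the frame so the stack sees it on the upper dev. */
static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
{
	struct sk_buff *skb = *pskb;
	struct net_device *upper = rcu_dereference(skb->dev->rx_handler_data);

	skb->dev = upper;
	return RX_HANDLER_ANOTHER;	/* re-run RX processing on upper dev */
}

/* At enslave time, hook the handler onto the lower (slave) netdev. */
static int bypass_enslave(struct net_device *upper, struct net_device *slave)
{
	return netdev_rx_handler_register(slave, bypass_handle_frame, upper);
}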
>>>
>>>So the goal of the "in-driver bonding" is to make the bonding as
>>>non-intrusive as possible and require as little user intervention as
>>>possible. I agree that much of the handling is the same, however the
>>>control structure and requirements are significantly different. That
>>>has been what I have been trying to explain. You keep wanting to use
>>>the existing structures, but they don't really apply cleanly because
>>>they push control for the interface up into the guest, and that
>>>doesn't make much sense in the case of virtualization. What is
>>>happening here is that we are exposing a bond that the guest should
>>>have no control over, or at least as little as possible. In addition
>>>making the user have to add additional configuration in the guest
>>>means that there is that much more that can go wrong if they screw it
>>>up.
>>>
>>>The other problem here is that the transition needs to be as seamless
>>>as possible between just a standard virtio_net setup and this new
>>>setup. With either the team or bonding setup you end up essentially
>>>forcing the guest to have the bond/team always there even if they are
>>>running only a single interface. Only if they "upgrade" the VM by
>>>adding a VF then it finally gets to do anything.
>>
>> Yeah. There is certainly a dilemma. We have to choose between
>> 1) weird and hackish in-driver semi-bonding that would be simple
>>    for user.
>>> 2) the standard way that would be perhaps slightly more complicated
>>    for user.
>
>The problem is that for us option 2 is quite a bit uglier. Basically it
>means essentially telling all the distros and such that their cloud
>images have to use team by default on all virtio_net interfaces. It
>pretty much means we have to throw this away as a possible solution,
>since you are requiring guest changes that most customers/OS vendors
>would never accept.
>
>At least with our solution it was the driver making use of the
>functionality if a given feature bit was set. The teaming solution as
>proposed doesn't even give us that option.
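
To illustrate the point: with a feature bit the guest driver can decide
at probe time, with no guest-side configuration at all. A sketch, where
virtnet_bypass_register() is a hypothetical helper:

#include <linux/virtio_config.h>

/* Only spawn the bypass master when the hypervisor offers the proposed
 * BACKUP bit; plain virtio_net is left completely untouched otherwise.
 */
static int virtnet_maybe_enable_bypass(struct virtnet_info *vi)
{
	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
		return 0;	/* legacy behaviour, no bypass master */

	return virtnet_bypass_register(vi);	/* hypothetical */
}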

I understand your motivation.


>
>>>
>>>What this comes down to for us is the following requirements:
>>>1. The name of the interface cannot change when going from virtio_net,
>>>to virtio_net being bypassed using a VF. We cannot create an interface
>>>on top of the interface, if anything we need to push the original
>>>virtio_net out of the way so that the new team interface takes its
>>>place in the configuration of the system. Otherwise a VM with VF w/
>>>live migration will require a different configuration than one that
>>>just runs virtio_net.
>>
>> Team driver netdev is still the same, no name changes.
>
>Right. Basically we need to have the renaming occur so that any
>existing config gets moved to the upper interface instead of having to
>rely on configuration being adjusted for the team interface.

The initial name of the team netdevice is totally up to you.


>
>>>2. We need some way to signal if this VM should be running in an
>>>"upgraded" mode or not. We have been using the backup bit in
>>>virtio_net to do that. If it isn't "upgraded" then we don't need the
>>>team/bond and we can just run with virtio_net.
>>
>> I don't see why the team cannot be there always.
>
>It is more of a logistical nightmare. Part of the goal here was to work
>with the cloud base images that are out there such as
>https://alt.fedoraproject.org/cloud/. With just the kernel changes the
>overhead for this stays fairly small and would be pulled in as just a
>standard part of the kernel update process. The virtio bypass only
>pops up if the backup bit is present. With the team solution it
>requires that the base image use the team driver on virtio_net when it
>sees one. I doubt the OSVs would want to do that just because SR-IOV
>isn't that popular of a case.

Again, I understand your motivation. Yet I don't like your solution.
But if the decision is made to do this in-driver bonding, I would like
to see it being done in some generic way:
1) share the same "in-driver bonding core" code with netvsc,
   put into net/core.
2) the "in-driver bonding core" will strictly limit the functionality,
   like active-backup mode only, one vf, one backup, and a vf netdev type
   check (so no one could enslave a tap or anything else); see the sketch
   below.
If a user needs something more, he should employ team/bond.
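
A sketch of the kind of type check meant in point 2; the function name
is invented, while dev_is_pci() and ARPHRD_ETHER are existing kernel
symbols:

#include <linux/if_arp.h>
#include <linux/netdevice.h>
#include <linux/pci.h>

/* Only a real, PCI-backed Ethernet device may be enslaved, so nobody
 * can enslave a tap or another software netdev by accident. The
 * one-vf/one-backup limit would be enforced separately by the core.
 */
static bool bypass_may_enslave(struct net_device *slave)
{
	if (slave->type != ARPHRD_ETHER)
		return false;
	if (!slave->dev.parent || !dev_is_pci(slave->dev.parent))
		return false;
	return true;
}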


>
>>>3. We cannot introduce any downtime on the interface when adding a VF
>>>or removing it. The link must stay up the entire time and be able to
>>>handle packets.
>>
>> Sure. That should be handled by the team. Whenever the VF netdev
>> disappears, traffic would switch over to the virtio_net. The benefit of
>> your in-driver bonding solution is that qemu can actually signal the
>> guest driver that the disappearance would happen and do the switch a bit
>> earlier. But that is something that might be implemented in a different
channel where the kernel might get a notification that a certain pci
device is going to disappear, so everyone could prepare. Just an idea.
>
>The signaling isn't too much of an issue since we can just tweak the
>link state of the VF or virtio manually to report the link up or down
>prior to the hot-plug. Now that we are on the same page with the team0

Oh, so you just do "ip link set vfrepresentor down" in the host.
That makes sense. I'm pretty sure that this is not implemented for all
drivers now.
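
A sketch of what that could look like in a representor driver; every
name below is hypothetical (mlx5 has its own equivalent), while
IFLA_VF_LINK_STATE_DISABLE is the existing uapi value:

#include <linux/if_link.h>
#include <linux/netdevice.h>

/* Closing the VF representor on the host forces the VF's operational
 * link down, which the guest then sees as loss of carrier on the VF.
 */
static int repr_ndo_stop(struct net_device *repr_dev)
{
	struct repr_priv *priv = netdev_priv(repr_dev);		/* hypothetical */

	return hw_set_vf_link_state(priv->hw, priv->vf_index,	/* hypothetical */
				    IFLA_VF_LINK_STATE_DISABLE);
}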



>interface always being there, I don't think 3 is much of a concern
>with the solution as you proposed.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22  8:11                             ` Jiri Pirko
@ 2018-02-22 11:54                               ` Or Gerlitz
  2018-02-22 13:07                                 ` Jiri Pirko
  2018-02-22 21:30                                 ` [virtio-dev] " Alexander Duyck
  2018-02-22 21:30                               ` Alexander Duyck
  2 siblings, 1 reply; 121+ messages in thread
From: Or Gerlitz @ 2018-02-22 11:54 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexander Duyck, Jakub Kicinski, Samudrala, Sridhar,
	Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jason Wang, Siwei Liu

On Thu, Feb 22, 2018 at 10:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.duyck@gmail.com wrote:

>>The signaling isn't too much of an issue since we can just tweak the
>>link state of the VF or virtio manually to report the link up or down
>>prior to the hot-plug. Now that we are on the same page with the team0

> Oh, so you just do "ip link set vfrepresentor down" in the host.
> That makes sense. I'm pretty sure that this is not implemented for all
> drivers now.

mlx5 supports that; on the representor close ndo we take the VF's
operational v-link down.

We should probably also put into the picture some/more aspects
from the host side of things. The provisioning of the v-switch now
has to deal with two channels going into the VM, the PV (virtio)
one and the PT (VF) one.

This should probably boil down to applying teaming/bonding between
the VF representor and a PV backend device, e.g. TAP.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22 11:54                               ` Or Gerlitz
@ 2018-02-22 13:07                                 ` Jiri Pirko
  2018-02-22 15:30                                     ` [virtio-dev] " Alexander Duyck
  0 siblings, 1 reply; 121+ messages in thread
From: Jiri Pirko @ 2018-02-22 13:07 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexander Duyck, Jakub Kicinski, Samudrala, Sridhar,
	Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jason Wang, Siwei Liu

Thu, Feb 22, 2018 at 12:54:45PM CET, gerlitz.or@gmail.com wrote:
>On Thu, Feb 22, 2018 at 10:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.duyck@gmail.com wrote:
>
>>>The signaling isn't too much of an issue since we can just tweak the
>>>link state of the VF or virtio manually to report the link up or down
>>>prior to the hot-plug. Now that we are on the same page with the team0
>
>> Oh, so you just do "ip link set vfrepresentor down" in the host.
>> That makes sense. I'm pretty sure that this is not implemented for all
>> drivers now.
>
>mlx5 supports that; on the representor close ndo we take the VF's
>operational v-link down.
>
>We should probably also put into the picture some/more aspects
>from the host side of things. The provisioning of the v-switch now
>has to deal with two channels going into the VM, the PV (virtio)
>one and the PT (VF) one.
>
>This should probably boil down to applying teaming/bonding between
>the VF representor and a PV backend device, e.g. TAP.

Yes, that is correct.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22 13:07                                 ` Jiri Pirko
@ 2018-02-22 15:30                                     ` Alexander Duyck
  0 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-22 15:30 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Or Gerlitz, Jakub Kicinski, Samudrala, Sridhar,
	Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Duyck,
	Alexander H, Jason Wang, Siwei Liu

On Thu, Feb 22, 2018 at 5:07 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Feb 22, 2018 at 12:54:45PM CET, gerlitz.or@gmail.com wrote:
>>On Thu, Feb 22, 2018 at 10:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.duyck@gmail.com wrote:
>>
>>>>The signaling isn't too much of an issue since we can just tweak the
>>>>link state of the VF or virtio manually to report the link up or down
>>>>prior to the hot-plug. Now that we are on the same page with the team0
>>
>>> Oh, so you just do "ip link set vfrepresentor down" in the host.
>>> That makes sense. I'm pretty sure that this is not implemented for all
>>> drivers now.
>>
>>mlx5 supports that; on the representor close ndo we take the VF's
>>operational v-link down.
>>
>>We should probably also put into the picture some/more aspects
>>from the host side of things. The provisioning of the v-switch now
>>has to deal with two channels going into the VM, the PV (virtio)
>>one and the PT (VF) one.
>>
>>This should probably boil down to applying teaming/bonding between
>>the VF representor and a PV backend device, e.g. TAP.
>
> Yes, that is correct.

That was my thought on it. If you wanted to, you could probably even
look at making the PV the active one in the pair from the host side,
to avoid the PCIe overhead for things like broadcast/multicast. The
only limitation is that you might need to have the bond take care of
the appropriate switchdev bits so that you still program rules into
the hardware even if you are transmitting down the PV side of the
device.

For legacy setups I still need to work on putting together a source
mode macvlan based setup to handle acting like port representors for
the VFs and uplink.

- Alex

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22  8:11                             ` Jiri Pirko
@ 2018-02-22 21:30                                 ` Alexander Duyck
  2018-02-22 21:30                                 ` [virtio-dev] " Alexander Duyck
  2018-02-22 21:30                               ` Alexander Duyck
  2 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-22 21:30 UTC (permalink / raw)
  To: Jiri Pirko, Stephen Hemminger
  Cc: Jakub Kicinski, Samudrala, Sridhar, Michael S. Tsirkin,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jason Wang, Siwei Liu

On Thu, Feb 22, 2018 at 12:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.duyck@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.duyck@gmail.com wrote:
>>>>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.duyck@gmail.com wrote:
>>>>>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.duyck@gmail.com wrote:
>>>>>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubakici@wp.pl wrote:
>>>>>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>>>>>> stuck with this ugly thing forever...
>>>>>>>>>>>
>>>>>>>>>>> Could you at least make some common code that is shared in between
>>>>>>>>>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>>>>>>>>>
>>>>>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>>>>>(virtio) bug-compatible with netvsc.
>>>>>>>>>
>>>>>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinion
>>>>>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>>>>>> and make the solution based on team/bond.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>>>>>using NM.
>>>>>>>>>
>>>>>>>>> Can be done in NM, networkd or other network management tools.
>>>>>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>>>>>
>>>>>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>>>>>> and a half.
>>>>>>>>>
>>>>>>>>> You can just run teamd with config option "kidnap" like this:
>>>>>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>>>>>
>>>>>>>>> Whenever teamd sees another netdev appear with the same mac as
>>>>>>>>> its own, or sees another netdev change its mac to match,
>>>>>>>>> it enslaves it.
>>>>>>>>>
>>>>>>>>> Here's the patch (quick and dirty):
>>>>>>>>>
>>>>>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>>>>>
>>>>>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>>>>>
>>>>>>>>So this doesn't really address the original problem we were trying to
>>>>>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>>>>>has to do with configuration. Specifically what our patch is
>>>>>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>>>>>upgrade their customer to SR-IOV support and live migration without
>>>>>>>>requiring them to reconfigure their guest. So the general idea with
>>>>>>>>our patch is to take a VM that is running with virtio_net only and
>>>>>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>>>>>that was originally configured for virtio only.
>>>>>>>>
>>>>>>>>The problem with your solution is we already have teaming and bonding
>>>>>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>>>>>That is all well and good as long as you are willing to keep around
>>>>>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>>>>>
>>>>>>> You don't need 2 images. You need only one. The one with the team setup.
>>>>>>> That's it. If another netdev with the same mac appears, teamd will
>>>>>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>>>>>> virtio_net.
>>>>>>
>>>>>>Isn't that going to cause the routing table to get messed up when we
>>>>>>rearrange the netdevs? We don't want to have a significant disruption
>>>>>>in traffic when we are adding/removing the VF. It seems like we would
>>>>>>need to invalidate any entries that were configured for the virtio_net
>>>>>>and reestablish them on the new team interface. Part of the criteria
>>>>>>we have been working with is that we should be able to transition from
>>>>>>having a VF to not or vice versa without seeing any significant
>>>>>>disruption in the traffic.
>>>>>
>>>>> What? You have routes on the team netdev. virtio_net and VF are only
>>>>> slaves. What are you talking about? I don't get it :/
>>>>
>>>>So let's walk through this by example. The general idea of the base case
>>>>for all this is somebody starting with virtio_net, we will call the
>>>>interface "ens1" for now. It comes up and is assigned a dhcp address
>>>>and everything works as expected. Now in order to get better
>>>>performance we want to add a VF "ens2", but we don't want a new IP
>>>>address. Now if I understand correctly what will happen is that when
>>>>"ens2" appears on the system teamd will then create a new team
>>>>interface "team0". Before teamd can enslave ens1 it has to down the
>>>
>>> No, you don't understand that correctly.
>>>
>>> There is always ens1 and team0. ens1 is a slave of team0. team0 is the
>>> interface to use, to set ip on etc.
>>>
>>> When ens2 appears, it gets enslaved to team0 as well.
>>>
>>>
>>>>interface if I understand things correctly. This means that we have to
>>>>disrupt network traffic in order for this to work.
>>>>
>>>>To give you an idea of where we were before this became about trying
>>>>to do this in the team or bonding driver, we were debating a 2 netdev
>>>>model versus a 3 netdev model. I will call out the model and the
>>>>advantages/disadvantages of those below.
>>>>
>>>>2 Netdev model, "ens1", enslaves "ens2".
>>>>- Requires dropping in-driver XDP in order to work (won't capture VF
>>>>traffic otherwise)
>>>>- VF takes performance hit for extra qdisc/Tx queue lock of virtio_net interface
>>>>- If you ass-u-me (I haven't been a fan of this model if you can't
>>>>tell) that it is okay to rip out in-driver XDP from virtio_net, then
>>>>you could transition between base virtio, virtio w/ backup bit set.
>>>>- Works for netvsc because they limit their features (no in-driver
>>>>XDP) to guarantee this works.
>>>>
>>>>3 Netdev model, "ens1", enslaves "ens1nbackup" and "ens2"
>>>>- Exposes 2 netdevs "ens1" and "ens1nbackup" when only virtio is present
>>>>- No extra qdisc or locking
>>>>- All virtio_net original functionality still present
>>>>- Not able to transition from virtio to virtio w/ backup without
>>>>disruption (requires hot-plug)
>>>>
>>>>The way I see it the only way your team setup could work would be
>>>>something closer to the 3 netdev model. Basically we would be
>>>>requiring the user to always have the team0 present in order to make
>>>>certain that anything like XDP would be run on the team interface
>>>>instead of assuming that the virtio_net could run by itself. I will
>>>>add it as a third option here to compare to the other 2.
>>>
>>> Yes.
>>>
>>>
>>>>
>>>>3 Netdev "team" model, "team0", enslaves "ens1" and "ens2"
>>>>- Requires guest to configure teamd
>>>>- Exposes "team0" and "ens1" when only virtio is present
>>>>- No extra qdisc or locking
>>>>- Doesn't require "backup" bit in virtio
>>>>
>>>>>>
>>>>>>Also how does this handle any static configuration? I am assuming that
>>>>>>everything here assumes the team will be brought up as soon as it is
>>>>>>seen and assigned a DHCP address.
>>>>>
>>>>> Again. You configure whatever you need on the team netdev.
>>>>
>>>>Just so we are clear, are you then saying that the team0 interface
>>>>will always be present with this configuration? You had made it sound
>>>
>>> Of course.
>>>
>>>
>>>>like it would disappear if you didn't have at least 2 interfaces.
>>>
>>> Where did I make it sound like that? No.
>>
>>I think it was a bit of misspeak/misread; specifically I am thinking of:
>>  You don't need 2 images. You need only one. The one with the
>>  team setup. That's it. If another netdev with the same mac appears,
>>  teamd will enslave it and run traffic on it. If not, ok, you'll go only
>>  through virtio_net.
>>
>>I read that as there being no team if the VF wasn't present since you
>>would still be going through team and then virtio_net otherwise.
>
> team netdev is always there.
>
>
>>
>>>
>>>>
>>>>>>
>>>>>>The solution as you have proposed seems problematic at best. I don't
>>>>>>see how the team solution works without introducing some sort of
>>>>>>traffic disruption to either add/remove the VF and bring up/tear down
>>>>>>the team interface. At that point we might as well just give up on
>>>>>>this piece of live migration support entirely since the disruption was
>>>>>>what we were trying to avoid. We might as well just hotplug out the VF
>>>>>>and hotplug in a virtio at the same bus device and function number and
>>>>>>just let udev take care of renaming it for us. The idea was supposed
>>>>>>to be a seamless transition between the two interfaces.
>>>>>
>>>>> Alex. What you are trying to do in this patchset, and what netvsc
>>>>> does, is essentially in-driver bonding. Same mechanism, rx_handler,
>>>>> everything. I don't really understand what you are talking about. With
>>>>> use of team you will get exactly the same behaviour.
>>>>
>>>>So the goal of the "in-driver bonding" is to make the bonding as
>>>>non-intrusive as possible and require as little user intervention as
>>>>possible. I agree that much of the handling is the same, however the
>>>>control structure and requirements are significantly different. That
>>>>has been what I have been trying to explain. You keep wanting to use
>>>>the existing structures, but they don't really apply cleanly because
>>>>they push control for the interface up into the guest, and that
>>>>doesn't make much sense in the case of virtualization. What is
>>>>happening here is that we are exposing a bond that the guest should
>>>>have no control over, or at least as little as possible. In addition
>>>>making the user have to add additional configuration in the guest
>>>>means that there is that much more that can go wrong if they screw it
>>>>up.
>>>>
>>>>The other problem here is that the transition needs to be as seamless
>>>>as possible between just a standard virtio_net setup and this new
>>>>setup. With either the team or bonding setup you end up essentially
>>>>forcing the guest to have the bond/team always there even if they are
>>>>running only a single interface. Only if they "upgrade" the VM by
>>>>adding a VF then it finally gets to do anything.
>>>
>>> Yeah. There is certainly a dilemma. We have to choose between
>>> 1) a weird and hackish in-driver semi-bonding that would be simple
>>>    for the user.
>>> 2) the standard way, which would be perhaps slightly more complicated
>>>    for the user.
>>
>>The problem is that for us option 2 is quite a bit uglier. Basically it
>>means essentially telling all the distros and such that their cloud
>>images have to use team by default on all virtio_net interfaces. It
>>pretty much means we have to throw this away as a possible solution,
>>since you are requiring guest changes that most customers/OS vendors
>>would never accept.
>>
>>At least with our solution it was the driver making use of the
>>functionality if a given feature bit was set. The teaming solution as
>>proposed doesn't even give us that option.
>
> I understand your motivation.
>
>
>>
>>>>
>>>>What this comes down to for us is the following requirements:
>>>>1. The name of the interface cannot change when going from virtio_net,
>>>>to virtio_net being bypassed using a VF. We cannot create an interface
>>>>on top of the interface; if anything, we need to push the original
>>>>virtio_net out of the way so that the new team interface takes its
>>>>place in the configuration of the system. Otherwise a VM with VF w/
>>>>live migration will require a different configuration than one that
>>>>just runs virtio_net.
>>>
>>> Team driver netdev is still the same, no name changes.
>>
>>Right. Basically we need to have the renaming occur so that any
>>existing config gets moved to the upper interface instead of having to
>>rely on configuration being adjusted for the team interface.
>
> The initial name of team netdevice is totally up to you.
>
>
>>
>>>>2. We need some way to signal if this VM should be running in an
>>>>"upgraded" mode or not. We have been using the backup bit in
>>>>virtio_net to do that. If it isn't "upgraded" then we don't need the
>>>>team/bond and we can just run with virtio_net.
>>>
>>> I don't see why the team cannot be there always.
>>
>>It is more of a logistical nightmare. Part of the goal here was to work
>>with the cloud base images that are out there such as
>>https://alt.fedoraproject.org/cloud/. With just the kernel changes the
>>overhead for this stays fairly small and would be pulled in as just a
>>standard part of the kernel update process. The virtio bypass only
>>pops up if the backup bit is present. With the team solution it
>>requires that the base image use the team driver on virtio_net when it
>>sees one. I doubt the OSVs would want to do that just because SR-IOV
>>isn't that popular of a case.
>
> Again, I understand your motivation. Yet I don't like your solution.
> But if the decision is made to do this in-driver bonding, I would like
> to see it being done in some generic way:
> 1) share the same "in-driver bonding core" code with netvsc,
>    put into net/core.
> 2) the "in-driver bonding core" will strictly limit the functionality,
>    like active-backup mode only, one VF, one backup, VF netdev type
>    check (so no one could enslave a tap or anything else)
> If the user needs something more, they should employ team/bond.

I'll have to do some research and get back to you with our final
decision on this. There was some internal resistance to splitting out
this code as a separate module, but I think it would need to happen in
order to support multiple drivers.
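
For reference, the rx_handler mechanism such a shared "in-driver bonding
core" would be built around looks roughly like the sketch below. This is
only an illustration of the technique, not code from the patch set; the
registration call is real kernel API, but the function name and the
master/slave variables are made up here.

static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
{
	struct sk_buff *skb = *pskb;
	struct net_device *master;

	/* rx_handler_data was set to the master netdev at enslave time */
	master = rcu_dereference(skb->dev->rx_handler_data);
	if (!master)
		return RX_HANDLER_PASS;

	/* re-inject so the stack sees the frame on the master netdev */
	skb->dev = master;
	return RX_HANDLER_ANOTHER;
}

/* at enslave time, for each slave (virtio backup or VF active): */
err = netdev_rx_handler_register(slave, bypass_handle_frame, master);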

Also I would be curious how Stephen feels about this. Would the
sharing of the dev, and the use of the phys_port_name on the
base/backup netdev work for netvsc? It seems like it should get them
performance gains on the VF, but I am not sure if there are any
specific requirements that mandated they have 2 netdevs.
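
To make the question concrete: on the virtio side the phys_port_name
idea amounts to something like the sketch below. This is a rough sketch
only; virtnet_is_backup() is a made-up helper standing in for however
the driver tracks the BACKUP feature bit.

static int virtnet_get_phys_port_name(struct net_device *dev,
				      char *buf, size_t len)
{
	/* legacy virtio_net devices keep their old behaviour */
	if (!virtnet_is_backup(dev))
		return -EOPNOTSUPP;

	/* udev can then name this netdev e.g. "ens1nbackup" */
	if (snprintf(buf, len, "backup") >= len)
		return -EOPNOTSUPP;
	return 0;
}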

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22  2:35             ` [virtio-dev] " Samudrala, Sridhar
@ 2018-02-23 22:22               ` Siwei Liu
  -1 siblings, 0 replies; 121+ messages in thread
From: Siwei Liu @ 2018-02-23 22:22 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Netdev, Alexander Duyck, virtualization,
	David Miller

On Wed, Feb 21, 2018 at 6:35 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 2/21/2018 5:59 PM, Siwei Liu wrote:
>>
>> On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>>
>>> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>>>>
>>>> I haven't checked email for days and did not realize the new revision
>>>> had already come out. And thank you for the effort; this revision
>>>> really looks to be a step forward towards our use case and is close to
>>>> what we wanted to do. A few questions in line.
>>>>
>>>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>>>> <alexander.duyck@gmail.com> wrote:
>>>>>
>>>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
>>>>>>
>>>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>>>>
>>>>>>> Patch 2 is in response to the community request for a 3 netdev
>>>>>>> solution.  However, it creates some issues we'll get into in a
>>>>>>> moment.
>>>>>>> It extends virtio_net to use alternate datapath when available and
>>>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>>>> an additional 'bypass' netdev that acts as a master device and
>>>>>>> controls
>>>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active'
>>>>>>> netdev
>>>>>>> as default for transmits when it is available with link up and
>>>>>>> running.
>>>>>>
>>>>>> Thank you for doing this.
>>>>>>
>>>>>>> We noticed a couple of issues with this approach during testing.
>>>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>>>    virtio pci device, udev tries to rename both of them with the same
>>>>>>> name
>>>>>>>    and the 2nd rename will fail. This would be OK as long as the
>>>>>>> first netdev
>>>>>>>    to be renamed is the 'bypass' netdev, but the order in which udev
>>>>>>> gets
>>>>>>>    to rename the 2 netdevs is not reliable.
>>>>>>
>>>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>>>> struct device?
>>>>>
>>>>> The basic idea of all this is that we wanted this to work with an
>>>>> existing VM image that was using virtio. As such we were trying to
>>>>> make it so that the bypass interface takes the place of the original
>>>>> virtio and get udev to rename the bypass to what the original
>>>>> virtio_net was.
>>>>
>>>> Could it also be made possible to take over the config from the VF instead
>>>> of virtio on an existing VM image, and have udev rename the bypass
>>>> netdev to what the original VF was? I'm not saying to tightly bind the
>>>> bypass master to only virtio or a VF, but I think we should provide both
>>>> options to support different upgrade paths. Possibly we could tweak
>>>> the device tree layout to reuse the same PCI slot for the master
>>>> bypass netdev, so that udev would not get confused when renaming the
>>>> device. The VF would need to use a different function slot afterwards.
>>>> Perhaps we might need a special multiseat-like QEMU device for that
>>>> purpose?
>>>>
>>>> In our case we'll upgrade the config from the VF to virtio-bypass directly.
>>>
>>> So if I am understanding what you are saying, you want to flip
>>> the backup interface from the virtio to a VF. The problem is that
>>> becomes a bit of a vendor lock-in solution since it would rely on a
>>> specific VF driver. I would agree with Jiri that we don't want to go
>>> down that path. We don't want every VF out there firing up its own
>>> separate bond. Ideally you want the hypervisor to be able to manage
>>> all of this which is why it makes sense to have virtio manage this and
>>> why this is associated with the virtio_net interface.
>>
>> No, that's not what I was talking about of course. I thought you
>> mentioned the upgrade scenario this patch would like to address is to
>> use the bypass interface "to take the place of the original virtio,
>> and get udev to rename the bypass to what the original virtio_net
>> was". That is one of the possible upgrade paths for sure. However the
>> upgrade path I was seeking is to use the bypass interface to take the
>> place of the original VF interface while retaining the name and network
>> configs, which generally can be done simply with a kernel upgrade. It
>> would become limiting as this patch makes the bypass interface share
>> the same virtio pci device with the virtio backup. Can this bypass
>> interface be made general enough to take the place of any PCI device
>> other than virtio-net? This would be more helpful, as cloud users who
>> have an existing setup on a VF interface wouldn't have to recreate it
>> on virtio-net and the VF separately again.
>
>
> Yes. This sounds interesting. Looks like you want an existing VM image with
> a VF-only configuration to get transparent live migration support by adding
> virtio_net with the BACKUP feature.  We may need another feature bit to switch
> between these 2 options.

Yes, that's what I was thinking about. I had been building something
like this before, and would like to get back to it after merging with
your patch.

>
>
>
>>
>>> The other bits get into more complexity than we are ready to handle
>>> for now. I think I might have talked about something similar that I
>>> was referring to as a "virtio-bond", where you would have a PCI/PCIe
>>> tree topology that makes this easier to sort out, and the "virtio-bond"
>>> would be used to handle coordination/configuration of a much more
>>> complex interface.
>>
>> That was one way to solve this problem, but I'd like to see simpler
>> ways to sort it out.
>>
>>>>>> FWIW two solutions that immediately come to mind is to export "backup"
>>>>>> as phys_port_name of the backup virtio link and/or assign a name to
>>>>>> the
>>>>>> master like you are doing already.  I think team uses team%d and bond
>>>>>> uses bond%d, soft naming of master devices seems quite natural in this
>>>>>> case.
>>>>>
>>>>> I figured I had overlooked something like that... Thanks for pointing
>>>>> this out. Okay so I think the phys_port_name approach might resolve
>>>>> the original issue. If I am reading things correctly what we end up
>>>>> with is the master showing up as "ens1" for example and the backup
>>>>> showing up as "ens1nbackup". Am I understanding that right?
>>>>>
>>>>> The problem with the team/bond%d approach is that it creates a new
>>>>> netdevice and so it would require guest configuration changes.
>>>>>
>>>>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>>>>> link is quite neat.
>>>>>
>>>>> I agree. For non-"backup" virio_net devices would it be okay for us to
>>>>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>>>>> behavior could be maintained although the function still exists.
>>>>>
>>>>>>> - When the 'active' netdev is unplugged OR not present on a
>>>>>>> destination
>>>>>>>    system after live migration, the user will see 2 virtio_net
>>>>>>> netdevs.
>>>>>>
>>>>>> That's necessary and expected, all configuration applies to the master
>>>>>> so the master must exist.
>>>>>
>>>>> With the naming issue resolved this is the only item left outstanding.
>>>>> This becomes a matter of form vs function.
>>>>>
>>>>> The main complaint about the "3 netdev" solution is that it is a bit confusing
>>>>> to have the 2 netdevs present if the VF isn't there. The idea is that
>>>>> having the extra "master" netdev there if there isn't really a bond is
>>>>> a bit ugly.
>>>>
>>>> Is it really uglier in terms of user experience rather than
>>>> functionality? I don't want it dynamically changing between 2-netdev
>>>> and 3-netdev depending on the presence of the VF. That gets back to my
>>>> original question and suggestion earlier: why not just hide the lower
>>>> netdevs from udev renaming and such? What important observability
>>>> benefits would users get from exposing the lower netdevs?
>>>>
>>>> Thanks,
>>>> -Siwei
>>>
>>> The only real advantage to a 2 netdev solution is that it looks like
>>> the netvsc solution; however, it doesn't behave like it, since there are
>>> some features like XDP that may not function correctly if they are
>>> left enabled in the virtio_net interface.
>>>
>>> As far as functionality the advantage of not hiding the lower devices
>>> is that they are free to be managed. The problem with pushing all of
>>> the configuration into the upper device is that you are limited to the
>>> intersection of the features of the lower devices. This can be
>>> limiting for some setups as some VFs support things like more queues,
>>> or better interrupt moderation options than others so trying to make
>>> everything work with one config would be ugly.
>>
>> It depends on how you build it and the way you expect it to work. IMHO
>> the lower devices don't need to be directly managed at all, otherwise
>> it ends up with loss of configuration across migration, and it really
>> does not bring much more value than having a general team or bond device.
>> Users still have to reconfigure those queue settings and interrupt
>> moderation options after all. The new upper device could take the
>> assumption that the VF/PT lower device always has a feature set superior
>> to virtio-net's in order to apply advanced configuration. The upper
>> device should remember all configurations previously done and apply
>> the supported ones to the active device automatically when switching the
>> datapath.
>>
> It should be possible to extend this patchset to support migration of
> additional settings by enabling additional ndo_ops and ethtool_ops on the
> upper dev, propagating them down to the lower devices, and replaying the
> settings after the VF is replugged after migration.

Indeed. But your 3rd patch collapses this merit of the 3-netdev model
back into the former 2-netdev model - I hope it's just for demonstrating
the possibility of dynamically switching ndo_ops and ethtool_ops, and not
a direction that would make implementing this further more complicated.
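
For the record, the dynamic switching in question boils down to choosing
the ops tables at probe time based on the feature bit, roughly like this
(the names of the alternate bypass tables are assumed here, not checked
against the patch):

/* at probe time, before register_netdev(dev) */
if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) {
	dev->netdev_ops  = &virtnet_bypass_netdev_ops;
	dev->ethtool_ops = &virtnet_bypass_ethtool_ops;
} else {
	dev->netdev_ops  = &virtnet_netdev;
	dev->ethtool_ops = &virtnet_ethtool_ops;
}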

Thanks,
-Siwei
>
> Thanks
> Sridhar

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-23 22:22               ` [virtio-dev] " Siwei Liu
  (?)
@ 2018-02-23 22:38               ` Jiri Pirko
  2018-02-24  0:17                   ` [virtio-dev] " Siwei Liu
  -1 siblings, 1 reply; 121+ messages in thread
From: Jiri Pirko @ 2018-02-23 22:38 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, Alexander Duyck,
	virtualization, Netdev, David Miller

Fri, Feb 23, 2018 at 11:22:36PM CET, loseweigh@gmail.com wrote:

[...]

>>>
>>> No, that's not what I was talking about, of course. I thought you
>>> mentioned that the upgrade scenario this patch would like to address is
>>> to use the bypass interface "to take the place of the original virtio,
>>> and get udev to rename the bypass to what the original virtio_net
>>> was". That is one of the possible upgrade paths for sure. However, the
>>> upgrade path I was seeking is to use the bypass interface to take the
>>> place of the original VF interface while retaining the name and network
>>> configs, which generally can be done simply with a kernel upgrade. It
>>> would become limiting as this patch makes the bypass interface share
>>> the same virtio pci device with the virtio backup. Can this bypass
>>> interface be made general enough to take the place of any pci device
>>> other than virtio-net? This would be more helpful, as cloud users who
>>> have an existing setup on a VF interface wouldn't have to recreate it
>>> on virtio-net and VF separately again.

How could that work? If you have the VF netdev with all configuration
including IPs and routes and whatever - now you want to do migration,
so you add virtio_net and do some weird in-driver bonding with it. But
then, the VF disappears, and with it the VF netdev and all the
configuration it had.
I don't think this scenario is valid.


>>
>>
>> Yes. This sounds interesting. Looks like you want an existing VM image
>> with a VF-only configuration to get transparent live migration support
>> by adding virtio_net with the BACKUP feature.  We may need another
>> feature bit to switch between these 2 options.
>
>Yes, that's what I was thinking about. I have been building something
>like this before, and would like to get back to it after merging with
>your patch.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22 21:30                                 ` [virtio-dev] " Alexander Duyck
  (?)
@ 2018-02-23 23:59                                 ` Stephen Hemminger
  2018-02-25 22:21                                   ` Alexander Duyck
                                                     ` (2 more replies)
  -1 siblings, 3 replies; 121+ messages in thread
From: Stephen Hemminger @ 2018-02-23 23:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	Netdev, David Miller

On Thu, 22 Feb 2018 13:30:12 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> > Again, I understand your motivation. Yet I don't like your solution.
> > But if the decision is made to do this in-driver bonding, I would like
> > to see it being done in some generic way:
> > 1) share the same "in-driver bonding core" code with netvsc
> >    put to net/core.
> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >    like active-backup mode only, one vf, one backup, vf netdev type
> >    check (so no one could enslave a tap or anything else)
> > If the user needs something more, he should employ team/bond.

Sharing would be good, but the netvsc world would really like to have
only one visible network device.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-22  0:17         ` [virtio-dev] " Alexander Duyck
                           ` (2 preceding siblings ...)
  (?)
@ 2018-02-24  0:03         ` Stephen Hemminger
  2018-02-25 22:17             ` [virtio-dev] " Alexander Duyck
  2018-02-25 22:17           ` Alexander Duyck
  -1 siblings, 2 replies; 121+ messages in thread
From: Stephen Hemminger @ 2018-02-24  0:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Netdev, virtualization, Siwei Liu,
	Sridhar Samudrala, David Miller

(pruned to reduce thread)

On Wed, 21 Feb 2018 16:17:19 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> >>> FWIW, two solutions that immediately come to mind are to export "backup"
> >>> as the phys_port_name of the backup virtio link and/or assign a name to
> >>> the master like you are doing already.  I think team uses team%d and bond
> >>> uses bond%d; soft naming of master devices seems quite natural in this
> >>> case.
> >>
> >> I figured I had overlooked something like that.. Thanks for pointing
> >> this out. Okay so I think the phys_port_name approach might resolve
> >> the original issue. If I am reading things correctly what we end up
> >> with is the master showing up as "ens1" for example and the backup
> >> showing up as "ens1nbackup". Am I understanding that right?
> >>
> >> The problem with the team/bond%d approach is that it creates a new
> >> netdevice and so it would require guest configuration changes.
> >>  
> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> >>> link is quite neat.  
> >>
> >> I agree. For non-"backup" virtio_net devices, would it be okay for us
> >> to just return -EOPNOTSUPP? I assume it would be, and that way the
> >> legacy behavior could be maintained although the function still exists.
> >>  
> >>>> - When the 'active' netdev is unplugged OR not present on a destination
> >>>>   system after live migration, the user will see 2 virtio_net netdevs.  
> >>>
> >>> That's necessary and expected, all configuration applies to the master
> >>> so master must exist.  
> >>
> >> With the naming issue resolved this is the only item left outstanding.
> >> This becomes a matter of form vs function.
> >>
> >> The main complaint about the "3 netdev" solution is that it is a bit
> >> confusing to have 2 netdevs present if the VF isn't there. The idea
> >> is that having the extra "master" netdev there when there isn't
> >> really a bond is a bit ugly.
> >
> > Is this uglier in terms of user experience rather than
> > functionality? I don't want it dynamically changing between 2-netdev
> > and 3-netdev depending on the presence of the VF. That gets back to my
> > original question and suggestion earlier: why not just hide the lower
> > netdevs from udev renaming and such? What important observability
> > benefits would users get from exposing the lower netdevs?
> >
> > Thanks,
> > -Siwei  
> 
> The only real advantage to a 2 netdev solution is that it looks like
> the netvsc solution; however, it doesn't behave like it, since there
> are some features, like XDP, that may not function correctly if they
> are left enabled in the virtio_net interface.
> 
> As far as functionality goes, the advantage of not hiding the lower
> devices is that they are free to be managed. The problem with pushing
> all of the configuration into the upper device is that you are limited
> to the intersection of the features of the lower devices. This can be
> limiting for some setups, as some VFs support things like more queues
> or better interrupt moderation options than others, so trying to make
> everything work with one config would be ugly.
> 


Let's not make XDP the blocker for doing the best solution
from the end user point of view. XDP is just yet another offload
thing which needs to be handled.  The current backup device solution
used in netvsc doesn't handle the full range of offload options
(things like flow direction, DCB, etc); no one but the HW vendors
seems to care.
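
(As an aside, a minimal sketch of the phys_port_name idea quoted above;
this is an assumed implementation, not the actual patch code. With
something like this, predictable naming could yield e.g. "ens1" for the
master and "ens1nbackup" for the backup link, as discussed:)

/* Report "backup" as the phys port name when the BACKUP feature was
 * negotiated; legacy devices keep returning -EOPNOTSUPP.
 */
static int virtnet_get_phys_port_name(struct net_device *dev,
				      char *buf, size_t len)
{
	struct virtnet_info *vi = netdev_priv(dev);

	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
		return -EOPNOTSUPP;

	if (snprintf(buf, len, "backup") >= len)
		return -EOPNOTSUPP;

	return 0;
}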

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-23 22:38               ` Jiri Pirko
@ 2018-02-24  0:17                   ` Siwei Liu
  0 siblings, 0 replies; 121+ messages in thread
From: Siwei Liu @ 2018-02-24  0:17 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, Alexander Duyck,
	virtualization, Netdev, David Miller

On Fri, Feb 23, 2018 at 2:38 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Fri, Feb 23, 2018 at 11:22:36PM CET, loseweigh@gmail.com wrote:
>
> [...]
>
>>>>
>>>> No, that's not what I was talking about, of course. I thought you
>>>> mentioned that the upgrade scenario this patch would like to address is
>>>> to use the bypass interface "to take the place of the original virtio,
>>>> and get udev to rename the bypass to what the original virtio_net
>>>> was". That is one of the possible upgrade paths for sure. However, the
>>>> upgrade path I was seeking is to use the bypass interface to take the
>>>> place of the original VF interface while retaining the name and network
>>>> configs, which generally can be done simply with a kernel upgrade. It
>>>> would become limiting as this patch makes the bypass interface share
>>>> the same virtio pci device with the virtio backup. Can this bypass
>>>> interface be made general enough to take the place of any pci device
>>>> other than virtio-net? This would be more helpful, as cloud users who
>>>> have an existing setup on a VF interface wouldn't have to recreate it
>>>> on virtio-net and VF separately again.
>
> How could that work? If you have the VF netdev with all configuration
> including IPs and routes and whatever - now you want to do migration,
> so you add virtio_net and do some weird in-driver bonding with it. But
> then, the VF disappears, and with it the VF netdev and all the
> configuration it had.
> I don't think this scenario is valid.

We are talking about making udev aware of the new virtio-bypass device,
so that the name of the old VF interface can be rebound to the
virtio-bypass *post the kernel upgrade*. Of course, this needs the
virtio-net backend to supply the [bdf] info of where the VF/PT device
was located.
-Siwei


>
>
>>>
>>>
>>> Yes. This sounds interesting. Looks like you want an existing VM image
>>> with a VF-only configuration to get transparent live migration support
>>> by adding virtio_net with the BACKUP feature.  We may need another
>>> feature bit to switch between these 2 options.
>>
>>Yes, that's what I was thinking about. I have been building something
>>like this before, and would like to get back to it after merging with
>>your patch.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-24  0:03         ` Stephen Hemminger
@ 2018-02-25 22:17             ` Alexander Duyck
  2018-02-25 22:17           ` Alexander Duyck
  1 sibling, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-25 22:17 UTC (permalink / raw)
  To: Stephen Hemminger, Jakub Kicinski, Michael S. Tsirkin, Siwei Liu,
	Jiri Pirko, Jason Wang
  Cc: Sridhar Samudrala, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H

On Fri, Feb 23, 2018 at 4:03 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> (pruned to reduce thread)
>
> On Wed, 21 Feb 2018 16:17:19 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> >>> FWIW, two solutions that immediately come to mind are to export "backup"
>> >>> as the phys_port_name of the backup virtio link and/or assign a name to
>> >>> the master like you are doing already.  I think team uses team%d and bond
>> >>> uses bond%d; soft naming of master devices seems quite natural in this
>> >>> case.
>> >>
>> >> I figured I had overlooked something like that.. Thanks for pointing
>> >> this out. Okay so I think the phys_port_name approach might resolve
>> >> the original issue. If I am reading things correctly what we end up
>> >> with is the master showing up as "ens1" for example and the backup
>> >> showing up as "ens1nbackup". Am I understanding that right?
>> >>
>> >> The problem with the team/bond%d approach is that it creates a new
>> >> netdevice and so it would require guest configuration changes.
>> >>
>> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>> >>> link is quite neat.
>> >>
>> >> I agree. For non-"backup" virtio_net devices, would it be okay for us
>> >> to just return -EOPNOTSUPP? I assume it would be, and that way the
>> >> legacy behavior could be maintained although the function still exists.
>> >>
>> >>>> - When the 'active' netdev is unplugged OR not present on a destination
>> >>>>   system after live migration, the user will see 2 virtio_net netdevs.
>> >>>
>> >>> That's necessary and expected, all configuration applies to the master
>> >>> so master must exist.
>> >>
>> >> With the naming issue resolved this is the only item left outstanding.
>> >> This becomes a matter of form vs function.
>> >>
>> >> The main complaint about the "3 netdev" solution is that it is a bit
>> >> confusing to have 2 netdevs present if the VF isn't there. The idea
>> >> is that having the extra "master" netdev there when there isn't
>> >> really a bond is a bit ugly.
>> >
>> > Is this uglier in terms of user experience rather than
>> > functionality? I don't want it dynamically changing between 2-netdev
>> > and 3-netdev depending on the presence of the VF. That gets back to my
>> > original question and suggestion earlier: why not just hide the lower
>> > netdevs from udev renaming and such? What important observability
>> > benefits would users get from exposing the lower netdevs?
>> >
>> > Thanks,
>> > -Siwei
>>
>> The only real advantage to a 2 netdev solution is that it looks like
>> the netvsc solution; however, it doesn't behave like it, since there
>> are some features, like XDP, that may not function correctly if they
>> are left enabled in the virtio_net interface.
>>
>> As far as functionality goes, the advantage of not hiding the lower
>> devices is that they are free to be managed. The problem with pushing
>> all of the configuration into the upper device is that you are limited
>> to the intersection of the features of the lower devices. This can be
>> limiting for some setups, as some VFs support things like more queues
>> or better interrupt moderation options than others, so trying to make
>> everything work with one config would be ugly.
>>
>
>
> Let's not make XDP the blocker for doing the best solution
> from the end user point of view. XDP is just yet another offload
> thing which needs to be handled.  The current backup device solution
> used in netvsc doesn't handle the full range of offload options
> (things like flow direction, DCB, etc); no one but the HW vendors
> seems to care.

XDP isn't the blocker here. As far as I am concerned we can go either
way, with a 2 netdev or a 3 netdev solution. We just need to make sure
we are aware of all the trade-offs, and make a decision one way or the
other. This is quickly turning into a bikeshed and I would prefer us
to all agree, or at least disagree and commit, on which way to go
before we burn more cycles on a patch set that seems to be getting
tied up in debate.

With the 2 netdev solution we have to limit the functionality so that
we don't break things when we bypass the guts of the driver to hand
traffic off to the VF. That ends up meaning that we are stuck with an
extra qdisc and Tx queue lock in the transmit path of the VF, and we
cannot rely on any in-driver Rx functionality to work, such as
in-driver XDP. However, the advantage here is that this is how netvsc
is already doing things.
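
(For reference, the usual way a master device avoids that extra qdisc
and Tx queue lock is to be registered as noqueue with lockless Tx; a
sketch, where the setup function and ops names are assumptions:)

static void virtnet_bypass_setup(struct net_device *dev)
{
	ether_setup(dev);

	/* Default to the noqueue qdisc: no extra qdisc on transmit. */
	dev->priv_flags |= IFF_NO_QUEUE;
	/* Lockless Tx: skip the Tx queue lock in the xmit path. */
	dev->features |= NETIF_F_LLTX;

	dev->netdev_ops = &virtnet_bypass_netdev_ops;	/* assumed name */
}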

The issue with the 3 netdev solution is that you are stuck with 2
netdevs ("ens1", "ens1nbackup") when the VF is not present. It could
be argued this isn't a very elegant looking solution, especially when
the VF is not present. With virtio this makes more sense though as you
are still able to expose the full functionality of the lower device so
you don't have to strip or drop any of the existing net device ops if
the "backup" bit is present.

Ultimately I would have preferred to have the 3 netdev solution go with
virtio only, as it would have allowed for healthy competition between
the two designs and we could have seen which one ultimately won out. But
if we have to decide this now, we need to do so before we put too much
more effort into the patches, as these end up becoming two very
different solutions, especially if we have to apply the solution to both
drivers. My preference would still be 3 netdevs, since we could apply
this to netvsc without too many changes, but I will agree with whatever
conclusion we can come to in terms of how this is supposed to work for
both netvsc and virtio_bypass.

- Alex

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-23 23:59                                 ` Stephen Hemminger
@ 2018-02-25 22:21                                     ` Alexander Duyck
  2018-02-25 22:21                                     ` [virtio-dev] " Alexander Duyck
  2018-02-26  7:19                                   ` Jiri Pirko
  2 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-25 22:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jiri Pirko, Jakub Kicinski, Samudrala, Sridhar,
	Michael S. Tsirkin, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

On Fri, Feb 23, 2018 at 3:59 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Thu, 22 Feb 2018 13:30:12 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> > Again, I understand your motivation. Yet I don't like your solution.
>> > But if the decision is made to do this in-driver bonding, I would like
>> > to see it being done in some generic way:
>> > 1) share the same "in-driver bonding core" code with netvsc
>> >    put to net/core.
>> > 2) the "in-driver bonding core" will strictly limit the functionality,
>> >    like active-backup mode only, one vf, one backup, vf netdev type
>> >    check (so no one could enslave a tap or anything else)
>> > If the user needs something more, he should employ team/bond.
>
> Sharing would be good, but the netvsc world would really like to have
> only one visible network device.

Other than the netdev count, are there any other issues we need to be
thinking about?

If I am not mistaken, netvsc doesn't put any broadcast/multicast
filters on the VF. If we ended up doing that in order to support the
virtio based solution, would that cause any issues? I just realized we
had overlooked dealing with multicast in our current solution, so we
will probably be looking at syncing the multicast list like what
occurs in netvsc; however, we will need to do it for both the VF and
the virtio interfaces.
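
(A rough sketch of what that syncing could look like, reusing the core
dev_uc/mc_sync_multiple helpers the way bonding does; the struct and
field names are assumptions. For virtio the same call would be needed
for the 'backup' slave as well:)

/* Propagate the upper device's address lists to the active slave,
 * e.g. from the upper dev's ndo_set_rx_mode.
 */
static void virtnet_bypass_set_rx_mode(struct net_device *dev)
{
	struct virtnet_bypass_info *vbi = netdev_priv(dev);
	struct net_device *slave;

	rcu_read_lock();
	slave = rcu_dereference(vbi->active_netdev);
	if (slave) {
		dev_uc_sync_multiple(slave, dev);
		dev_mc_sync_multiple(slave, dev);
	}
	rcu_read_unlock();
}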

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-23 23:59                                 ` Stephen Hemminger
  2018-02-25 22:21                                   ` Alexander Duyck
  2018-02-25 22:21                                     ` [virtio-dev] " Alexander Duyck
@ 2018-02-26  7:19                                   ` Jiri Pirko
  2018-02-27  1:02                                     ` Stephen Hemminger
  2 siblings, 1 reply; 121+ messages in thread
From: Jiri Pirko @ 2018-02-26  7:19 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, Alexander Duyck,
	virtualization, Siwei Liu, Netdev, David Miller

Sat, Feb 24, 2018 at 12:59:04AM CET, stephen@networkplumber.org wrote:
>On Thu, 22 Feb 2018 13:30:12 -0800
>Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> > Again, I understand your motivation. Yet I don't like your solution.
>> > But if the decision is made to do this in-driver bonding, I would like
>> > to see it being done in some generic way:
>> > 1) share the same "in-driver bonding core" code with netvsc
>> >    put to net/core.
>> > 2) the "in-driver bonding core" will strictly limit the functionality,
>> >    like active-backup mode only, one vf, one backup, vf netdev type
>> >    check (so no one could enslave a tap or anything else)
>> > If the user needs something more, he should employ team/bond.
>
>Sharing would be good, but the netvsc world would really like to have
>only one visible network device.

Why do you mind? All would be the same, there would be just another
netdevice unused by the vm user (same as the vf netdev).

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-26  7:19                                   ` Jiri Pirko
@ 2018-02-27  1:02                                     ` Stephen Hemminger
  2018-02-27  1:18                                         ` [virtio-dev] " Michael S. Tsirkin
  0 siblings, 1 reply; 121+ messages in thread
From: Stephen Hemminger @ 2018-02-27  1:02 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, Alexander Duyck,
	virtualization, Siwei Liu, Netdev, David Miller

On Mon, 26 Feb 2018 08:19:24 +0100
Jiri Pirko <jiri@resnulli.us> wrote:

> Sat, Feb 24, 2018 at 12:59:04AM CET, stephen@networkplumber.org wrote:
> >On Thu, 22 Feb 2018 13:30:12 -0800
> >Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >  
> >> > Again, I understand your motivation. Yet I don't like your solution.
> >> > But if the decision is made to do this in-driver bonding, I would like
> >> > to see it being done in some generic way:
> >> > 1) share the same "in-driver bonding core" code with netvsc
> >> >    put to net/core.
> >> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >> >    like active-backup mode only, one vf, one backup, vf netdev type
> >> >    check (so no one could enslave a tap or anything else)
> >> > If the user needs something more, he should employ team/bond.
> >
> >Sharing would be good, but the netvsc world would really like to have
> >only one visible network device.
> 
> Why do you mind? All would be the same, there would be just another
> netdevice unused by the vm user (same as the vf netdev).
> 

I mind because our requirement is no changes to userspace.
No special udev rules, no bonding script, no setup.

Things like cloudinit running on current distros expect to see a single
eth0.  The VF device showing up can also be an issue, because distros
have stupid rules like Network Manager trying to start DHCP on every
interface.  We deal with that now by doing stuff like udev rules to get
it to stop, but that is still causing user errors.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-27  1:02                                     ` Stephen Hemminger
@ 2018-02-27  1:18                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 121+ messages in thread
From: Michael S. Tsirkin @ 2018-02-27  1:18 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, Alexander Duyck, virtualization, Siwei Liu,
	Netdev, David Miller

On Mon, Feb 26, 2018 at 05:02:18PM -0800, Stephen Hemminger wrote:
> On Mon, 26 Feb 2018 08:19:24 +0100
> Jiri Pirko <jiri@resnulli.us> wrote:
> 
> > Sat, Feb 24, 2018 at 12:59:04AM CET, stephen@networkplumber.org wrote:
> > >On Thu, 22 Feb 2018 13:30:12 -0800
> > >Alexander Duyck <alexander.duyck@gmail.com> wrote:
> > >  
> > >> > Again, I understand your motivation. Yet I don't like your solution.
> > >> > But if the decision is made to do this in-driver bonding, I would like
> > >> > to see it being done in some generic way:
> > >> > 1) share the same "in-driver bonding core" code with netvsc
> > >> >    put to net/core.
> > >> > 2) the "in-driver bonding core" will strictly limit the functionality,
> > >> >    like active-backup mode only, one vf, one backup, vf netdev type
> > >> >    check (so no one could enslave a tap or anything else)
> > >> > If the user needs something more, he should employ team/bond.
> > >
> > >Sharing would be good, but the netvsc world would really like to have
> > >only one visible network device.
> > 
> > Why do you mind? All would be the same, there would be just another
> > netdevice unused by the vm user (same as the vf netdev).
> > 
> 
> I mind because our requirement is no changes to userspace.
> No special udev rules, no bonding script, no setup.

Agreed. It is mostly fine from this point of view, except that you need
to know to skip the slaves.  Maybe we could look at some kind of trick,
e.g. pretending the link is down for the slaves?

> Things like cloudinit running on current distros expect to see a single
> eth0.  The VF device showing up can also be an issue, because distros
> have stupid rules like Network Manager trying to start DHCP on every
> interface.  We deal with that now by doing stuff like udev rules to get
> it to stop, but that is still causing user errors.

So the ideal of a single net device isn't achieved by netvsc.

Since you have scripts to skip the PT device, can't they
hide the PV slave too? How do they identify the device to skip?

I agree it would be nice to have a way to hide the extra netdev
from userspace.

The benefit of the separation is that each slave device can
be configured with e.g. its own native ethtool commands for
optimum performance.

-- 
MST

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-27  1:18                                         ` [virtio-dev] " Michael S. Tsirkin
  (?)
@ 2018-02-27  8:27                                         ` Jiri Pirko
  -1 siblings, 0 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-27  8:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski, Netdev,
	Alexander Duyck, virtualization, Siwei Liu, Samudrala, Sridhar,
	David Miller

Tue, Feb 27, 2018 at 02:18:12AM CET, mst@redhat.com wrote:
>On Mon, Feb 26, 2018 at 05:02:18PM -0800, Stephen Hemminger wrote:
>> On Mon, 26 Feb 2018 08:19:24 +0100
>> Jiri Pirko <jiri@resnulli.us> wrote:
>> 
>> > Sat, Feb 24, 2018 at 12:59:04AM CET, stephen@networkplumber.org wrote:
>> > >On Thu, 22 Feb 2018 13:30:12 -0800
>> > >Alexander Duyck <alexander.duyck@gmail.com> wrote:
>> > >  
>> > >> > Again, I understand your motivation. Yet I don't like your solution.
>> > >> > But if the decision is made to do this in-driver bonding, I would like
>> > >> > to see it being done in some generic way:
>> > >> > 1) share the same "in-driver bonding core" code with netvsc
>> > >> >    put to net/core.
>> > >> > 2) the "in-driver bonding core" will strictly limit the functionality,
>> > >> >    like active-backup mode only, one vf, one backup, vf netdev type
>> > >> >    check (so no one could enslave a tap or anything else)
>> > >> > If the user needs something more, he should employ team/bond.
>> > >
>> > >Sharing would be good, but the netvsc world would really like to have
>> > >only one visible network device.
>> > 
>> > Why do you mind? All would be the same, there would be just another
>> > netdevice unused by the vm user (same as the vf netdev).
>> > 
>> 
>> I mind because our requirement is no changes to userspace.
>> No special udev rules, no bonding script, no setup.
>
>Agreed. It is mostly fine from this point of view, except that you need
>to know to skip the slaves.  Maybe we could look at some kind of trick,
>e.g. pretending the link is down for the slaves?

:O Another hack. Please, don't.


>
>> Things like cloudinit running on current distros expect to see a single
>> eth0.  The VF device showing up can also be an issue, because distros
>> have stupid rules like Network Manager trying to start DHCP on every
>> interface.  We deal with that now by doing stuff like udev rules to get
>> it to stop, but that is still causing user errors.

So that means that with an extra netdev for "virtio_net bypass" you will
face exactly the same problems. Should not be an issue for you then.


>
>So the ideal of a single net device isn't achieved by netvsc.
>
>Since you have scripts to skip the PT device, can't they
>hide the PV slave too? How do they identify the device to skip?
>
>I agree it would be nice to have a way to hide the extra netdev
>from userspace.

"A hidden netdevice", hmm. I believe that instead of doing hacks like
this, we should fix userspace to treat particular netdevices correctly.


>
>The benefit of the separation is that each slave device can
>be configured with e.g. its own native ethtool commands for
>optimum performance.
>
>-- 
>MST

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-20 16:04     ` [virtio-dev] " Alexander Duyck
  (?)
  (?)
@ 2018-02-27  8:49     ` Jiri Pirko
  2018-02-27 21:16       ` Alexander Duyck
                         ` (3 more replies)
  -1 siblings, 4 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-27  8:49 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Siwei Liu,
	Netdev, David Miller

Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>used by hypervisor to indicate that virtio_net interface should act as
>>>a backup for another device with the same MAC address.
>>>
>>>Patch 2 is in response to the community request for a 3 netdev
>>>solution.  However, it creates some issues we'll get into in a moment.
>>>It extends virtio_net to use alternate datapath when available and
>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>an additional 'bypass' netdev that acts as a master device and controls
>>>2 slave devices.  The original virtio_net netdev is registered as
>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>associated with the same 'pci' device.  The user accesses the network
>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>as default for transmits when it is available with link up and running.
>>
>> Sorry, but this is ridiculous. You are apparently re-implementing part
>> of the bonding driver as part of a NIC driver. Bond and team drivers
>> are mature solutions, well tested, broadly used, with lots of issues
>> resolved in the past. What you are trying to introduce is a weird
>> shortcut that already has a couple of issues, as you mentioned, and
>> will certainly have many more. Also, I'm pretty sure that in the future
>> someone will come up with ideas like multiple VFs, LACP and similar
>> bonding things.
>
>The problem with the bond and team drivers is that they are too large
>and have too many interfaces available for configuration, so as a result
>they can really screw this interface up.
>
>Essentially this is meant to be a bond that is more-or-less managed by
>the host, not the guest. We want the host to be able to configure it
>and have it automatically kick in on the guest. For now we want to
>avoid adding too much complexity as this is meant to be just the first
>step. Trying to go in and implement the whole solution right from the
>start based on existing drivers is going to be a massive time sink and
>will likely never get completed due to the fact that there is always
>going to be some other thing that will interfere.
>
>My personal hope is that we can look at doing a virtio-bond sort of
>device that will handle all this as well as providing a communication
>channel, but that is much further down the road. For now we only have
>a single bit, so the goal is to keep this as simple as possible.

I have another usecase that would require the solution to be different
than what you suggest. Consider the following scenario:
- the baremetal has 2 sr-iov nics
- there is a vm with 1 VF from each nic: vf0, vf1. No virtio_net
- the baremetal would like to somehow tell the VM to bond vf0 and vf1
  together, and how this bonding should be configured, according to how
  the VF representors are configured on the baremetal (LACP for example)

The baremetal could decide to remove any VF during the VM runtime, or it
can add another VF. For migration, it can add virtio_net. The VM
should be instructed to bond all interfaces together according to what
the baremetal decided - as it knows better.

For this, we need a separate communication channel from the baremetal to
the VM (perhaps something reusable already exists), something to listen
to the events coming from this channel (kernel/userspace), and something
to react accordingly (create bond/team, enslave, etc.).
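
(Purely for illustration - a toy sketch of such a userspace listener.
The channel device path and the message format are invented here, not a
proposed ABI:)

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	char event[128], ifname[32], cmd[160];
	/* Assumed host-to-guest channel, e.g. a virtio-serial port. */
	FILE *chan = fopen("/dev/virtio-ports/net-bond-ctl", "r");

	if (!chan)
		return 1;

	/* Create the bond up front, in the host-dictated mode. */
	system("ip link add bond0 type bond mode active-backup");

	/* React to host events such as "enslave vf0". */
	while (fgets(event, sizeof(event), chan)) {
		if (sscanf(event, "enslave %31s", ifname) == 1) {
			snprintf(cmd, sizeof(cmd),
				 "ip link set %s down && "
				 "ip link set %s master bond0",
				 ifname, ifname);
			system(cmd);
		}
	}
	return 0;
}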

Now the question is: is it possible to merge the demands you have and
the generic needs I described into a single solution? From what I see,
that would be quite hard/impossible. So in the end, I think that we have
to end up with 2 solutions:
1) virtio_net, netvsc in-driver bonding - a very limited, stupid, 0config
   solution that works for all (no matter what OS you use in the VM)
2) a team/bond solution with the assistance of (preferably) a userspace
   daemon getting info from the baremetal. This is not 0config, but
   minimal config - the user just has to define that this "magic bonding"
   should be on. This covers all possible usecases, including multiple
   VFs, RDMA, etc.

Thoughts?

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-27  8:49     ` Jiri Pirko
@ 2018-02-27 21:16         ` Alexander Duyck
  2018-02-27 21:16         ` [virtio-dev] " Alexander Duyck
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 121+ messages in thread
From: Alexander Duyck @ 2018-02-27 21:16 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Sridhar Samudrala, Michael S. Tsirkin, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski, Jason Wang, Siwei Liu

On Tue, Feb 27, 2018 at 12:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck@gmail.com wrote:
>>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala@intel.com wrote:
>>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>>used by hypervisor to indicate that virtio_net interface should act as
>>>>a backup for another device with the same MAC address.
>>>>
>>>>Patch 2 is in response to the community request for a 3 netdev
>>>>solution.  However, it creates some issues we'll get into in a moment.
>>>>It extends virtio_net to use alternate datapath when available and
>>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>an additional 'bypass' netdev that acts as a master device and controls
>>>>2 slave devices.  The original virtio_net netdev is registered as
>>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>associated with the same 'pci' device.  The user accesses the network
>>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>as default for transmits when it is available with link up and running.
>>>
>>> Sorry, but this is ridiculous. You are apparently re-implementing part
>>> of the bonding driver as part of a NIC driver. The bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you are trying to introduce is a weird shortcut
>>> that already has a couple of issues, as you mentioned, and will certainly
>>> have many more. Also, I'm pretty sure that in the future someone will come
>>> up with ideas like multiple VFs, LACP and similar bonding things.
>>
>>The problem with the bond and team drivers is that they are too large and
>>have too many interfaces available for configuration, so as a result
>>they can really screw this interface up.
>>
>>Essentially this is meant to be a bond that is more-or-less managed by
>>the host, not the guest. We want the host to be able to configure it
>>and have it automatically kick in on the guest. For now we want to
>>avoid adding too much complexity as this is meant to be just the first
>>step. Trying to go in and implement the whole solution right from the
>>start based on existing drivers is going to be a massive time sink and
>>will likely never get completed due to the fact that there is always
>>going to be some other thing that will interfere.
>>
>>My personal hope is that we can look at doing a virtio-bond sort of
>>device that will handle all this as well as providing a communication
>>channel, but that is much further down the road. For now we only have
>>a single bit so the goal for now is trying to keep this as simple as
>>possible.
>
> I have another usecase that would require the solution to be different
> from what you suggest. Consider the following scenario:
> - baremetal has 2 SR-IOV NICs
> - there is a VM that has 1 VF from each NIC: vf0, vf1. No virtio_net
> - baremetal would like to somehow tell the VM to bond vf0 and vf1
>   together, and how this bonding should be configured, according to how
>   the VF representors are configured on the baremetal (LACP for example)
>
> The baremetal could decide to remove any VF during the VM runtime, or
> add another VF there. For migration, it can add virtio_net. The VM
> should be instructed to bond all interfaces together according to what
> the baremetal decided - as it knows better.
>
> For this we need a separate communication channel from baremetal to VM
> (perhaps something re-usable already exists), we need something to
> listen to the events coming from this channel (kernel/userspace) and to
> react accordingly (create bond/team, enslave, etc).
>
> Now the question is: is it possible to merge the demands you have and
> the generic needs I described into a single solution? From what I see,
> that would be quite hard/impossible. So in the end, I think that we have
> to end up with 2 solutions:
> 1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config
>    solution that works for all (no matter what OS you use in the VM)
> 2) team/bond solution with the assistance of, preferably, a userspace
>    daemon getting info from the baremetal. This is not 0config, but
>    minimal config - the user just has to define that this "magic bonding"
>    should be on. This covers all possible usecases, including multiple
>    VFs, RDMA, etc.
>
> Thoughts?

So that is about what I had in mind. We end up having to do something
completely different to support this more complex solution. I think we
might have referred to it as v2/v3 in a different thread, and
virt-bond in this thread.

Basically we need some sort of PCI or PCIe topology mapping for the
devices that can be translated into something we can communicate over
the communication channel. After that we also have the added
complexity of how do we figure out which Tx path we want to choose.
This is one of the reasons why I was thinking of something like an eBPF
blob that is handed up from the host side and into the guest to select
the Tx queue. That way when we add some new approach such as a
NUMA/cpu based netdev selection then we just provide an eBPF blob that
does that. Most of this is just theoretical at this point though since
I haven't had a chance to look into it too deeply yet. If you want to
take something like this on, the help would always be welcome. :)
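
Just to make the shape of the idea concrete, here is a minimal sketch of
what such a host-supplied program could look like; the attach point
("tx_select"), the map layout and the return convention are all invented
for illustration, none of this is a proposed ABI:

/* Hypothetical tx-path selector; no such hook exists today. */
#include <linux/bpf.h>
#include "bpf_helpers.h"	/* as in samples/bpf */

struct bpf_map_def SEC("maps") active_slave = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u32),	/* 0 = virtio 'backup', 1 = VF */
	.max_entries = 1,
};

SEC("tx_select")
int select_tx_path(struct __sk_buff *skb)
{
	__u32 key = 0;
	__u32 *slave = bpf_map_lookup_elem(&active_slave, &key);

	/* default to the virtio path until the host sets a policy;
	 * a NUMA/cpu based policy would just be a different program */
	return slave ? *slave : 0;
}

char _license[] SEC("license") = "GPL";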

The other thing I am looking at is trying to find a good way to do
dirty page tracking in the hypervisor using something like a
para-virtual IOMMU. However, I don't have any ETA on that, as I am just
starting out and have limited development time. If we get that in
place we can leave the VF in the guest until the very last moments
instead of having to remove it before we start the live migration.
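
No code for this yet, but the guest-side bookkeeping I have in mind is
little more than a bitmap that the para-virtual IOMMU driver sets on
DMA-write mappings and the hypervisor harvests; a toy sketch, with every
name here hypothetical:

#include <linux/bitops.h>
#include <linux/types.h>

/* Toy sketch only; everything here is hypothetical. */
static inline void viommu_mark_dirty(unsigned long *dirty_bitmap,
				     dma_addr_t iova, size_t len,
				     unsigned int page_shift)
{
	unsigned long pfn = iova >> page_shift;
	unsigned long last = (iova + len - 1) >> page_shift;

	for (; pfn <= last; pfn++)
		set_bit(pfn, dirty_bitmap);	/* harvested at migration */
}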

- Alex

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-27 21:16         ` [virtio-dev] " Alexander Duyck
@ 2018-02-27 21:23           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 121+ messages in thread
From: Michael S. Tsirkin @ 2018-02-27 21:23 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Siwei Liu, Netdev,
	David Miller

On Tue, Feb 27, 2018 at 01:16:21PM -0800, Alexander Duyck wrote:
> The other thing I am looking at is trying to find a good way to do
> dirty page tracking in the hypervisor using something like a
> para-virtual IOMMU. However, I don't have any ETA on that, as I am just
> starting out and have limited development time. If we get that in
> place we can leave the VF in the guest until the very last moments
> instead of having to remove it before we start the live migration.
> 
> - Alex

I actually think your old RFC would be a good starting point:
https://lkml.org/lkml/2016/1/5/104

What is missing, I think, is enabling/disabling it dynamically.

That seems easier than tracking by the hypervisor.

-- 
MST

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-27  8:49     ` Jiri Pirko
@ 2018-02-27 21:30         ` Michael S. Tsirkin
  2018-02-27 21:16         ` [virtio-dev] " Alexander Duyck
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 121+ messages in thread
From: Michael S. Tsirkin @ 2018-02-27 21:30 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexander Duyck, Sridhar Samudrala, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Duyck, Alexander H, Jakub Kicinski, Jason Wang, Siwei Liu

On Tue, Feb 27, 2018 at 09:49:59AM +0100, Jiri Pirko wrote:
> Now the question is: is it possible to merge the demands you have and
> the generic needs I described into a single solution? From what I see,
> that would be quite hard/impossible. So in the end, I think that we have
> to end up with 2 solutions:
> 1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config
>    solution that works for all (no matter what OS you use in the VM)
> 2) team/bond solution with the assistance of, preferably, a userspace
>    daemon getting info from the baremetal. This is not 0config, but
>    minimal config - the user just has to define that this "magic bonding"
>    should be on. This covers all possible usecases, including multiple
>    VFs, RDMA, etc.
> 
> Thoughts?

I think I agree. This RFC is trying to do 1 above.  Looks like we now
all agree that 1 and 2 are not exclusive; both have a place in the kernel.
Is that right?

-- 
MST

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-27 21:16         ` [virtio-dev] " Alexander Duyck
  (?)
  (?)
@ 2018-02-27 21:41         ` Jakub Kicinski
  2018-02-28  7:08           ` Jiri Pirko
  -1 siblings, 1 reply; 121+ messages in thread
From: Jakub Kicinski @ 2018-02-27 21:41 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Sridhar Samudrala, virtualization, Siwei Liu, Netdev,
	David Miller

On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
> Basically we need some sort of PCI or PCIe topology mapping for the
> devices that can be translated into something we can communicate over
> the communication channel. 

Hm.  This is probably a completely stupid idea, but if we need to
start marshalling configuration requests/hints maybe the entire problem
could be solved by opening a netlink socket from hypervisor?  Even make
teamd run on the hypervisor side...
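
For scale, the guest end of such a socket is tiny; a sketch of just the
setup, with everything beyond the plain genetlink socket (the actual
relay and the message format) deliberately left open:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

int main(void)
{
	struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);

	if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("genetlink");
		return 1;
	}
	/* a proxy would relay genetlink payloads between this socket
	 * and whatever channel the hypervisor provides */
	close(fd);
	return 0;
}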

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-27 21:41         ` Jakub Kicinski
@ 2018-02-28  7:08           ` Jiri Pirko
  2018-02-28 14:32               ` [virtio-dev] " Michael S. Tsirkin
  0 siblings, 1 reply; 121+ messages in thread
From: Jiri Pirko @ 2018-02-28  7:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Sridhar Samudrala, Alexander Duyck, virtualization, Siwei Liu,
	Netdev, David Miller

Tue, Feb 27, 2018 at 10:41:49PM CET, kubakici@wp.pl wrote:
>On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
>> Basically we need some sort of PCI or PCIe topology mapping for the
>> devices that can be translated into something we can communicate over
>> the communication channel. 
>
>Hm.  This is probably a completely stupid idea, but if we need to
>start marshalling configuration requests/hints maybe the entire problem
>could be solved by opening a netlink socket from hypervisor?  Even make
>teamd run on the hypervisor side...

Interesting. That would be trickier than just forwarding one genetlink
socket to the hypervisor.

Also, I think that the solution should handle multiple guest OSes. What
I'm thinking about is some generic bonding description passed over some
communication channel into the VM. The VM either uses it for
configuration, or ignores it if it is not smart enough/updated enough.
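
To sketch the flavor of "description, not commands" - all names and
fields below are made up:

#include <linux/types.h>

/* Made-up sketch of an OS-agnostic bonding description. A guest
 * that understands it configures bond/team accordingly; an older
 * guest simply ignores it. */
struct vm_bond_desc {
	__u16 version;		/* bumped on incompatible changes */
	__u16 mode;		/* e.g. active-backup, LACP, ... */
	__u16 nr_slaves;
	__u16 reserved;
	struct vm_bond_slave {
		__u8  mac[6];	/* slave identified by MAC ... */
		__u16 pci_bdf;	/* ... and/or by PCI bus/dev/fn */
	} slave[];
};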

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-28  7:08           ` Jiri Pirko
@ 2018-02-28 14:32               ` Michael S. Tsirkin
  0 siblings, 0 replies; 121+ messages in thread
From: Michael S. Tsirkin @ 2018-02-28 14:32 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski,
	Sridhar Samudrala, Alexander Duyck, virtualization, Siwei Liu,
	Netdev, David Miller

On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote:
> Tue, Feb 27, 2018 at 10:41:49PM CET, kubakici@wp.pl wrote:
> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
> >> Basically we need some sort of PCI or PCIe topology mapping for the
> >> devices that can be translated into something we can communicate over
> >> the communication channel. 
> >
> >Hm.  This is probably a completely stupid idea, but if we need to
> >start marshalling configuration requests/hints maybe the entire problem
> >could be solved by opening a netlink socket from hypervisor?  Even make
> >teamd run on the hypervisor side...
> 
> Interesting. That would be trickier than just forwarding one genetlink
> socket to the hypervisor.
> 
> Also, I think that the solution should handle multiple guest OSes. What
> I'm thinking about is some generic bonding description passed over some
> communication channel into the VM. The VM either uses it for
> configuration, or ignores it if it is not smart enough/updated enough.

For sure, we could build virtio-bond to pass that info to guests.

Such an advisory mechanism would not be a replacement for the mandatory
passthrough fallback flag proposed, but OTOH it's much more flexible.

-- 
MST

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-28 14:32               ` [virtio-dev] " Michael S. Tsirkin
  (?)
@ 2018-02-28 15:11               ` Jiri Pirko
  2018-02-28 15:45                 ` Michael S. Tsirkin
  2018-02-28 15:45                   ` [virtio-dev] " Michael S. Tsirkin
  -1 siblings, 2 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-28 15:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski,
	Sridhar Samudrala, Alexander Duyck, virtualization, Siwei Liu,
	Netdev, David Miller

Wed, Feb 28, 2018 at 03:32:44PM CET, mst@redhat.com wrote:
>On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote:
>> Tue, Feb 27, 2018 at 10:41:49PM CET, kubakici@wp.pl wrote:
>> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
>> >> Basically we need some sort of PCI or PCIe topology mapping for the
>> >> devices that can be translated into something we can communicate over
>> >> the communication channel. 
>> >
>> >Hm.  This is probably a completely stupid idea, but if we need to
>> >start marshalling configuration requests/hints maybe the entire problem
>> >could be solved by opening a netlink socket from hypervisor?  Even make
>> >teamd run on the hypervisor side...
>> 
>> Interesting. That would be trickier than just forwarding one genetlink
>> socket to the hypervisor.
>> 
>> Also, I think that the solution should handle multiple guest OSes. What
>> I'm thinking about is some generic bonding description passed over some
>> communication channel into the VM. The VM either uses it for
>> configuration, or ignores it if it is not smart enough/updated enough.
>
>For sure, we could build virtio-bond to pass that info to guests.

What do you mean by "virtio-bond"? A virtio_net extension?

>
>Such an advisory mechanism would not be a replacement for the mandatory
>passthrough fallback flag proposed, but OTOH it's much more flexible.
>
>-- 
>MST

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-28 15:11               ` Jiri Pirko
@ 2018-02-28 15:45                   ` Michael S. Tsirkin
  2018-02-28 15:45                   ` [virtio-dev] " Michael S. Tsirkin
  1 sibling, 0 replies; 121+ messages in thread
From: Michael S. Tsirkin @ 2018-02-28 15:45 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, Alexander Duyck, Sridhar Samudrala,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

On Wed, Feb 28, 2018 at 04:11:31PM +0100, Jiri Pirko wrote:
> Wed, Feb 28, 2018 at 03:32:44PM CET, mst@redhat.com wrote:
> >On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote:
> >> Tue, Feb 27, 2018 at 10:41:49PM CET, kubakici@wp.pl wrote:
> >> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
> >> >> Basically we need some sort of PCI or PCIe topology mapping for the
> >> >> devices that can be translated into something we can communicate over
> >> >> the communication channel. 
> >> >
> >> >Hm.  This is probably a completely stupid idea, but if we need to
> >> >start marshalling configuration requests/hints maybe the entire problem
> >> >could be solved by opening a netlink socket from hypervisor?  Even make
> >> >teamd run on the hypervisor side...
> >> 
> >> Interesting. That would be trickier than just forwarding one genetlink
> >> socket to the hypervisor.
> >> 
> >> Also, I think that the solution should handle multiple guest OSes. What
> >> I'm thinking about is some generic bonding description passed over some
> >> communication channel into the VM. The VM either uses it for
> >> configuration, or ignores it if it is not smart enough/updated enough.
> >
> >For sure, we could build virtio-bond to pass that info to guests.
> 
> What do you mean by "virtio-bond"? A virtio_net extension?

I mean a new device supplying topology information to guests,
with updates whenever VMs are started, stopped or migrated.

> >
> >Such an advisory mechanism would not be a replacement for the mandatory
> >passthrough fallback flag proposed, but OTOH it's much more flexible.
> >
> >-- 
> >MST

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-28 15:45                   ` [virtio-dev] " Michael S. Tsirkin
  (?)
@ 2018-02-28 19:25                   ` Jiri Pirko
  2018-02-28 20:48                     ` Michael S. Tsirkin
  2018-02-28 20:48                       ` [virtio-dev] " Michael S. Tsirkin
  -1 siblings, 2 replies; 121+ messages in thread
From: Jiri Pirko @ 2018-02-28 19:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski,
	Sridhar Samudrala, Alexander Duyck, virtualization, Siwei Liu,
	Netdev, David Miller

Wed, Feb 28, 2018 at 04:45:39PM CET, mst@redhat.com wrote:
>On Wed, Feb 28, 2018 at 04:11:31PM +0100, Jiri Pirko wrote:
>> Wed, Feb 28, 2018 at 03:32:44PM CET, mst@redhat.com wrote:
>> >On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote:
>> >> Tue, Feb 27, 2018 at 10:41:49PM CET, kubakici@wp.pl wrote:
>> >> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
>> >> >> Basically we need some sort of PCI or PCIe topology mapping for the
>> >> >> devices that can be translated into something we can communicate over
>> >> >> the communication channel. 
>> >> >
>> >> >Hm.  This is probably a completely stupid idea, but if we need to
>> >> >start marshalling configuration requests/hints maybe the entire problem
>> >> >could be solved by opening a netlink socket from hypervisor?  Even make
>> >> >teamd run on the hypervisor side...
>> >> 
>> >> Interesting. That would be trickier than just forwarding one genetlink
>> >> socket to the hypervisor.
>> >> 
>> >> Also, I think that the solution should handle multiple guest OSes. What
>> >> I'm thinking about is some generic bonding description passed over some
>> >> communication channel into the VM. The VM either uses it for
>> >> configuration, or ignores it if it is not smart enough/updated enough.
>> >
>> >For sure, we could build virtio-bond to pass that info to guests.
>> 
>> What do you mean by "virtio-bond"? A virtio_net extension?
>
>I mean a new device supplying topology information to guests,
>with updates whenever VMs are started, stopped or migrated.

Good. Any idea what that device would look like? Also, any idea how to
handle it in the kernel, and how to pass this info along to userspace?
Is there anything similar out there?

Thanks!

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  2018-02-28 19:25                   ` Jiri Pirko
@ 2018-02-28 20:48                       ` Michael S. Tsirkin
  2018-02-28 20:48                       ` [virtio-dev] " Michael S. Tsirkin
  1 sibling, 0 replies; 121+ messages in thread
From: Michael S. Tsirkin @ 2018-02-28 20:48 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, Alexander Duyck, Sridhar Samudrala,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Duyck, Alexander H, Jason Wang,
	Siwei Liu

On Wed, Feb 28, 2018 at 08:25:01PM +0100, Jiri Pirko wrote:
> Wed, Feb 28, 2018 at 04:45:39PM CET, mst@redhat.com wrote:
> >On Wed, Feb 28, 2018 at 04:11:31PM +0100, Jiri Pirko wrote:
> >> Wed, Feb 28, 2018 at 03:32:44PM CET, mst@redhat.com wrote:
> >> >On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote:
> >> >> Tue, Feb 27, 2018 at 10:41:49PM CET, kubakici@wp.pl wrote:
> >> >> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
> >> >> >> Basically we need some sort of PCI or PCIe topology mapping for the
> >> >> >> devices that can be translated into something we can communicate over
> >> >> >> the communication channel. 
> >> >> >
> >> >> >Hm.  This is probably a completely stupid idea, but if we need to
> >> >> >start marshalling configuration requests/hints maybe the entire problem
> >> >> >could be solved by opening a netlink socket from hypervisor?  Even make
> >> >> >teamd run on the hypervisor side...
> >> >> 
> >> Interesting. That would be trickier than just forwarding one genetlink
> >> socket to the hypervisor.
> >> 
> >> Also, I think that the solution should handle multiple guest OSes. What
> >> I'm thinking about is some generic bonding description passed over some
> >> communication channel into the VM. The VM either uses it for
> >> configuration, or ignores it if it is not smart enough/updated enough.
> >> >
> >> >For sure, we could build virtio-bond to pass that info to guests.
> >> 
> >> What do you mean by "virtio-bond"? A virtio_net extension?
> >
> >I mean a new device supplying topology information to guests,
> >with updates whenever VMs are started, stopped or migrated.
> 
> Good. Any idea what that device would look like? Also, any idea how to
> handle it in the kernel, and how to pass this info along to userspace?
> Is there anything similar out there?
> 
> Thanks!

E.g. the balloon device is used to pass hints about the amount of memory
the guest should use. We could do something similar.

I imagine the device can send a configuration interrupt
on each topology change. The kernel wakes up userspace pollers.
Userspace starts doing reads from a char device and
figures out what changed.

Which info is needed there? I am not sure.
How about a list of MAC/VLAN addresses coupled to a list of
devices to queue on (specified by MAC? by PCI address)?

Or do we ever need to go higher level and make decisions
based on IP addresses as well?
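
To make the char-device idea concrete: say one fixed-size record per
MAC/VLAN -> device mapping, re-read after each configuration interrupt.
The record layout and the /dev node name below are made up:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Illustrative only: device name and record layout are made up. */
struct topo_rec {
	uint8_t  mac[6];
	uint16_t vlan;
	uint32_t pci_bdf;	/* device to queue on */
};

int main(void)
{
	struct topo_rec rec;
	int fd = open("/dev/virtio-bond", O_RDONLY);

	if (fd < 0)
		return 1;
	while (read(fd, &rec, sizeof(rec)) == sizeof(rec))
		printf("vlan %u -> pci %04x\n", rec.vlan, rec.pci_bdf);
	close(fd);
	return 0;
}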

-- 
MST

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
@ 2018-02-16 18:11 Sridhar Samudrala
  0 siblings, 0 replies; 121+ messages in thread
From: Sridhar Samudrala @ 2018-02-16 18:11 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh

Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by hypervisor to indicate that virtio_net interface should act as
a backup for another device with the same MAC address.
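
To make the gating concrete, the guest-side check amounts to roughly the
sketch below; virtio_has_feature() is the standard feature test, while
virtnet_bypass_register() is a made-up name for the failover setup:

/* Probe-time sketch; virtnet_bypass_register() is a made-up name. */
if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
	virtnet_bypass_register(vi);	/* watch for a VF with our MAC */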

Patch 2 is in response to the community request for a 3 netdev
solution.  However, it creates some issues we'll get into in a moment.
It extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.

We noticed a couple of issues with this approach during testing.
- As both 'bypass' and 'backup' netdevs are associated with the same
  virtio pci device, udev tries to rename both of them with the same name
  and the 2nd rename will fail. This would be OK as long as the first netdev
  to be renamed is the 'bypass' netdev, but the order in which udev gets
  to rename the 2 netdevs is not reliable. 
- When the 'active' netdev is unplugged OR not present on a destination
  system after live migration, the user will see 2 virtio_net netdevs.

Patch 3 refactors many of the changes made in patch 2; this was done on
purpose just to show the solution we recommend as part of one patch set.
If we submit a final version of this, we would combine patches 2 and 3.
This patch removes the creation of an additional netdev. Instead, it
uses a new virtnet_bypass_info struct added to the original 'backup' netdev
to track the 'bypass' information and introduces an additional set of ndo and 
ethtool ops that are used when BACKUP feature is enabled.
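
Roughly, the state carried looks like the sketch below; the field names
are a guess at the shape, not the actual layout from patch 3:

/* Guessed shape of the bypass state; not the actual definition. */
struct virtnet_bypass_info {
	struct net_device __rcu	*active_netdev;	/* VF datapath */
	struct net_device	*backup_netdev;	/* original virtio path */
	/* stats accumulated across slaves that come and go */
	struct rtnl_link_stats64 active_stats;
	struct rtnl_link_stats64 backup_stats;
	spinlock_t		 stats_lock;
};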

One difference of the 3 netdev model compared to the 2 netdev model is that
the 'bypass' netdev is created with a 'noqueue' qdisc and marked 'NETIF_F_LLTX'.
This avoids going through an additional qdisc and acquiring an additional
qdisc lock and tx lock during transmits.
If we can replace the qdisc of the virtio netdev dynamically, it should be
possible to get these optimizations enabled even with the 2 netdev model
when the BACKUP feature is enabled.
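
For reference, in the 3 netdev model that amounts to something like this
in the master's setup routine (a sketch; IFF_NO_QUEUE and NETIF_F_LLTX
are the real kernel flags, the function name is made up):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

/* Sketch of how the 'bypass' master gets its lockless tx path. */
static void virtnet_bypass_setup(struct net_device *dev)
{
	ether_setup(dev);
	dev->priv_flags |= IFF_NO_QUEUE;	/* 'noqueue' qdisc */
	dev->features	|= NETIF_F_LLTX;	/* no qdisc/tx lock on xmit */
}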

As this patch series initially focuses on use cases where the
hypervisor fully controls the VM networking and the guest is not
expected to directly configure any hardware settings, it doesn't expose
all the ndo/ethtool ops that virtio_net supports at this time. To
support additional use cases, it should be possible to enable more ops
later by caching the requested state in the virtio netdev and replaying
it when the 'active' netdev gets registered, as sketched below.
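
One possible shape for that caching, purely illustrative:

  #include <linux/ethtool.h>
  #include <linux/netdevice.h>

  /* Sketch: remember a guest-requested setting while no VF is present
   * and replay it once the 'active' netdev registers.
   */
  struct virtnet_cached_cfg {
          bool have_ring;
          struct ethtool_ringparam ring;
  };

  static void virtnet_replay_cfg(struct virtnet_cached_cfg *cfg,
                                 struct net_device *active)
  {
          if (cfg->have_ring && active->ethtool_ops &&
              active->ethtool_ops->set_ringparam)
                  active->ethtool_ops->set_ringparam(active, &cfg->ring);
  }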
 
The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a
VF is plugged in, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF
device from the guest on the source host and reset the MAC filter of
the VF to initiate failover of the datapath to virtio before starting
the migration. After the migration completes, the destination
hypervisor sets the MAC filter on the VF and plugs it back into the
guest to switch over to the VF datapath.

This patch series is based on the discussion initiated by Jesse in
this thread:
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Sridhar Samudrala (3):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  virtio_net: Extend virtio to use VF datapath when available
  virtio_net: Enable alternate datapath without creating an additional
    netdev

 drivers/net/virtio_net.c        | 564 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_net.h |   3 +
 2 files changed, 563 insertions(+), 4 deletions(-)

-- 
2.14.3


end of thread (newest message: 2018-02-28 20:48 UTC)

Thread overview: 121+ messages
2018-02-16 18:11 [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device Sridhar Samudrala
2018-02-16 18:11 ` [virtio-dev] " Sridhar Samudrala
2018-02-16 18:11 ` [RFC PATCH v3 1/3] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit Sridhar Samudrala
2018-02-16 18:11 ` Sridhar Samudrala
2018-02-16 18:11   ` [virtio-dev] " Sridhar Samudrala
2018-02-16 18:11 ` [RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available Sridhar Samudrala
2018-02-16 18:11 ` Sridhar Samudrala
2018-02-16 18:11   ` [virtio-dev] " Sridhar Samudrala
2018-02-17  3:04   ` Jakub Kicinski
2018-02-17 17:41     ` Alexander Duyck
2018-02-17  3:04   ` Jakub Kicinski
2018-02-16 18:11 ` [RFC PATCH v3 3/3] virtio_net: Enable alternate datapath without creating an additional netdev Sridhar Samudrala
2018-02-16 18:11   ` [virtio-dev] " Sridhar Samudrala
2018-02-16 18:11 ` Sridhar Samudrala
2018-02-17  2:38 ` [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device Jakub Kicinski
2018-02-17  2:38 ` Jakub Kicinski
2018-02-17 17:12   ` Alexander Duyck
2018-02-17 17:12     ` [virtio-dev] " Alexander Duyck
2018-02-19  6:11     ` Jakub Kicinski
2018-02-20 16:26       ` Samudrala, Sridhar
2018-02-20 16:26         ` [virtio-dev] " Samudrala, Sridhar
2018-02-20 16:26       ` Samudrala, Sridhar
2018-02-21 23:50     ` Siwei Liu
2018-02-21 23:50       ` [virtio-dev] " Siwei Liu
2018-02-22  0:17       ` Alexander Duyck
2018-02-22  0:17       ` Alexander Duyck
2018-02-22  0:17         ` [virtio-dev] " Alexander Duyck
2018-02-22  1:59         ` Siwei Liu
2018-02-22  1:59         ` Siwei Liu
2018-02-22  1:59           ` [virtio-dev] " Siwei Liu
2018-02-22  2:35           ` Samudrala, Sridhar
2018-02-22  2:35           ` Samudrala, Sridhar
2018-02-22  2:35             ` [virtio-dev] " Samudrala, Sridhar
2018-02-22  3:28             ` Samudrala, Sridhar
2018-02-22  3:28               ` [virtio-dev] " Samudrala, Sridhar
2018-02-23 22:22             ` Siwei Liu
2018-02-23 22:22               ` [virtio-dev] " Siwei Liu
2018-02-23 22:38               ` Jiri Pirko
2018-02-24  0:17                 ` Siwei Liu
2018-02-24  0:17                   ` [virtio-dev] " Siwei Liu
2018-02-24  0:03         ` Stephen Hemminger
2018-02-25 22:17           ` Alexander Duyck
2018-02-25 22:17             ` [virtio-dev] " Alexander Duyck
2018-02-25 22:17           ` Alexander Duyck
2018-02-21 23:50     ` Siwei Liu
2018-02-17 17:12   ` Alexander Duyck
2018-02-20 10:42 ` Jiri Pirko
2018-02-20 16:04   ` Alexander Duyck
2018-02-20 16:04     ` [virtio-dev] " Alexander Duyck
2018-02-20 16:29     ` Jiri Pirko
2018-02-20 17:14       ` Samudrala, Sridhar
2018-02-20 17:14         ` [virtio-dev] " Samudrala, Sridhar
2018-02-20 20:14         ` Jiri Pirko
2018-02-20 21:02           ` Alexander Duyck
2018-02-20 21:02             ` [virtio-dev] " Alexander Duyck
2018-02-20 21:02           ` Alexander Duyck
2018-02-20 22:33           ` Jakub Kicinski
2018-02-21  9:51             ` Jiri Pirko
2018-02-21 15:56               ` Alexander Duyck
2018-02-21 15:56                 ` [virtio-dev] " Alexander Duyck
2018-02-21 16:11                 ` Jiri Pirko
2018-02-21 16:49                   ` Alexander Duyck
2018-02-21 16:49                     ` [virtio-dev] " Alexander Duyck
2018-02-21 16:58                     ` Jiri Pirko
2018-02-21 17:56                       ` Alexander Duyck
2018-02-21 17:56                       ` Alexander Duyck
2018-02-21 17:56                         ` [virtio-dev] " Alexander Duyck
2018-02-21 19:38                         ` Jiri Pirko
2018-02-21 20:57                           ` Alexander Duyck
2018-02-21 20:57                             ` [virtio-dev] " Alexander Duyck
2018-02-22  2:02                             ` Jakub Kicinski
2018-02-22  2:15                               ` Samudrala, Sridhar
2018-02-22  2:15                                 ` [virtio-dev] " Samudrala, Sridhar
2018-02-22  2:15                               ` Samudrala, Sridhar
2018-02-22  8:11                             ` Jiri Pirko
2018-02-22 11:54                               ` Or Gerlitz
2018-02-22 13:07                                 ` Jiri Pirko
2018-02-22 15:30                                   ` Alexander Duyck
2018-02-22 15:30                                     ` [virtio-dev] " Alexander Duyck
2018-02-22 21:30                               ` Alexander Duyck
2018-02-22 21:30                                 ` [virtio-dev] " Alexander Duyck
2018-02-23 23:59                                 ` Stephen Hemminger
2018-02-25 22:21                                   ` Alexander Duyck
2018-02-25 22:21                                   ` Alexander Duyck
2018-02-25 22:21                                     ` [virtio-dev] " Alexander Duyck
2018-02-26  7:19                                   ` Jiri Pirko
2018-02-27  1:02                                     ` Stephen Hemminger
2018-02-27  1:18                                       ` Michael S. Tsirkin
2018-02-27  1:18                                         ` [virtio-dev] " Michael S. Tsirkin
2018-02-27  8:27                                         ` Jiri Pirko
2018-02-22 21:30                               ` Alexander Duyck
2018-02-21 20:57                           ` Alexander Duyck
2018-02-21 16:49                   ` Alexander Duyck
2018-02-21 15:56               ` Alexander Duyck
2018-02-20 17:14       ` Samudrala, Sridhar
2018-02-20 17:23       ` Alexander Duyck
2018-02-20 17:23         ` [virtio-dev] " Alexander Duyck
2018-02-20 19:53         ` Jiri Pirko
2018-02-27  8:49     ` Jiri Pirko
2018-02-27 21:16       ` Alexander Duyck
2018-02-27 21:16       ` Alexander Duyck
2018-02-27 21:16         ` [virtio-dev] " Alexander Duyck
2018-02-27 21:23         ` Michael S. Tsirkin
2018-02-27 21:23           ` [virtio-dev] " Michael S. Tsirkin
2018-02-27 21:41         ` Jakub Kicinski
2018-02-28  7:08           ` Jiri Pirko
2018-02-28 14:32             ` Michael S. Tsirkin
2018-02-28 14:32               ` [virtio-dev] " Michael S. Tsirkin
2018-02-28 15:11               ` Jiri Pirko
2018-02-28 15:45                 ` Michael S. Tsirkin
2018-02-28 15:45                 ` Michael S. Tsirkin
2018-02-28 15:45                   ` [virtio-dev] " Michael S. Tsirkin
2018-02-28 19:25                   ` Jiri Pirko
2018-02-28 20:48                     ` Michael S. Tsirkin
2018-02-28 20:48                     ` Michael S. Tsirkin
2018-02-28 20:48                       ` [virtio-dev] " Michael S. Tsirkin
2018-02-27 21:30       ` Michael S. Tsirkin
2018-02-27 21:30         ` [virtio-dev] " Michael S. Tsirkin
2018-02-27 21:30       ` Michael S. Tsirkin
2018-02-20 16:04   ` Alexander Duyck
2018-02-16 18:11 Sridhar Samudrala
